* [PATCH v6 0/5] fuse: remove temp page copies in writeback
From: Joanne Koong @ 2024-11-22 23:23 UTC
To: miklos, linux-fsdevel
Cc: shakeel.butt, jefflexu, josef, bernd.schubert, linux-mm, kernel-team

The purpose of this patchset is to help make writeback-cache write
performance in FUSE filesystems as fast as possible.

In the current FUSE writeback design (see commit 3be5a52b30aa
("fuse: support writable mmap")), a temp page is allocated for every dirty
page to be written back, the contents of the dirty page are copied over to the
temp page, and the temp page gets handed to the server to write back. This is
done so that writeback may be immediately cleared on the dirty page, and this
in turn is done for two reasons:
a) in order to mitigate the following deadlock scenario that may arise if
reclaim waits on writeback on the dirty page to complete (more details can be
found in this thread [1]):
* single-threaded FUSE server is in the middle of handling a request
  that needs a memory allocation
* memory allocation triggers direct reclaim
* direct reclaim waits on a folio under writeback
* the FUSE server can't write back the folio since it's stuck in
  direct reclaim
b) in order to unblock internal (eg sync, page compaction) waits on writeback
without needing the server to complete writing back to disk, which may take
an indeterminate amount of time.

Allocating and copying dirty pages to temp pages is the biggest performance
bottleneck for FUSE writeback. This patchset aims to get rid of the temp page
altogether (which will also allow us to get rid of the internal FUSE rb tree
that is needed to keep track of writeback status on the temp pages).
Benchmarks show approximately a 20% improvement in throughput for 4k
block-size writes and a 45% improvement for 1M block-size writes.

With the temp page removed, writeback state is now only cleared on the dirty
page after the server has written it back to disk. This may take an
indeterminate amount of time. There is also the possibility of malicious or
well-intentioned but buggy servers where writeback may, in the worst case,
never complete. This means that any folio_wait_writeback() on a dirty page
belonging to a FUSE filesystem needs to be carefully audited.

In particular, these are the cases that need to be accounted for:
* potentially deadlocking in reclaim, as mentioned above
* potentially stalling sync(2)
* potentially stalling page migration / compaction

This patchset adds a new mapping flag, AS_WRITEBACK_INDETERMINATE, which
filesystems may set on their inode mappings to indicate that writeback
operations may take an indeterminate amount of time to complete. FUSE will set
this flag on its mappings. This patchset adds checks to the critical parts of
reclaim, sync, and page migration logic where writeback may be waited on.

Please note the following:
* For sync(2), waiting on writeback will be skipped for FUSE, but this has no
  effect on existing behavior. Dirty FUSE pages are already not guaranteed to
  be written to disk by the time sync(2) returns (eg writeback is cleared on
  the dirty page but the server may not have written out the temp page to disk
  yet). If the caller wishes to ensure the data has actually been synced to
  disk, they should use fsync(2)/fdatasync(2) instead.
* AS_WRITEBACK_INDETERMINATE does not indicate that the folios should never be
  waited on when in writeback. There are some cases where the wait is
  desirable. For example, for the sync_file_range() syscall, it is fine to
  wait on the writeback since the caller passes in a fd for the operation.

[1] https://lore.kernel.org/linux-kernel/495d2400-1d96-4924-99d3-8b2952e05fc3@linux.alibaba.com/

Changelog
---------
v5:
https://lore.kernel.org/linux-fsdevel/20241115224459.427610-1-joannelkoong@gmail.com/
Changes from v5 -> v6:
* Add Shakeel and Jingbo's reviewed-bys
* Move folio_end_writeback() to fuse_writepage_finish() (Jingbo)
* Embed fuse_writepage_finish_stat() logic inline (Jingbo)
* Remove node_stat NR_WRITEBACK inc/sub (Jingbo)

v4:
https://lore.kernel.org/linux-fsdevel/20241107235614.3637221-1-joannelkoong@gmail.com/
Changes from v4 -> v5:
* AS_WRITEBACK_MAY_BLOCK -> AS_WRITEBACK_INDETERMINATE (Shakeel)
* Drop memory hotplug patch (David and Shakeel)
* Remove some more unnecessary writeback waits in fuse code (Jingbo)
* Make commit message for reclaim patch more concise - drop part about
  deadlock and just focus on how it may stall waits

v3:
https://lore.kernel.org/linux-fsdevel/20241107191618.2011146-1-joannelkoong@gmail.com/
Changes from v3 -> v4:
* Use filemap_fdatawait_range() instead of filemap_range_has_writeback() in
  readahead

v2:
https://lore.kernel.org/linux-fsdevel/20241014182228.1941246-1-joannelkoong@gmail.com/
Changes from v2 -> v3:
* Account for sync and page migration cases as well (Miklos)
* Change AS_NO_WRITEBACK_RECLAIM to the more generic AS_WRITEBACK_MAY_BLOCK
* For fuse inodes, set mapping_writeback_may_block only if fc->writeback_cache
  is enabled

v1:
https://lore.kernel.org/linux-fsdevel/20241011223434.1307300-1-joannelkoong@gmail.com/T/#t
Changes from v1 -> v2:
* Have flag in "enum mapping_flags" instead of creating asop_flags (Shakeel)
* Set fuse inodes to use AS_NO_WRITEBACK_RECLAIM (Shakeel)

Joanne Koong (5):
  mm: add AS_WRITEBACK_INDETERMINATE mapping flag
  mm: skip reclaiming folios in legacy memcg writeback indeterminate
    contexts
  fs/writeback: in wait_sb_inodes(), skip wait for
    AS_WRITEBACK_INDETERMINATE mappings
  mm/migrate: skip migrating folios under writeback with
    AS_WRITEBACK_INDETERMINATE mappings
  fuse: remove tmp folio for writebacks and internal rb tree

 fs/fs-writeback.c       |   3 +
 fs/fuse/file.c          | 360 ++++------------------------------------
 fs/fuse/fuse_i.h        |   3 -
 include/linux/pagemap.h |  11 ++
 mm/migrate.c            |   5 +-
 mm/vmscan.c             |  10 +-
 6 files changed, 53 insertions(+), 339 deletions(-)

-- 
2.43.5

* [PATCH v6 1/5] mm: add AS_WRITEBACK_INDETERMINATE mapping flag
From: Joanne Koong @ 2024-11-22 23:23 UTC
To: miklos, linux-fsdevel
Cc: shakeel.butt, jefflexu, josef, bernd.schubert, linux-mm, kernel-team

Add a new mapping flag AS_WRITEBACK_INDETERMINATE which filesystems may
set to indicate that writing back to disk may take an indeterminate
amount of time to complete. Extra caution should be taken when waiting
on writeback for folios belonging to mappings where this flag is set.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
---
 include/linux/pagemap.h | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 68a5f1ff3301..fcf7d4dd7e2b 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -210,6 +210,7 @@ enum mapping_flags {
 	AS_STABLE_WRITES = 7,	/* must wait for writeback before modifying
 				   folio contents */
 	AS_INACCESSIBLE = 8,	/* Do not attempt direct R/W access to the mapping */
+	AS_WRITEBACK_INDETERMINATE = 9, /* Use caution when waiting on writeback */
 	/* Bits 16-25 are used for FOLIO_ORDER */
 	AS_FOLIO_ORDER_BITS = 5,
 	AS_FOLIO_ORDER_MIN = 16,
@@ -335,6 +336,16 @@ static inline bool mapping_inaccessible(struct address_space *mapping)
 	return test_bit(AS_INACCESSIBLE, &mapping->flags);
 }
 
+static inline void mapping_set_writeback_indeterminate(struct address_space *mapping)
+{
+	set_bit(AS_WRITEBACK_INDETERMINATE, &mapping->flags);
+}
+
+static inline bool mapping_writeback_indeterminate(struct address_space *mapping)
+{
+	return test_bit(AS_WRITEBACK_INDETERMINATE, &mapping->flags);
+}
+
 static inline gfp_t mapping_gfp_mask(struct address_space * mapping)
 {
 	return mapping->gfp_mask;
-- 
2.43.5

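For readers who want to poke at the flag semantics outside the kernel, here is a minimal userspace sketch of the two helpers from the patch above. It is not kernel code: struct mapping_model and the model_* names are hypothetical stand-ins, and the bit operations are plain (non-atomic) shifts rather than the kernel's atomic set_bit()/test_bit(). Only the bit numbers are taken from the hunk.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct address_space; only the flags word is modeled. */
struct mapping_model {
	unsigned long flags;
};

/* Bit numbers as they appear in the enum mapping_flags hunk of the patch. */
enum {
	MODEL_AS_INACCESSIBLE = 8,
	MODEL_AS_WRITEBACK_INDETERMINATE = 9,
};

/* Non-atomic analogue of mapping_set_writeback_indeterminate(). */
static inline void model_set_writeback_indeterminate(struct mapping_model *m)
{
	m->flags |= 1UL << MODEL_AS_WRITEBACK_INDETERMINATE;
}

/* Non-atomic analogue of mapping_writeback_indeterminate(). */
static inline bool model_writeback_indeterminate(const struct mapping_model *m)
{
	return (m->flags >> MODEL_AS_WRITEBACK_INDETERMINATE) & 1UL;
}
```

A caller would set the flag once when the mapping is created and query it wherever a writeback wait is about to happen; the later patches in the series do the querying in reclaim, sync, and migration.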
* [PATCH v6 2/5] mm: skip reclaiming folios in legacy memcg writeback indeterminate contexts
From: Joanne Koong @ 2024-11-22 23:23 UTC
To: miklos, linux-fsdevel
Cc: shakeel.butt, jefflexu, josef, bernd.schubert, linux-mm, kernel-team

Currently in shrink_folio_list(), reclaim for folios under writeback
falls into 3 different cases:
1) Reclaim is encountering an excessive number of folios under
   writeback and this folio has both the writeback and reclaim flags
   set
2) Dirty throttling is enabled (this happens if reclaim through cgroup
   is not enabled, if reclaim through cgroupv2 memcg is enabled, or
   if reclaim is on the root cgroup), or if the folio is not marked for
   immediate reclaim, or if the caller does not have __GFP_FS (or
   __GFP_IO if it's going to swap) set
3) Legacy cgroupv1 encounters a folio that already has the reclaim flag
   set and the caller did not have __GFP_FS (or __GFP_IO if swap) set

In cases 1) and 2), we activate the folio and skip reclaiming it, while
in case 3), we wait for writeback to finish on the folio and then try
to reclaim the folio again. In case 3), we wait on writeback because
cgroupv1 does not have dirty folio throttling; as such, this is a
mitigation against the case where there are too many folios in writeback
with nothing else to reclaim.

For filesystems where writeback may take an indeterminate amount of time
to write to disk, this has the possibility of stalling reclaim.

In this commit, if legacy memcg encounters a folio with the reclaim flag
set (i.e. case 3) and the folio belongs to a mapping that has the
AS_WRITEBACK_INDETERMINATE flag set, the folio will be activated and skip
reclaim (i.e. default to the behavior in case 2) instead.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
---
 mm/vmscan.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 749cdc110c74..37ce6b6dac06 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1129,8 +1129,9 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 		 * 2) Global or new memcg reclaim encounters a folio that is
 		 *    not marked for immediate reclaim, or the caller does not
 		 *    have __GFP_FS (or __GFP_IO if it's simply going to swap,
-		 *    not to fs). In this case mark the folio for immediate
-		 *    reclaim and continue scanning.
+		 *    not to fs), or the writeback may take an indeterminate
+		 *    amount of time to complete. In this case mark the folio
+		 *    for immediate reclaim and continue scanning.
 		 *
 		 * Require may_enter_fs() because we would wait on fs, which
 		 * may not have submitted I/O yet. And the loop driver might
@@ -1155,6 +1156,8 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 		 * takes to write them to disk.
 		 */
 		if (folio_test_writeback(folio)) {
+			mapping = folio_mapping(folio);
+
 			/* Case 1 above */
 			if (current_is_kswapd() &&
 			    folio_test_reclaim(folio) &&
@@ -1165,7 +1168,8 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 			/* Case 2 above */
 			} else if (writeback_throttling_sane(sc) ||
 			    !folio_test_reclaim(folio) ||
-			    !may_enter_fs(folio, sc->gfp_mask)) {
+			    !may_enter_fs(folio, sc->gfp_mask) ||
+			    (mapping && mapping_writeback_indeterminate(mapping))) {
 				/*
 				 * This is slightly racy -
 				 * folio_end_writeback() might have
-- 
2.43.5

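The three-way split described in the commit message can be sketched as a small decision function. This is an illustrative userspace model only: the struct, the enum, and the function name are hypothetical, the real code in shrink_folio_list() operates on folios and scan_control rather than on booleans, and the excessive_writeback field is an assumed stand-in for whatever node-level condition the kernel checks in case 1.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcome labels; the kernel uses gotos rather than an enum. */
enum wb_reclaim_action {
	WB_ACTIVATE_CASE1,	/* excessive writeback seen by kswapd: activate, skip */
	WB_ACTIVATE_CASE2,	/* mark for immediate reclaim, activate, keep scanning */
	WB_WAIT_CASE3,		/* legacy memcg: wait on writeback, then retry */
};

/* Hypothetical flattened inputs standing in for the kernel predicates. */
struct wb_reclaim_ctx {
	bool is_kswapd;			/* current_is_kswapd() */
	bool excessive_writeback;	/* assumed node-level "too much writeback" state */
	bool folio_reclaim_flag;	/* folio_test_reclaim(folio) */
	bool throttling_sane;		/* writeback_throttling_sane(sc) */
	bool may_enter_fs;		/* may_enter_fs(folio, sc->gfp_mask) */
	bool mapping_indeterminate;	/* mapping && mapping_writeback_indeterminate(mapping) */
};

/* Classify a folio that is already known to be under writeback. */
static inline enum wb_reclaim_action
classify_writeback_folio(const struct wb_reclaim_ctx *c)
{
	if (c->is_kswapd && c->folio_reclaim_flag && c->excessive_writeback)
		return WB_ACTIVATE_CASE1;

	/* The last term is the condition this patch adds to case 2. */
	if (c->throttling_sane || !c->folio_reclaim_flag ||
	    !c->may_enter_fs || c->mapping_indeterminate)
		return WB_ACTIVATE_CASE2;

	return WB_WAIT_CASE3;
}
```

With this model, a legacy-memcg folio that would previously have landed in case 3 moves to case 2 as soon as mapping_indeterminate is true, which is the only behavioral change the patch intends.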
* [PATCH v6 3/5] fs/writeback: in wait_sb_inodes(), skip wait for AS_WRITEBACK_INDETERMINATE mappings
From: Joanne Koong @ 2024-11-22 23:23 UTC
To: miklos, linux-fsdevel
Cc: shakeel.butt, jefflexu, josef, bernd.schubert, linux-mm, kernel-team

For filesystems with the AS_WRITEBACK_INDETERMINATE flag set, writeback
operations may take an indeterminate time to complete. For example, writing
data back to disk in FUSE filesystems depends on the userspace server
successfully completing writeback.

In this commit, wait_sb_inodes() skips waiting on writeback if the
inode's mapping has AS_WRITEBACK_INDETERMINATE set, else sync(2) may take an
indeterminate amount of time to complete.

If the caller wishes to ensure the data for a mapping with the
AS_WRITEBACK_INDETERMINATE flag set has actually been written back to disk,
they should use fsync(2)/fdatasync(2) instead.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com>
---
 fs/fs-writeback.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index d8bec3c1bb1f..ad192db17ce4 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -2659,6 +2659,9 @@ static void wait_sb_inodes(struct super_block *sb)
 		if (!mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK))
 			continue;
 
+		if (mapping_writeback_indeterminate(mapping))
+			continue;
+
 		spin_unlock_irq(&sb->s_inode_wblist_lock);
 
 		spin_lock(&inode->i_lock);
-- 
2.43.5

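From the application side, the advice above (use fsync(2)/fdatasync(2) rather than relying on sync(2)) might look like the following sketch. The helper name write_durably() is hypothetical and the error handling is deliberately simple; only open(2), write(2), fsync(2), and close(2) are real interfaces. Whether the data reaches stable storage on a FUSE mount still depends on what the userspace server does with the request.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write len bytes to path, then fsync(2) the file.
 * Returns 0 on success and -1 on any failure.
 */
static inline int write_durably(const char *path, const void *buf, size_t len)
{
	const char *p = buf;
	int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

	if (fd < 0)
		return -1;

	while (len > 0) {
		ssize_t n = write(fd, p, len);

		if (n < 0 && errno == EINTR)
			continue;	/* interrupted before writing anything: retry */
		if (n <= 0) {
			close(fd);
			return -1;
		}
		p += n;
		len -= (size_t)n;
	}

	/* Per-file wait for writeback; unlike sync(2), this is not skipped. */
	if (fsync(fd) < 0) {
		close(fd);
		return -1;
	}

	return close(fd);
}
```

fdatasync(2) could be substituted for fsync(2) when only the file data, and not all of its metadata, needs to be flushed.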
* [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
From: Joanne Koong @ 2024-11-22 23:23 UTC
To: miklos, linux-fsdevel
Cc: shakeel.butt, jefflexu, josef, bernd.schubert, linux-mm, kernel-team

For migrations called in MIGRATE_SYNC mode, skip migrating the folio if
it is under writeback and has the AS_WRITEBACK_INDETERMINATE flag set on its
mapping. If the AS_WRITEBACK_INDETERMINATE flag is set on the mapping, the
writeback may take an indeterminate amount of time to complete, and
waits may get stuck.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
---
 mm/migrate.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index df91248755e4..fe73284e5246 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1260,7 +1260,10 @@ static int migrate_folio_unmap(new_folio_t get_new_folio,
 		 */
 		switch (mode) {
 		case MIGRATE_SYNC:
-			break;
+			if (!src->mapping ||
+			    !mapping_writeback_indeterminate(src->mapping))
+				break;
+			fallthrough;
 		default:
 			rc = -EBUSY;
 			goto out;
-- 
2.43.5

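The effect of the modified switch can be modeled in isolation as below. This is a sketch under stated assumptions: the enum and function names are hypothetical, the model covers only the switch statement shown in the hunk (not the rest of migrate_folio_unmap()), and it assumes the folio is already known to be under writeback.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical mirror of the migration modes relevant to the switch. */
enum migrate_mode_model {
	MODEL_MIGRATE_ASYNC,
	MODEL_MIGRATE_SYNC_LIGHT,
	MODEL_MIGRATE_SYNC,
};

/*
 * Returns 0 if the caller may go on to wait for writeback on the folio,
 * or -EBUSY if the folio should be skipped instead.
 */
static inline int writeback_wait_allowed(enum migrate_mode_model mode,
					 bool has_mapping,
					 bool mapping_indeterminate)
{
	/* Only full MIGRATE_SYNC was ever allowed to wait on writeback. */
	if (mode != MODEL_MIGRATE_SYNC)
		return -EBUSY;

	/* New in this patch: do not wait when writeback has no time bound. */
	if (has_mapping && mapping_indeterminate)
		return -EBUSY;

	return 0;
}
```

The early returns are equivalent to the break/fallthrough structure in the patch; they are used here only to keep the sketch free of fallthrough annotations.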
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
From: David Hildenbrand @ 2024-12-19 13:05 UTC
To: Joanne Koong, miklos, linux-fsdevel
Cc: shakeel.butt, jefflexu, josef, bernd.schubert, linux-mm, kernel-team, Matthew Wilcox, Zi Yan, Oscar Salvador, Michal Hocko

On 23.11.24 00:23, Joanne Koong wrote:
> For migrations called in MIGRATE_SYNC mode, skip migrating the folio if
> it is under writeback and has the AS_WRITEBACK_INDETERMINATE flag set on its
> mapping. If the AS_WRITEBACK_INDETERMINATE flag is set on the mapping, the
> writeback may take an indeterminate amount of time to complete, and
> waits may get stuck.
>
> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
> ---
>  mm/migrate.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index df91248755e4..fe73284e5246 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1260,7 +1260,10 @@ static int migrate_folio_unmap(new_folio_t get_new_folio,
>  		 */
>  		switch (mode) {
>  		case MIGRATE_SYNC:
> -			break;
> +			if (!src->mapping ||
> +			    !mapping_writeback_indeterminate(src->mapping))
> +				break;
> +			fallthrough;
>  		default:
>  			rc = -EBUSY;
>  			goto out;

Ehm, doesn't this mean that any fuse user can essentially completely
block CMA allocations, memory compaction, memory hotunplug, memory
poisoning... ?!

That sounds very bad.

-- 
Cheers,

David / dhildenb

* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
From: Zi Yan @ 2024-12-19 14:19 UTC
To: David Hildenbrand, Joanne Koong
Cc: miklos, linux-fsdevel, shakeel.butt, jefflexu, josef, bernd.schubert, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko

On 19 Dec 2024, at 8:05, David Hildenbrand wrote:

> On 23.11.24 00:23, Joanne Koong wrote:
>> For migrations called in MIGRATE_SYNC mode, skip migrating the folio if
>> it is under writeback and has the AS_WRITEBACK_INDETERMINATE flag set on its
>> mapping. If the AS_WRITEBACK_INDETERMINATE flag is set on the mapping, the
>> writeback may take an indeterminate amount of time to complete, and
>> waits may get stuck.
>>
>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
>> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
>> ---
>>  mm/migrate.c | 5 ++++-
>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/migrate.c b/mm/migrate.c
>> index df91248755e4..fe73284e5246 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>> @@ -1260,7 +1260,10 @@ static int migrate_folio_unmap(new_folio_t get_new_folio,
>>  		 */
>>  		switch (mode) {
>>  		case MIGRATE_SYNC:
>> -			break;
>> +			if (!src->mapping ||
>> +			    !mapping_writeback_indeterminate(src->mapping))
>> +				break;
>> +			fallthrough;
>>  		default:
>>  			rc = -EBUSY;
>>  			goto out;
>
> Ehm, doesn't this mean that any fuse user can essentially completely block CMA allocations, memory compaction, memory hotunplug, memory poisoning... ?!
>
> That sounds very bad.

Yeah, these writeback folios become unmovable. It makes memory fragmentation
unrecoverable. I do not know why AS_WRITEBACK_INDETERMINATE is allowed, since
it is essentially a forever pin to writeback folios. Why not introduce a
retry and timeout mechanism instead of waiting for the writeback forever?

--
Best Regards,
Yan, Zi

* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
From: Zi Yan @ 2024-12-19 15:08 UTC
To: David Hildenbrand, Joanne Koong
Cc: miklos, linux-fsdevel, shakeel.butt, jefflexu, josef, bernd.schubert, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko

On 19 Dec 2024, at 9:19, Zi Yan wrote:

> On 19 Dec 2024, at 8:05, David Hildenbrand wrote:
>
>> On 23.11.24 00:23, Joanne Koong wrote:
>>> For migrations called in MIGRATE_SYNC mode, skip migrating the folio if
>>> it is under writeback and has the AS_WRITEBACK_INDETERMINATE flag set on its
>>> mapping. If the AS_WRITEBACK_INDETERMINATE flag is set on the mapping, the
>>> writeback may take an indeterminate amount of time to complete, and
>>> waits may get stuck.
>>>
>>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
>>> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
>>> ---
>>>  mm/migrate.c | 5 ++++-
>>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>> index df91248755e4..fe73284e5246 100644
>>> --- a/mm/migrate.c
>>> +++ b/mm/migrate.c
>>> @@ -1260,7 +1260,10 @@ static int migrate_folio_unmap(new_folio_t get_new_folio,
>>>  		 */
>>>  		switch (mode) {
>>>  		case MIGRATE_SYNC:
>>> -			break;
>>> +			if (!src->mapping ||
>>> +			    !mapping_writeback_indeterminate(src->mapping))
>>> +				break;
>>> +			fallthrough;
>>>  		default:
>>>  			rc = -EBUSY;
>>>  			goto out;
>>
>> Ehm, doesn't this mean that any fuse user can essentially completely block CMA allocations, memory compaction, memory hotunplug, memory poisoning... ?!
>>
>> That sounds very bad.
>
> Yeah, these writeback folios become unmovable. It makes memory fragmentation
> unrecoverable. I do not know why AS_WRITEBACK_INDETERMINATE is allowed, since
> it is essentially a forever pin to writeback folios. Why not introduce a
> retry and timeout mechanism instead of waiting for the writeback forever?

If there is no way around such indeterminate writebacks, to avoid fragmenting memory,
these to-be-written-back folios should be migrated to a physically contiguous region. Either you have a preallocated region or get free pages from MIGRATE_UNMOVABLE.

--
Best Regards,
Yan, Zi

* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
From: David Hildenbrand @ 2024-12-19 15:39 UTC
To: Zi Yan, Joanne Koong
Cc: miklos, linux-fsdevel, shakeel.butt, jefflexu, josef, bernd.schubert, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko

On 19.12.24 16:08, Zi Yan wrote:
> On 19 Dec 2024, at 9:19, Zi Yan wrote:
>
>> On 19 Dec 2024, at 8:05, David Hildenbrand wrote:
>>
>>> On 23.11.24 00:23, Joanne Koong wrote:
>>>> For migrations called in MIGRATE_SYNC mode, skip migrating the folio if
>>>> it is under writeback and has the AS_WRITEBACK_INDETERMINATE flag set on its
>>>> mapping. If the AS_WRITEBACK_INDETERMINATE flag is set on the mapping, the
>>>> writeback may take an indeterminate amount of time to complete, and
>>>> waits may get stuck.
>>>>
>>>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
>>>> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
>>>> ---
>>>>  mm/migrate.c | 5 ++++-
>>>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>>> index df91248755e4..fe73284e5246 100644
>>>> --- a/mm/migrate.c
>>>> +++ b/mm/migrate.c
>>>> @@ -1260,7 +1260,10 @@ static int migrate_folio_unmap(new_folio_t get_new_folio,
>>>>  		 */
>>>>  		switch (mode) {
>>>>  		case MIGRATE_SYNC:
>>>> -			break;
>>>> +			if (!src->mapping ||
>>>> +			    !mapping_writeback_indeterminate(src->mapping))
>>>> +				break;
>>>> +			fallthrough;
>>>>  		default:
>>>>  			rc = -EBUSY;
>>>>  			goto out;
>>>
>>> Ehm, doesn't this mean that any fuse user can essentially completely block CMA allocations, memory compaction, memory hotunplug, memory poisoning... ?!
>>>
>>> That sounds very bad.
>>
>> Yeah, these writeback folios become unmovable. It makes memory fragmentation
>> unrecoverable. I do not know why AS_WRITEBACK_INDETERMINATE is allowed, since
>> it is essentially a forever pin to writeback folios. Why not introduce a
>> retry and timeout mechanism instead of waiting for the writeback forever?
>
> If there is no way around such indeterminate writebacks, to avoid fragmenting memory,
> these to-be-written-back folios should be migrated to a physically contiguous region. Either you have a preallocated region or get free pages from MIGRATE_UNMOVABLE.

But at what point?

We surely don't want to make fuse consume only effectively-unmovable memory.

-- 
Cheers,

David / dhildenb

* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
From: Zi Yan @ 2024-12-19 15:47 UTC
To: David Hildenbrand
Cc: Joanne Koong, miklos, linux-fsdevel, shakeel.butt, jefflexu, josef, bernd.schubert, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko

On 19 Dec 2024, at 10:39, David Hildenbrand wrote:

> On 19.12.24 16:08, Zi Yan wrote:
>> On 19 Dec 2024, at 9:19, Zi Yan wrote:
>>
>>> On 19 Dec 2024, at 8:05, David Hildenbrand wrote:
>>>
>>>> On 23.11.24 00:23, Joanne Koong wrote:
>>>>> For migrations called in MIGRATE_SYNC mode, skip migrating the folio if
>>>>> it is under writeback and has the AS_WRITEBACK_INDETERMINATE flag set on its
>>>>> mapping. If the AS_WRITEBACK_INDETERMINATE flag is set on the mapping, the
>>>>> writeback may take an indeterminate amount of time to complete, and
>>>>> waits may get stuck.
>>>>>
>>>>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
>>>>> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
>>>>> ---
>>>>>  mm/migrate.c | 5 ++++-
>>>>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>>>> index df91248755e4..fe73284e5246 100644
>>>>> --- a/mm/migrate.c
>>>>> +++ b/mm/migrate.c
>>>>> @@ -1260,7 +1260,10 @@ static int migrate_folio_unmap(new_folio_t get_new_folio,
>>>>>  		 */
>>>>>  		switch (mode) {
>>>>>  		case MIGRATE_SYNC:
>>>>> -			break;
>>>>> +			if (!src->mapping ||
>>>>> +			    !mapping_writeback_indeterminate(src->mapping))
>>>>> +				break;
>>>>> +			fallthrough;
>>>>>  		default:
>>>>>  			rc = -EBUSY;
>>>>>  			goto out;
>>>>
>>>> Ehm, doesn't this mean that any fuse user can essentially completely block CMA allocations, memory compaction, memory hotunplug, memory poisoning... ?!
>>>>
>>>> That sounds very bad.
>>>
>>> Yeah, these writeback folios become unmovable. It makes memory fragmentation
>>> unrecoverable. I do not know why AS_WRITEBACK_INDETERMINATE is allowed, since
>>> it is essentially a forever pin to writeback folios. Why not introduce a
>>> retry and timeout mechanism instead of waiting for the writeback forever?
>>
>> If there is no way around such indeterminate writebacks, to avoid fragmenting memory,
>> these to-be-written-back folios should be migrated to a physically contiguous region. Either you have a preallocated region or get free pages from MIGRATE_UNMOVABLE.
>
> But at what point?

Before each writeback. And there should be a limit on the amount of unmovable
pages they can allocate.

>
> We surely don't want to make fuse consume only effectively-unmovable memory.

Yes, that is undesirable, but the folio under writeback cannot be migrated,
since migration needs to wait until it finishes. Of course, the right way is to
make writeback interruptible, so that migration can continue, but that routine
might take a lot of effort I suppose. I admit my proposal is more like a bandaid
to minimize the memory fragmentation issue.

--
Best Regards,
Yan, Zi

* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
From: David Hildenbrand @ 2024-12-19 15:50 UTC
To: Zi Yan
Cc: Joanne Koong, miklos, linux-fsdevel, shakeel.butt, jefflexu, josef, bernd.schubert, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko

On 19.12.24 16:47, Zi Yan wrote:
> On 19 Dec 2024, at 10:39, David Hildenbrand wrote:
>
>> On 19.12.24 16:08, Zi Yan wrote:
>>> On 19 Dec 2024, at 9:19, Zi Yan wrote:
>>>
>>>> On 19 Dec 2024, at 8:05, David Hildenbrand wrote:
>>>>
>>>>> On 23.11.24 00:23, Joanne Koong wrote:
>>>>>> For migrations called in MIGRATE_SYNC mode, skip migrating the folio if
>>>>>> it is under writeback and has the AS_WRITEBACK_INDETERMINATE flag set on its
>>>>>> mapping. If the AS_WRITEBACK_INDETERMINATE flag is set on the mapping, the
>>>>>> writeback may take an indeterminate amount of time to complete, and
>>>>>> waits may get stuck.
>>>>>>
>>>>>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
>>>>>> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
>>>>>> ---
>>>>>>  mm/migrate.c | 5 ++++-
>>>>>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>>>>> index df91248755e4..fe73284e5246 100644
>>>>>> --- a/mm/migrate.c
>>>>>> +++ b/mm/migrate.c
>>>>>> @@ -1260,7 +1260,10 @@ static int migrate_folio_unmap(new_folio_t get_new_folio,
>>>>>>  		 */
>>>>>>  		switch (mode) {
>>>>>>  		case MIGRATE_SYNC:
>>>>>> -			break;
>>>>>> +			if (!src->mapping ||
>>>>>> +			    !mapping_writeback_indeterminate(src->mapping))
>>>>>> +				break;
>>>>>> +			fallthrough;
>>>>>>  		default:
>>>>>>  			rc = -EBUSY;
>>>>>>  			goto out;
>>>>>
>>>>> Ehm, doesn't this mean that any fuse user can essentially completely block CMA allocations, memory compaction, memory hotunplug, memory poisoning... ?!
>>>>>
>>>>> That sounds very bad.
>>>>
>>>> Yeah, these writeback folios become unmovable. It makes memory fragmentation
>>>> unrecoverable. I do not know why AS_WRITEBACK_INDETERMINATE is allowed, since
>>>> it is essentially a forever pin to writeback folios. Why not introduce a
>>>> retry and timeout mechanism instead of waiting for the writeback forever?
>>>
>>> If there is no way around such indeterminate writebacks, to avoid fragmenting memory,
>>> these to-be-written-back folios should be migrated to a physically contiguous region. Either you have a preallocated region or get free pages from MIGRATE_UNMOVABLE.
>>
>> But at what point?
>
> Before each writeback. And there should be a limit on the amount of unmovable
> pages they can allocate.

The question is if that is then still a performance win :)

But yes, we can avoid another migration if we are already on
allows-movable memory.

>
>>
>> We surely don't want to make fuse consume only effectively-unmovable memory.
>
> Yes, that is undesirable, but the folio under writeback cannot be migrated,
> since migration needs to wait until it finishes.

Right, and currently that works by immediately marking the folio clean
again (IIUC after reading the cover letter).

-- 
Cheers,

David / dhildenb

* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2024-12-19 13:05 ` David Hildenbrand 2024-12-19 14:19 ` Zi Yan @ 2024-12-19 15:43 ` Shakeel Butt 2024-12-19 15:47 ` David Hildenbrand 2025-04-02 21:34 ` Joanne Koong 2 siblings, 1 reply; 124+ messages in thread From: Shakeel Butt @ 2024-12-19 15:43 UTC (permalink / raw) To: David Hildenbrand Cc: Joanne Koong, miklos, linux-fsdevel, jefflexu, josef, bernd.schubert, linux-mm, kernel-team, Matthew Wilcox, Zi Yan, Oscar Salvador, Michal Hocko On Thu, Dec 19, 2024 at 02:05:04PM +0100, David Hildenbrand wrote: > On 23.11.24 00:23, Joanne Koong wrote: > > For migrations called in MIGRATE_SYNC mode, skip migrating the folio if > > it is under writeback and has the AS_WRITEBACK_INDETERMINATE flag set on its > > mapping. If the AS_WRITEBACK_INDETERMINATE flag is set on the mapping, the > > writeback may take an indeterminate amount of time to complete, and > > waits may get stuck. > > > > Signed-off-by: Joanne Koong <joannelkoong@gmail.com> > > Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> > > --- > > mm/migrate.c | 5 ++++- > > 1 file changed, 4 insertions(+), 1 deletion(-) > > > > diff --git a/mm/migrate.c b/mm/migrate.c > > index df91248755e4..fe73284e5246 100644 > > --- a/mm/migrate.c > > +++ b/mm/migrate.c > > @@ -1260,7 +1260,10 @@ static int migrate_folio_unmap(new_folio_t get_new_folio, > > */ > > switch (mode) { > > case MIGRATE_SYNC: > > - break; > > + if (!src->mapping || > > + !mapping_writeback_indeterminate(src->mapping)) > > + break; > > + fallthrough; > > default: > > rc = -EBUSY; > > goto out; > > Ehm, doesn't this mean that any fuse user can essentially completely block > CMA allocations, memory compaction, memory hotunplug, memory poisoning... ?! > > That sounds very bad. The page under writeback are already unmovable while they are under writeback. 
This patch is only making potentially unrelated tasks to synchronously wait on writeback completion for such pages which in worst case can be indefinite. This actually is solving an isolation issue on a multi-tenant machine.
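For readers following the hunk quoted above, the decision it changes can be modelled in a few lines of userspace C. This is only a sketch: the `*_model` types and `writeback_decision()` are hypothetical stand-ins rather than kernel definitions, the enclosing "folio is under writeback" check is assumed from context because the quoted hunk does not show it, and any other conditions in the real function are left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; not the real definitions. */
enum migrate_mode_model {
    MODEL_MIGRATE_ASYNC,
    MODEL_MIGRATE_SYNC_LIGHT,
    MODEL_MIGRATE_SYNC,
};

struct mapping_model {
    bool writeback_indeterminate;   /* models AS_WRITEBACK_INDETERMINATE */
};

struct folio_model {
    bool under_writeback;
    const struct mapping_model *mapping;   /* may be NULL */
};

/*
 * Returns 0 if migration may continue and -EBUSY if the folio is skipped.
 * *would_wait is set when the caller would block on writeback completion.
 */
static int writeback_decision(const struct folio_model *src,
                              enum migrate_mode_model mode,
                              bool *would_wait)
{
    assert(src != NULL && would_wait != NULL);
    *would_wait = false;

    if (!src->under_writeback)
        return 0;

    /* Only MIGRATE_SYNC is willing to wait for writeback at all. */
    if (mode != MODEL_MIGRATE_SYNC)
        return -EBUSY;

    /* Behaviour added by the patch: never wait on an indeterminate mapping. */
    if (src->mapping && src->mapping->writeback_indeterminate)
        return -EBUSY;

    *would_wait = true;
    return 0;
}
```

The two early returns express the same outcome as the patch's `fallthrough` into the `default:` case: a folio under writeback on a flagged mapping is reported busy in every mode, which is why the replies that follow focus on how long such a folio can stay unmovable.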
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2024-12-19 15:43 ` Shakeel Butt @ 2024-12-19 15:47 ` David Hildenbrand 2024-12-19 15:53 ` Shakeel Butt 0 siblings, 1 reply; 124+ messages in thread From: David Hildenbrand @ 2024-12-19 15:47 UTC (permalink / raw) To: Shakeel Butt Cc: Joanne Koong, miklos, linux-fsdevel, jefflexu, josef, bernd.schubert, linux-mm, kernel-team, Matthew Wilcox, Zi Yan, Oscar Salvador, Michal Hocko On 19.12.24 16:43, Shakeel Butt wrote: > On Thu, Dec 19, 2024 at 02:05:04PM +0100, David Hildenbrand wrote: >> On 23.11.24 00:23, Joanne Koong wrote: >>> For migrations called in MIGRATE_SYNC mode, skip migrating the folio if >>> it is under writeback and has the AS_WRITEBACK_INDETERMINATE flag set on its >>> mapping. If the AS_WRITEBACK_INDETERMINATE flag is set on the mapping, the >>> writeback may take an indeterminate amount of time to complete, and >>> waits may get stuck. >>> >>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com> >>> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> >>> --- >>> mm/migrate.c | 5 ++++- >>> 1 file changed, 4 insertions(+), 1 deletion(-) >>> >>> diff --git a/mm/migrate.c b/mm/migrate.c >>> index df91248755e4..fe73284e5246 100644 >>> --- a/mm/migrate.c >>> +++ b/mm/migrate.c >>> @@ -1260,7 +1260,10 @@ static int migrate_folio_unmap(new_folio_t get_new_folio, >>> */ >>> switch (mode) { >>> case MIGRATE_SYNC: >>> - break; >>> + if (!src->mapping || >>> + !mapping_writeback_indeterminate(src->mapping)) >>> + break; >>> + fallthrough; >>> default: >>> rc = -EBUSY; >>> goto out; >> >> Ehm, doesn't this mean that any fuse user can essentially completely block >> CMA allocations, memory compaction, memory hotunplug, memory poisoning... ?! >> >> That sounds very bad. > > The page under writeback are already unmovable while they are under > writeback. 
This patch is only making potentially unrelated tasks to > synchronously wait on writeback completion for such pages which in worst > case can be indefinite. This actually is solving an isolation issue on a > multi-tenant machine. > Are you sure, because I read in the cover letter: "In the current FUSE writeback design (see commit 3be5a52b30aa ("fuse: support writable mmap"))), a temp page is allocated for every dirty page to be written back, the contents of the dirty page are copied over to the temp page, and the temp page gets handed to the server to write back. This is done so that writeback may be immediately cleared on the dirty page," Which to me means that they are immediately movable again? -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2024-12-19 15:47 ` David Hildenbrand @ 2024-12-19 15:53 ` Shakeel Butt 2024-12-19 15:55 ` Zi Yan 0 siblings, 1 reply; 124+ messages in thread From: Shakeel Butt @ 2024-12-19 15:53 UTC (permalink / raw) To: David Hildenbrand Cc: Joanne Koong, miklos, linux-fsdevel, jefflexu, josef, bernd.schubert, linux-mm, kernel-team, Matthew Wilcox, Zi Yan, Oscar Salvador, Michal Hocko On Thu, Dec 19, 2024 at 04:47:18PM +0100, David Hildenbrand wrote: > On 19.12.24 16:43, Shakeel Butt wrote: > > On Thu, Dec 19, 2024 at 02:05:04PM +0100, David Hildenbrand wrote: > > > On 23.11.24 00:23, Joanne Koong wrote: > > > > For migrations called in MIGRATE_SYNC mode, skip migrating the folio if > > > > it is under writeback and has the AS_WRITEBACK_INDETERMINATE flag set on its > > > > mapping. If the AS_WRITEBACK_INDETERMINATE flag is set on the mapping, the > > > > writeback may take an indeterminate amount of time to complete, and > > > > waits may get stuck. > > > > > > > > Signed-off-by: Joanne Koong <joannelkoong@gmail.com> > > > > Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> > > > > --- > > > > mm/migrate.c | 5 ++++- > > > > 1 file changed, 4 insertions(+), 1 deletion(-) > > > > > > > > diff --git a/mm/migrate.c b/mm/migrate.c > > > > index df91248755e4..fe73284e5246 100644 > > > > --- a/mm/migrate.c > > > > +++ b/mm/migrate.c > > > > @@ -1260,7 +1260,10 @@ static int migrate_folio_unmap(new_folio_t get_new_folio, > > > > */ > > > > switch (mode) { > > > > case MIGRATE_SYNC: > > > > - break; > > > > + if (!src->mapping || > > > > + !mapping_writeback_indeterminate(src->mapping)) > > > > + break; > > > > + fallthrough; > > > > default: > > > > rc = -EBUSY; > > > > goto out; > > > > > > Ehm, doesn't this mean that any fuse user can essentially completely block > > > CMA allocations, memory compaction, memory hotunplug, memory poisoning... ?! 
> > > > > > That sounds very bad. > > > > The page under writeback are already unmovable while they are under > > writeback. This patch is only making potentially unrelated tasks to > > synchronously wait on writeback completion for such pages which in worst > > case can be indefinite. This actually is solving an isolation issue on a > > multi-tenant machine. > > > Are you sure, because I read in the cover letter: > > "In the current FUSE writeback design (see commit 3be5a52b30aa ("fuse: > support writable mmap"))), a temp page is allocated for every dirty > page to be written back, the contents of the dirty page are copied over to > the temp page, and the temp page gets handed to the server to write back. > This is done so that writeback may be immediately cleared on the dirty > page," > > Which to me means that they are immediately movable again? Oh sorry, my mistake, yes this will become an isolation issue with the removal of the temp page in-between which this series is doing. I think the tradeoff is between extra memory plus slow write performance versus temporary unmovable memory. > > -- > Cheers, > > David / dhildenb > > ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2024-12-19 15:53 ` Shakeel Butt @ 2024-12-19 15:55 ` Zi Yan 2024-12-19 15:56 ` Bernd Schubert 2024-12-19 16:22 ` Shakeel Butt 0 siblings, 2 replies; 124+ messages in thread From: Zi Yan @ 2024-12-19 15:55 UTC (permalink / raw) To: Shakeel Butt Cc: David Hildenbrand, Joanne Koong, miklos, linux-fsdevel, jefflexu, josef, bernd.schubert, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On 19 Dec 2024, at 10:53, Shakeel Butt wrote: > On Thu, Dec 19, 2024 at 04:47:18PM +0100, David Hildenbrand wrote: >> On 19.12.24 16:43, Shakeel Butt wrote: >>> On Thu, Dec 19, 2024 at 02:05:04PM +0100, David Hildenbrand wrote: >>>> On 23.11.24 00:23, Joanne Koong wrote: >>>>> For migrations called in MIGRATE_SYNC mode, skip migrating the folio if >>>>> it is under writeback and has the AS_WRITEBACK_INDETERMINATE flag set on its >>>>> mapping. If the AS_WRITEBACK_INDETERMINATE flag is set on the mapping, the >>>>> writeback may take an indeterminate amount of time to complete, and >>>>> waits may get stuck. >>>>> >>>>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com> >>>>> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> >>>>> --- >>>>> mm/migrate.c | 5 ++++- >>>>> 1 file changed, 4 insertions(+), 1 deletion(-) >>>>> >>>>> diff --git a/mm/migrate.c b/mm/migrate.c >>>>> index df91248755e4..fe73284e5246 100644 >>>>> --- a/mm/migrate.c >>>>> +++ b/mm/migrate.c >>>>> @@ -1260,7 +1260,10 @@ static int migrate_folio_unmap(new_folio_t get_new_folio, >>>>> */ >>>>> switch (mode) { >>>>> case MIGRATE_SYNC: >>>>> - break; >>>>> + if (!src->mapping || >>>>> + !mapping_writeback_indeterminate(src->mapping)) >>>>> + break; >>>>> + fallthrough; >>>>> default: >>>>> rc = -EBUSY; >>>>> goto out; >>>> >>>> Ehm, doesn't this mean that any fuse user can essentially completely block >>>> CMA allocations, memory compaction, memory hotunplug, memory poisoning... ?! 
>>>> >>>> That sounds very bad. >>> >>> The page under writeback are already unmovable while they are under >>> writeback. This patch is only making potentially unrelated tasks to >>> synchronously wait on writeback completion for such pages which in worst >>> case can be indefinite. This actually is solving an isolation issue on a >>> multi-tenant machine. >>> >> Are you sure, because I read in the cover letter: >> >> "In the current FUSE writeback design (see commit 3be5a52b30aa ("fuse: >> support writable mmap"))), a temp page is allocated for every dirty >> page to be written back, the contents of the dirty page are copied over to >> the temp page, and the temp page gets handed to the server to write back. >> This is done so that writeback may be immediately cleared on the dirty >> page," >> >> Which to me means that they are immediately movable again? > > Oh sorry, my mistake, yes this will become an isolation issue with the > removal of the temp page in-between which this series is doing. I think > the tradeoff is between extra memory plus slow write performance versus > temporary unmovable memory. No, the tradeoff is slow FUSE performance vs whole system slowdown due to memory fragmentation. AS_WRITEBACK_INDETERMINATE indicates it is not temporary. -- Best Regards, Yan, Zi ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2024-12-19 15:55 ` Zi Yan @ 2024-12-19 15:56 ` Bernd Schubert 2024-12-19 16:00 ` Zi Yan 2024-12-19 16:22 ` Shakeel Butt 1 sibling, 1 reply; 124+ messages in thread From: Bernd Schubert @ 2024-12-19 15:56 UTC (permalink / raw) To: Zi Yan, Shakeel Butt Cc: David Hildenbrand, Joanne Koong, miklos, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On 12/19/24 16:55, Zi Yan wrote: > On 19 Dec 2024, at 10:53, Shakeel Butt wrote: > >> On Thu, Dec 19, 2024 at 04:47:18PM +0100, David Hildenbrand wrote: >>> On 19.12.24 16:43, Shakeel Butt wrote: >>>> On Thu, Dec 19, 2024 at 02:05:04PM +0100, David Hildenbrand wrote: >>>>> On 23.11.24 00:23, Joanne Koong wrote: >>>>>> For migrations called in MIGRATE_SYNC mode, skip migrating the folio if >>>>>> it is under writeback and has the AS_WRITEBACK_INDETERMINATE flag set on its >>>>>> mapping. If the AS_WRITEBACK_INDETERMINATE flag is set on the mapping, the >>>>>> writeback may take an indeterminate amount of time to complete, and >>>>>> waits may get stuck. 
>>>>>> >>>>>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com> >>>>>> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> >>>>>> --- >>>>>> mm/migrate.c | 5 ++++- >>>>>> 1 file changed, 4 insertions(+), 1 deletion(-) >>>>>> >>>>>> diff --git a/mm/migrate.c b/mm/migrate.c >>>>>> index df91248755e4..fe73284e5246 100644 >>>>>> --- a/mm/migrate.c >>>>>> +++ b/mm/migrate.c >>>>>> @@ -1260,7 +1260,10 @@ static int migrate_folio_unmap(new_folio_t get_new_folio, >>>>>> */ >>>>>> switch (mode) { >>>>>> case MIGRATE_SYNC: >>>>>> - break; >>>>>> + if (!src->mapping || >>>>>> + !mapping_writeback_indeterminate(src->mapping)) >>>>>> + break; >>>>>> + fallthrough; >>>>>> default: >>>>>> rc = -EBUSY; >>>>>> goto out; >>>>> >>>>> Ehm, doesn't this mean that any fuse user can essentially completely block >>>>> CMA allocations, memory compaction, memory hotunplug, memory poisoning... ?! >>>>> >>>>> That sounds very bad. >>>> >>>> The page under writeback are already unmovable while they are under >>>> writeback. This patch is only making potentially unrelated tasks to >>>> synchronously wait on writeback completion for such pages which in worst >>>> case can be indefinite. This actually is solving an isolation issue on a >>>> multi-tenant machine. >>>> >>> Are you sure, because I read in the cover letter: >>> >>> "In the current FUSE writeback design (see commit 3be5a52b30aa ("fuse: >>> support writable mmap"))), a temp page is allocated for every dirty >>> page to be written back, the contents of the dirty page are copied over to >>> the temp page, and the temp page gets handed to the server to write back. >>> This is done so that writeback may be immediately cleared on the dirty >>> page," >>> >>> Which to me means that they are immediately movable again? >> >> Oh sorry, my mistake, yes this will become an isolation issue with the >> removal of the temp page in-between which this series is doing. 
I think >> the tradeoff is between extra memory plus slow write performance versus >> temporary unmovable memory. > > No, the tradeoff is slow FUSE performance vs whole system slowdown due to > memory fragmentation. AS_WRITEBACK_INDETERMINATE indicates it is not > temporary. Is there a difference between FUSE TMP page being unmovable and AS_WRITEBACK_INDETERMINATE folios/pages being unmovable? Thanks, Bernd
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2024-12-19 15:56 ` Bernd Schubert @ 2024-12-19 16:00 ` Zi Yan 2024-12-19 16:02 ` Zi Yan 0 siblings, 1 reply; 124+ messages in thread From: Zi Yan @ 2024-12-19 16:00 UTC (permalink / raw) To: Bernd Schubert Cc: Shakeel Butt, David Hildenbrand, Joanne Koong, miklos, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko -- Best Regards, Yan, Zi On 19 Dec 2024, at 10:56, Bernd Schubert wrote: > On 12/19/24 16:55, Zi Yan wrote: >> On 19 Dec 2024, at 10:53, Shakeel Butt wrote: >> >>> On Thu, Dec 19, 2024 at 04:47:18PM +0100, David Hildenbrand wrote: >>>> On 19.12.24 16:43, Shakeel Butt wrote: >>>>> On Thu, Dec 19, 2024 at 02:05:04PM +0100, David Hildenbrand wrote: >>>>>> On 23.11.24 00:23, Joanne Koong wrote: >>>>>>> For migrations called in MIGRATE_SYNC mode, skip migrating the folio if >>>>>>> it is under writeback and has the AS_WRITEBACK_INDETERMINATE flag set on its >>>>>>> mapping. If the AS_WRITEBACK_INDETERMINATE flag is set on the mapping, the >>>>>>> writeback may take an indeterminate amount of time to complete, and >>>>>>> waits may get stuck. 
>>>>>>> >>>>>>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com> >>>>>>> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> >>>>>>> --- >>>>>>> mm/migrate.c | 5 ++++- >>>>>>> 1 file changed, 4 insertions(+), 1 deletion(-) >>>>>>> >>>>>>> diff --git a/mm/migrate.c b/mm/migrate.c >>>>>>> index df91248755e4..fe73284e5246 100644 >>>>>>> --- a/mm/migrate.c >>>>>>> +++ b/mm/migrate.c >>>>>>> @@ -1260,7 +1260,10 @@ static int migrate_folio_unmap(new_folio_t get_new_folio, >>>>>>> */ >>>>>>> switch (mode) { >>>>>>> case MIGRATE_SYNC: >>>>>>> - break; >>>>>>> + if (!src->mapping || >>>>>>> + !mapping_writeback_indeterminate(src->mapping)) >>>>>>> + break; >>>>>>> + fallthrough; >>>>>>> default: >>>>>>> rc = -EBUSY; >>>>>>> goto out; >>>>>> >>>>>> Ehm, doesn't this mean that any fuse user can essentially completely block >>>>>> CMA allocations, memory compaction, memory hotunplug, memory poisoning... ?! >>>>>> >>>>>> That sounds very bad. >>>>> >>>>> The page under writeback are already unmovable while they are under >>>>> writeback. This patch is only making potentially unrelated tasks to >>>>> synchronously wait on writeback completion for such pages which in worst >>>>> case can be indefinite. This actually is solving an isolation issue on a >>>>> multi-tenant machine. >>>>> >>>> Are you sure, because I read in the cover letter: >>>> >>>> "In the current FUSE writeback design (see commit 3be5a52b30aa ("fuse: >>>> support writable mmap"))), a temp page is allocated for every dirty >>>> page to be written back, the contents of the dirty page are copied over to >>>> the temp page, and the temp page gets handed to the server to write back. >>>> This is done so that writeback may be immediately cleared on the dirty >>>> page," >>>> >>>> Which to me means that they are immediately movable again? >>> >>> Oh sorry, my mistake, yes this will become an isolation issue with the >>> removal of the temp page in-between which this series is doing. 
I think >>> the tradeoff is between extra memory plus slow write performance versus >>> temporary unmovable memory. >> >> No, the tradeoff is slow FUSE performance vs whole system slowdown due to >> memory fragmentation. AS_WRITEBACK_INDETERMINATE indicates it is not >> temporary. > > Is there is a difference between FUSE TMP page being unmovable and > AS_WRITEBACK_INDETERMINATE folios/pages being unmovable? Both are unmovable, but you can control where FUSE TMP pages come from, to avoid spreading them across the entire memory space. For example, allocate a contiguous region as a TMP page pool.
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2024-12-19 16:00 ` Zi Yan @ 2024-12-19 16:02 ` Zi Yan 2024-12-19 16:09 ` Bernd Schubert 0 siblings, 1 reply; 124+ messages in thread From: Zi Yan @ 2024-12-19 16:02 UTC (permalink / raw) To: Bernd Schubert Cc: Shakeel Butt, David Hildenbrand, Joanne Koong, miklos, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On 19 Dec 2024, at 11:00, Zi Yan wrote: > On 19 Dec 2024, at 10:56, Bernd Schubert wrote: > >> On 12/19/24 16:55, Zi Yan wrote: >>> On 19 Dec 2024, at 10:53, Shakeel Butt wrote: >>> >>>> On Thu, Dec 19, 2024 at 04:47:18PM +0100, David Hildenbrand wrote: >>>>> On 19.12.24 16:43, Shakeel Butt wrote: >>>>>> On Thu, Dec 19, 2024 at 02:05:04PM +0100, David Hildenbrand wrote: >>>>>>> On 23.11.24 00:23, Joanne Koong wrote: >>>>>>>> For migrations called in MIGRATE_SYNC mode, skip migrating the folio if >>>>>>>> it is under writeback and has the AS_WRITEBACK_INDETERMINATE flag set on its >>>>>>>> mapping. If the AS_WRITEBACK_INDETERMINATE flag is set on the mapping, the >>>>>>>> writeback may take an indeterminate amount of time to complete, and >>>>>>>> waits may get stuck. 
>>>>>>>> >>>>>>>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com> >>>>>>>> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> >>>>>>>> --- >>>>>>>> mm/migrate.c | 5 ++++- >>>>>>>> 1 file changed, 4 insertions(+), 1 deletion(-) >>>>>>>> >>>>>>>> diff --git a/mm/migrate.c b/mm/migrate.c >>>>>>>> index df91248755e4..fe73284e5246 100644 >>>>>>>> --- a/mm/migrate.c >>>>>>>> +++ b/mm/migrate.c >>>>>>>> @@ -1260,7 +1260,10 @@ static int migrate_folio_unmap(new_folio_t get_new_folio, >>>>>>>> */ >>>>>>>> switch (mode) { >>>>>>>> case MIGRATE_SYNC: >>>>>>>> - break; >>>>>>>> + if (!src->mapping || >>>>>>>> + !mapping_writeback_indeterminate(src->mapping)) >>>>>>>> + break; >>>>>>>> + fallthrough; >>>>>>>> default: >>>>>>>> rc = -EBUSY; >>>>>>>> goto out; >>>>>>> >>>>>>> Ehm, doesn't this mean that any fuse user can essentially completely block >>>>>>> CMA allocations, memory compaction, memory hotunplug, memory poisoning... ?! >>>>>>> >>>>>>> That sounds very bad. >>>>>> >>>>>> The page under writeback are already unmovable while they are under >>>>>> writeback. This patch is only making potentially unrelated tasks to >>>>>> synchronously wait on writeback completion for such pages which in worst >>>>>> case can be indefinite. This actually is solving an isolation issue on a >>>>>> multi-tenant machine. >>>>>> >>>>> Are you sure, because I read in the cover letter: >>>>> >>>>> "In the current FUSE writeback design (see commit 3be5a52b30aa ("fuse: >>>>> support writable mmap"))), a temp page is allocated for every dirty >>>>> page to be written back, the contents of the dirty page are copied over to >>>>> the temp page, and the temp page gets handed to the server to write back. >>>>> This is done so that writeback may be immediately cleared on the dirty >>>>> page," >>>>> >>>>> Which to me means that they are immediately movable again? 
>>>> >>>> Oh sorry, my mistake, yes this will become an isolation issue with the >>>> removal of the temp page in-between which this series is doing. I think >>>> the tradeoff is between extra memory plus slow write performance versus >>>> temporary unmovable memory. >>> >>> No, the tradeoff is slow FUSE performance vs whole system slowdown due to >>> memory fragmentation. AS_WRITEBACK_INDETERMINATE indicates it is not >>> temporary. >> >> Is there is a difference between FUSE TMP page being unmovable and >> AS_WRITEBACK_INDETERMINATE folios/pages being unmovable? (Fix my response location) Both are unmovable, but you can control where FUSE TMP pages come from, to avoid spreading them across the entire memory space. For example, allocate a contiguous region as a TMP page pool. -- Best Regards, Yan, Zi
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2024-12-19 16:02 ` Zi Yan @ 2024-12-19 16:09 ` Bernd Schubert 2024-12-19 16:14 ` Zi Yan 0 siblings, 1 reply; 124+ messages in thread From: Bernd Schubert @ 2024-12-19 16:09 UTC (permalink / raw) To: Zi Yan Cc: Shakeel Butt, David Hildenbrand, Joanne Koong, miklos, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On 12/19/24 17:02, Zi Yan wrote: > On 19 Dec 2024, at 11:00, Zi Yan wrote: >> On 19 Dec 2024, at 10:56, Bernd Schubert wrote: >> >>> On 12/19/24 16:55, Zi Yan wrote: >>>> On 19 Dec 2024, at 10:53, Shakeel Butt wrote: >>>> >>>>> On Thu, Dec 19, 2024 at 04:47:18PM +0100, David Hildenbrand wrote: >>>>>> On 19.12.24 16:43, Shakeel Butt wrote: >>>>>>> On Thu, Dec 19, 2024 at 02:05:04PM +0100, David Hildenbrand wrote: >>>>>>>> On 23.11.24 00:23, Joanne Koong wrote: >>>>>>>>> For migrations called in MIGRATE_SYNC mode, skip migrating the folio if >>>>>>>>> it is under writeback and has the AS_WRITEBACK_INDETERMINATE flag set on its >>>>>>>>> mapping. If the AS_WRITEBACK_INDETERMINATE flag is set on the mapping, the >>>>>>>>> writeback may take an indeterminate amount of time to complete, and >>>>>>>>> waits may get stuck. 
>>>>>>>>> >>>>>>>>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com> >>>>>>>>> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> >>>>>>>>> --- >>>>>>>>> mm/migrate.c | 5 ++++- >>>>>>>>> 1 file changed, 4 insertions(+), 1 deletion(-) >>>>>>>>> >>>>>>>>> diff --git a/mm/migrate.c b/mm/migrate.c >>>>>>>>> index df91248755e4..fe73284e5246 100644 >>>>>>>>> --- a/mm/migrate.c >>>>>>>>> +++ b/mm/migrate.c >>>>>>>>> @@ -1260,7 +1260,10 @@ static int migrate_folio_unmap(new_folio_t get_new_folio, >>>>>>>>> */ >>>>>>>>> switch (mode) { >>>>>>>>> case MIGRATE_SYNC: >>>>>>>>> - break; >>>>>>>>> + if (!src->mapping || >>>>>>>>> + !mapping_writeback_indeterminate(src->mapping)) >>>>>>>>> + break; >>>>>>>>> + fallthrough; >>>>>>>>> default: >>>>>>>>> rc = -EBUSY; >>>>>>>>> goto out; >>>>>>>> >>>>>>>> Ehm, doesn't this mean that any fuse user can essentially completely block >>>>>>>> CMA allocations, memory compaction, memory hotunplug, memory poisoning... ?! >>>>>>>> >>>>>>>> That sounds very bad. >>>>>>> >>>>>>> The page under writeback are already unmovable while they are under >>>>>>> writeback. This patch is only making potentially unrelated tasks to >>>>>>> synchronously wait on writeback completion for such pages which in worst >>>>>>> case can be indefinite. This actually is solving an isolation issue on a >>>>>>> multi-tenant machine. >>>>>>> >>>>>> Are you sure, because I read in the cover letter: >>>>>> >>>>>> "In the current FUSE writeback design (see commit 3be5a52b30aa ("fuse: >>>>>> support writable mmap"))), a temp page is allocated for every dirty >>>>>> page to be written back, the contents of the dirty page are copied over to >>>>>> the temp page, and the temp page gets handed to the server to write back. >>>>>> This is done so that writeback may be immediately cleared on the dirty >>>>>> page," >>>>>> >>>>>> Which to me means that they are immediately movable again? 
>>>>> >>>>> Oh sorry, my mistake, yes this will become an isolation issue with the >>>>> removal of the temp page in-between which this series is doing. I think >>>>> the tradeoff is between extra memory plus slow write performance versus >>>>> temporary unmovable memory. >>>> >>>> No, the tradeoff is slow FUSE performance vs whole system slowdown due to >>>> memory fragmentation. AS_WRITEBACK_INDETERMINATE indicates it is not >>>> temporary. >>> >>> Is there is a difference between FUSE TMP page being unmovable and >>> AS_WRITEBACK_INDETERMINATE folios/pages being unmovable? > > (Fix my response location) > > Both are unmovable, but you can control where FUSE TMP page > can come from to avoid spread across the entire memory space. For example, > allocate a contiguous region as a TMP page pool. Wouldn't it make sense to have that for fuse writeback pages as well? Fuse tries to limit dirty pages anyway. Thanks, Bernd
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2024-12-19 16:09 ` Bernd Schubert @ 2024-12-19 16:14 ` Zi Yan 2024-12-19 16:26 ` Shakeel Butt 0 siblings, 1 reply; 124+ messages in thread From: Zi Yan @ 2024-12-19 16:14 UTC (permalink / raw) To: Bernd Schubert Cc: Shakeel Butt, David Hildenbrand, Joanne Koong, miklos, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On 19 Dec 2024, at 11:09, Bernd Schubert wrote: > On 12/19/24 17:02, Zi Yan wrote: >> On 19 Dec 2024, at 11:00, Zi Yan wrote: >>> On 19 Dec 2024, at 10:56, Bernd Schubert wrote: >>> >>>> On 12/19/24 16:55, Zi Yan wrote: >>>>> On 19 Dec 2024, at 10:53, Shakeel Butt wrote: >>>>> >>>>>> On Thu, Dec 19, 2024 at 04:47:18PM +0100, David Hildenbrand wrote: >>>>>>> On 19.12.24 16:43, Shakeel Butt wrote: >>>>>>>> On Thu, Dec 19, 2024 at 02:05:04PM +0100, David Hildenbrand wrote: >>>>>>>>> On 23.11.24 00:23, Joanne Koong wrote: >>>>>>>>>> For migrations called in MIGRATE_SYNC mode, skip migrating the folio if >>>>>>>>>> it is under writeback and has the AS_WRITEBACK_INDETERMINATE flag set on its >>>>>>>>>> mapping. If the AS_WRITEBACK_INDETERMINATE flag is set on the mapping, the >>>>>>>>>> writeback may take an indeterminate amount of time to complete, and >>>>>>>>>> waits may get stuck. 
>>>>>>>>>> >>>>>>>>>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com> >>>>>>>>>> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> >>>>>>>>>> --- >>>>>>>>>> mm/migrate.c | 5 ++++- >>>>>>>>>> 1 file changed, 4 insertions(+), 1 deletion(-) >>>>>>>>>> >>>>>>>>>> diff --git a/mm/migrate.c b/mm/migrate.c >>>>>>>>>> index df91248755e4..fe73284e5246 100644 >>>>>>>>>> --- a/mm/migrate.c >>>>>>>>>> +++ b/mm/migrate.c >>>>>>>>>> @@ -1260,7 +1260,10 @@ static int migrate_folio_unmap(new_folio_t get_new_folio, >>>>>>>>>> */ >>>>>>>>>> switch (mode) { >>>>>>>>>> case MIGRATE_SYNC: >>>>>>>>>> - break; >>>>>>>>>> + if (!src->mapping || >>>>>>>>>> + !mapping_writeback_indeterminate(src->mapping)) >>>>>>>>>> + break; >>>>>>>>>> + fallthrough; >>>>>>>>>> default: >>>>>>>>>> rc = -EBUSY; >>>>>>>>>> goto out; >>>>>>>>> >>>>>>>>> Ehm, doesn't this mean that any fuse user can essentially completely block >>>>>>>>> CMA allocations, memory compaction, memory hotunplug, memory poisoning... ?! >>>>>>>>> >>>>>>>>> That sounds very bad. >>>>>>>> >>>>>>>> The page under writeback are already unmovable while they are under >>>>>>>> writeback. This patch is only making potentially unrelated tasks to >>>>>>>> synchronously wait on writeback completion for such pages which in worst >>>>>>>> case can be indefinite. This actually is solving an isolation issue on a >>>>>>>> multi-tenant machine. >>>>>>>> >>>>>>> Are you sure, because I read in the cover letter: >>>>>>> >>>>>>> "In the current FUSE writeback design (see commit 3be5a52b30aa ("fuse: >>>>>>> support writable mmap"))), a temp page is allocated for every dirty >>>>>>> page to be written back, the contents of the dirty page are copied over to >>>>>>> the temp page, and the temp page gets handed to the server to write back. >>>>>>> This is done so that writeback may be immediately cleared on the dirty >>>>>>> page," >>>>>>> >>>>>>> Which to me means that they are immediately movable again? 
>>>>>> >>>>>> Oh sorry, my mistake, yes this will become an isolation issue with the >>>>>> removal of the temp page in-between which this series is doing. I think >>>>>> the tradeoff is between extra memory plus slow write performance versus >>>>>> temporary unmovable memory. >>>>> >>>>> No, the tradeoff is slow FUSE performance vs whole system slowdown due to >>>>> memory fragmentation. AS_WRITEBACK_INDETERMINATE indicates it is not >>>>> temporary. >>>> >>>> Is there is a difference between FUSE TMP page being unmovable and >>>> AS_WRITEBACK_INDETERMINATE folios/pages being unmovable? >> >> (Fix my response location) >> >> Both are unmovable, but you can control where FUSE TMP page >> can come from to avoid spread across the entire memory space. For example, >> allocate a contiguous region as a TMP page pool. > > Wouldn't it make sense to have that for fuse writeback pages as well? > Fuse tries to limit dirty pages anyway. Can fuse constrain the location of writeback pages? Something like what I proposed[1], migrating pages to a designated location before their writeback? Would that be a performance concern? In terms of the number of dirty pages, you only need one page out of 512 pages to prevent a 2MB THP from being allocated. For CMA allocation, one unmovable page can kill one contiguous range. What is the limit of fuse dirty pages? [1] https://lore.kernel.org/linux-mm/90C41581-179F-40B6-9801-9C9DBBEB1AF4@nvidia.com/ -- Best Regards, Yan, Zi
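The "one page out of 512" arithmetic above can be made concrete with a small worked sketch. It assumes 4 KiB base pages, so that a 2 MiB THP spans 512 pages; the helper names are hypothetical, and the two functions only bound the worst case (every unmovable page lands in a different 2 MiB range) and the pooled case (unmovable pages packed into one contiguous region). They do not model the real allocator.

```c
#include <assert.h>

/* Assumption: 4 KiB base pages, so one 2 MiB THP spans 512 pages. */
#define BASE_PAGE_BYTES 4096UL
#define THP_BYTES       (2UL * 1024UL * 1024UL)
#define PAGES_PER_THP   (THP_BYTES / BASE_PAGE_BYTES)

/*
 * Worst case: each unmovable page sits in a different 2 MiB range, so
 * every such page makes one whole range unusable for a THP.
 */
static unsigned long thp_ranges_lost_scattered(unsigned long unmovable_pages,
                                               unsigned long total_ranges)
{
    return unmovable_pages < total_ranges ? unmovable_pages : total_ranges;
}

/*
 * Pooled case: unmovable pages are packed into a contiguous region, so
 * only ceil(unmovable_pages / pages_per_range) ranges are affected.
 */
static unsigned long thp_ranges_lost_pooled(unsigned long unmovable_pages,
                                            unsigned long pages_per_range)
{
    assert(pages_per_range > 0);
    return (unmovable_pages + pages_per_range - 1) / pages_per_range;
}
```

Under these assumptions, a machine with 16 GiB of memory has 8192 such ranges. If 1024 pages (4 MiB) are stuck under writeback, the scattered bound is 1024 ranges, or 2 GiB that cannot back a THP, while the pooled bound is 2 ranges. That gap is the motivation for the question about constraining where writeback pages live.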
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2024-12-19 16:14 ` Zi Yan @ 2024-12-19 16:26 ` Shakeel Butt 2024-12-19 16:31 ` David Hildenbrand 0 siblings, 1 reply; 124+ messages in thread From: Shakeel Butt @ 2024-12-19 16:26 UTC (permalink / raw) To: Zi Yan Cc: Bernd Schubert, David Hildenbrand, Joanne Koong, miklos, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On Thu, Dec 19, 2024 at 11:14:49AM -0500, Zi Yan wrote: > On 19 Dec 2024, at 11:09, Bernd Schubert wrote: > > > On 12/19/24 17:02, Zi Yan wrote: > >> On 19 Dec 2024, at 11:00, Zi Yan wrote: > >>> On 19 Dec 2024, at 10:56, Bernd Schubert wrote: > >>> > >>>> On 12/19/24 16:55, Zi Yan wrote: > >>>>> On 19 Dec 2024, at 10:53, Shakeel Butt wrote: > >>>>> > >>>>>> On Thu, Dec 19, 2024 at 04:47:18PM +0100, David Hildenbrand wrote: > >>>>>>> On 19.12.24 16:43, Shakeel Butt wrote: > >>>>>>>> On Thu, Dec 19, 2024 at 02:05:04PM +0100, David Hildenbrand wrote: > >>>>>>>>> On 23.11.24 00:23, Joanne Koong wrote: > >>>>>>>>>> For migrations called in MIGRATE_SYNC mode, skip migrating the folio if > >>>>>>>>>> it is under writeback and has the AS_WRITEBACK_INDETERMINATE flag set on its > >>>>>>>>>> mapping. If the AS_WRITEBACK_INDETERMINATE flag is set on the mapping, the > >>>>>>>>>> writeback may take an indeterminate amount of time to complete, and > >>>>>>>>>> waits may get stuck. 
> >>>>>>>>>> > >>>>>>>>>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com> > >>>>>>>>>> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> > >>>>>>>>>> --- > >>>>>>>>>> mm/migrate.c | 5 ++++- > >>>>>>>>>> 1 file changed, 4 insertions(+), 1 deletion(-) > >>>>>>>>>> > >>>>>>>>>> diff --git a/mm/migrate.c b/mm/migrate.c > >>>>>>>>>> index df91248755e4..fe73284e5246 100644 > >>>>>>>>>> --- a/mm/migrate.c > >>>>>>>>>> +++ b/mm/migrate.c > >>>>>>>>>> @@ -1260,7 +1260,10 @@ static int migrate_folio_unmap(new_folio_t get_new_folio, > >>>>>>>>>> */ > >>>>>>>>>> switch (mode) { > >>>>>>>>>> case MIGRATE_SYNC: > >>>>>>>>>> - break; > >>>>>>>>>> + if (!src->mapping || > >>>>>>>>>> + !mapping_writeback_indeterminate(src->mapping)) > >>>>>>>>>> + break; > >>>>>>>>>> + fallthrough; > >>>>>>>>>> default: > >>>>>>>>>> rc = -EBUSY; > >>>>>>>>>> goto out; > >>>>>>>>> > >>>>>>>>> Ehm, doesn't this mean that any fuse user can essentially completely block > >>>>>>>>> CMA allocations, memory compaction, memory hotunplug, memory poisoning... ?! > >>>>>>>>> > >>>>>>>>> That sounds very bad. > >>>>>>>> > >>>>>>>> The page under writeback are already unmovable while they are under > >>>>>>>> writeback. This patch is only making potentially unrelated tasks to > >>>>>>>> synchronously wait on writeback completion for such pages which in worst > >>>>>>>> case can be indefinite. This actually is solving an isolation issue on a > >>>>>>>> multi-tenant machine. > >>>>>>>> > >>>>>>> Are you sure, because I read in the cover letter: > >>>>>>> > >>>>>>> "In the current FUSE writeback design (see commit 3be5a52b30aa ("fuse: > >>>>>>> support writable mmap"))), a temp page is allocated for every dirty > >>>>>>> page to be written back, the contents of the dirty page are copied over to > >>>>>>> the temp page, and the temp page gets handed to the server to write back. 
> >>>>>>> This is done so that writeback may be immediately cleared on the dirty > >>>>>>> page," > >>>>>>> > >>>>>>> Which to me means that they are immediately movable again? > >>>>>> > >>>>>> Oh sorry, my mistake, yes this will become an isolation issue with the > >>>>>> removal of the temp page in-between which this series is doing. I think > >>>>>> the tradeoff is between extra memory plus slow write performance versus > >>>>>> temporary unmovable memory. > >>>>> > >>>>> No, the tradeoff is slow FUSE performance vs whole system slowdown due to > >>>>> memory fragmentation. AS_WRITEBACK_INDETERMINATE indicates it is not > >>>>> temporary. > >>>> > >>>> Is there is a difference between FUSE TMP page being unmovable and > >>>> AS_WRITEBACK_INDETERMINATE folios/pages being unmovable? > >> > >> (Fix my response location) > >> > >> Both are unmovable, but you can control where FUSE TMP page > >> can come from to avoid spread across the entire memory space. For example, > >> allocate a contiguous region as a TMP page pool. > > > > Wouldn't it make sense to have that for fuse writeback pages as well? > > Fuse tries to limit dirty pages anyway. > > Can fuse constraint the location of writeback pages? Something like what > I proposed[1], migrating pages to a location before their writeback? Will > that be a performance concern? > > In terms of the number of dirty pages, you only need one page out of 512 > pages to prevent 2MB THP from allocation. For CMA allocation, one unmovable > page can kill one contiguous range. What is the limit of fuse dirty pages? > > [1] https://lore.kernel.org/linux-mm/90C41581-179F-40B6-9801-9C9DBBEB1AF4@nvidia.com/ I think this whole concern of fuse making system memory unmovable forever is overblown. Fuse is already using a temp (unmovable) page for the writeback and is slow and is being removed in this series. ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2024-12-19 16:26 ` Shakeel Butt @ 2024-12-19 16:31 ` David Hildenbrand 2024-12-19 16:53 ` Shakeel Butt 0 siblings, 1 reply; 124+ messages in thread From: David Hildenbrand @ 2024-12-19 16:31 UTC (permalink / raw) To: Shakeel Butt, Zi Yan Cc: Bernd Schubert, Joanne Koong, miklos, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On 19.12.24 17:26, Shakeel Butt wrote: > On Thu, Dec 19, 2024 at 11:14:49AM -0500, Zi Yan wrote: >> On 19 Dec 2024, at 11:09, Bernd Schubert wrote: >> >>> On 12/19/24 17:02, Zi Yan wrote: >>>> On 19 Dec 2024, at 11:00, Zi Yan wrote: >>>>> On 19 Dec 2024, at 10:56, Bernd Schubert wrote: >>>>> >>>>>> On 12/19/24 16:55, Zi Yan wrote: >>>>>>> On 19 Dec 2024, at 10:53, Shakeel Butt wrote: >>>>>>> >>>>>>>> On Thu, Dec 19, 2024 at 04:47:18PM +0100, David Hildenbrand wrote: >>>>>>>>> On 19.12.24 16:43, Shakeel Butt wrote: >>>>>>>>>> On Thu, Dec 19, 2024 at 02:05:04PM +0100, David Hildenbrand wrote: >>>>>>>>>>> On 23.11.24 00:23, Joanne Koong wrote: >>>>>>>>>>>> For migrations called in MIGRATE_SYNC mode, skip migrating the folio if >>>>>>>>>>>> it is under writeback and has the AS_WRITEBACK_INDETERMINATE flag set on its >>>>>>>>>>>> mapping. If the AS_WRITEBACK_INDETERMINATE flag is set on the mapping, the >>>>>>>>>>>> writeback may take an indeterminate amount of time to complete, and >>>>>>>>>>>> waits may get stuck. 
>>>>>>>>>>>> >>>>>>>>>>>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com> >>>>>>>>>>>> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> >>>>>>>>>>>> --- >>>>>>>>>>>> mm/migrate.c | 5 ++++- >>>>>>>>>>>> 1 file changed, 4 insertions(+), 1 deletion(-) >>>>>>>>>>>> >>>>>>>>>>>> diff --git a/mm/migrate.c b/mm/migrate.c >>>>>>>>>>>> index df91248755e4..fe73284e5246 100644 >>>>>>>>>>>> --- a/mm/migrate.c >>>>>>>>>>>> +++ b/mm/migrate.c >>>>>>>>>>>> @@ -1260,7 +1260,10 @@ static int migrate_folio_unmap(new_folio_t get_new_folio, >>>>>>>>>>>> */ >>>>>>>>>>>> switch (mode) { >>>>>>>>>>>> case MIGRATE_SYNC: >>>>>>>>>>>> - break; >>>>>>>>>>>> + if (!src->mapping || >>>>>>>>>>>> + !mapping_writeback_indeterminate(src->mapping)) >>>>>>>>>>>> + break; >>>>>>>>>>>> + fallthrough; >>>>>>>>>>>> default: >>>>>>>>>>>> rc = -EBUSY; >>>>>>>>>>>> goto out; >>>>>>>>>>> >>>>>>>>>>> Ehm, doesn't this mean that any fuse user can essentially completely block >>>>>>>>>>> CMA allocations, memory compaction, memory hotunplug, memory poisoning... ?! >>>>>>>>>>> >>>>>>>>>>> That sounds very bad. >>>>>>>>>> >>>>>>>>>> The page under writeback are already unmovable while they are under >>>>>>>>>> writeback. This patch is only making potentially unrelated tasks to >>>>>>>>>> synchronously wait on writeback completion for such pages which in worst >>>>>>>>>> case can be indefinite. This actually is solving an isolation issue on a >>>>>>>>>> multi-tenant machine. >>>>>>>>>> >>>>>>>>> Are you sure, because I read in the cover letter: >>>>>>>>> >>>>>>>>> "In the current FUSE writeback design (see commit 3be5a52b30aa ("fuse: >>>>>>>>> support writable mmap"))), a temp page is allocated for every dirty >>>>>>>>> page to be written back, the contents of the dirty page are copied over to >>>>>>>>> the temp page, and the temp page gets handed to the server to write back. 
>>>>>>>>> This is done so that writeback may be immediately cleared on the dirty >>>>>>>>> page," >>>>>>>>> >>>>>>>>> Which to me means that they are immediately movable again? >>>>>>>> >>>>>>>> Oh sorry, my mistake, yes this will become an isolation issue with the >>>>>>>> removal of the temp page in-between which this series is doing. I think >>>>>>>> the tradeoff is between extra memory plus slow write performance versus >>>>>>>> temporary unmovable memory. >>>>>>> >>>>>>> No, the tradeoff is slow FUSE performance vs whole system slowdown due to >>>>>>> memory fragmentation. AS_WRITEBACK_INDETERMINATE indicates it is not >>>>>>> temporary. >>>>>> >>>>>> Is there is a difference between FUSE TMP page being unmovable and >>>>>> AS_WRITEBACK_INDETERMINATE folios/pages being unmovable? >>>> >>>> (Fix my response location) >>>> >>>> Both are unmovable, but you can control where FUSE TMP page >>>> can come from to avoid spread across the entire memory space. For example, >>>> allocate a contiguous region as a TMP page pool. >>> >>> Wouldn't it make sense to have that for fuse writeback pages as well? >>> Fuse tries to limit dirty pages anyway. >> >> Can fuse constraint the location of writeback pages? Something like what >> I proposed[1], migrating pages to a location before their writeback? Will >> that be a performance concern? >> >> In terms of the number of dirty pages, you only need one page out of 512 >> pages to prevent 2MB THP from allocation. For CMA allocation, one unmovable >> page can kill one contiguous range. What is the limit of fuse dirty pages? >> >> [1] https://lore.kernel.org/linux-mm/90C41581-179F-40B6-9801-9C9DBBEB1AF4@nvidia.com/ > > I think this whole concern of fuse making system memory unmovable > forever is overblown. Fuse is already using a temp (unmovable) page Right, and we allocated in a way that we expect it to not be movable (e.g., not on ZONE_MOVABLE, usually in a UNMOVABLE pageblock etc). 
Another question: what effect does this change have on
folio_wait_writeback() users such as arch/s390/kernel/uv.c or
shrink_folio_list()?

--
Cheers,

David / dhildenb
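The "one page out of 512" arithmetic quoted above, and the argument for taking unmovable pages from a contiguous pool, can be made concrete with a small userspace sketch. It assumes 4 KiB base pages and 2 MiB THP-sized ranges; the helper name blocked_thp_ranges() and the sample page frame numbers are made up for illustration, and none of this is kernel code.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 4 KiB base pages, so one 2 MiB range covers 512 of them. */
#define PAGES_PER_THP 512UL

/*
 * Hypothetical helper: given the page frame numbers (PFNs) of unmovable
 * pages, sorted in ascending order, count how many 2 MiB-aligned ranges
 * contain at least one of them and so cannot be used for a 2 MiB THP.
 */
static size_t blocked_thp_ranges(const unsigned long *pfns, size_t n)
{
	size_t blocked = 0;
	unsigned long last_range = 0;

	for (size_t i = 0; i < n; i++) {
		unsigned long range = pfns[i] / PAGES_PER_THP;

		/* The counting below relies on sorted input. */
		assert(i == 0 || pfns[i] >= pfns[i - 1]);

		/* A new range index means one more blocked range. */
		if (i == 0 || range != last_range) {
			blocked++;
			last_range = range;
		}
	}
	return blocked;
}
```

With four unmovable pages, a pooled layout (PFNs 1024 to 1027) blocks a single 2 MiB range, while a scattered layout (PFNs 5, 600, 1030, 2000) blocks four.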
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
From: Shakeel Butt @ 2024-12-19 16:53 UTC
To: David Hildenbrand
Cc: Zi Yan, Bernd Schubert, Joanne Koong, miklos, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko

On Thu, Dec 19, 2024 at 05:31:14PM +0100, David Hildenbrand wrote:
[...]
> > I think this whole concern of fuse making system memory unmovable
> > forever is overblown. Fuse is already using a temp (unmovable) page
>
> Right, and we allocated in a way that we expect it to not be movable (e.g.,
> not on ZONE_MOVABLE, usually in a UNMOVABLE pageblock etc).
>
> As another question, which effect does this change here have on
> folio_wait_writeback() users like arch/s390/kernel/uv.c or
> shrink_folio_list()?
>

shrink_folio_list() is handled in the second patch [1] of this series. To
summarize, only memcg-v1, which does not have sane dirty throttling, can be
impacted and needs a change. For arch/s390/kernel/uv.c, I don't think this
series changes anything. For sane fuse folios, things should be fine.

[1] https://lore.kernel.org/linux-mm/CAJnrk1bXDkwExR=ztnidX4DAvVD5wZZemEVNt9bg=tkwWAg6fw@mail.gmail.com/T/#m02461fb4fb73849900e811d695deee0706c370f9
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2024-12-19 15:55 ` Zi Yan 2024-12-19 15:56 ` Bernd Schubert @ 2024-12-19 16:22 ` Shakeel Butt 2024-12-19 16:29 ` David Hildenbrand 1 sibling, 1 reply; 124+ messages in thread From: Shakeel Butt @ 2024-12-19 16:22 UTC (permalink / raw) To: Zi Yan Cc: David Hildenbrand, Joanne Koong, miklos, linux-fsdevel, jefflexu, josef, bernd.schubert, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On Thu, Dec 19, 2024 at 10:55:10AM -0500, Zi Yan wrote: > On 19 Dec 2024, at 10:53, Shakeel Butt wrote: > > > On Thu, Dec 19, 2024 at 04:47:18PM +0100, David Hildenbrand wrote: > >> On 19.12.24 16:43, Shakeel Butt wrote: > >>> On Thu, Dec 19, 2024 at 02:05:04PM +0100, David Hildenbrand wrote: > >>>> On 23.11.24 00:23, Joanne Koong wrote: > >>>>> For migrations called in MIGRATE_SYNC mode, skip migrating the folio if > >>>>> it is under writeback and has the AS_WRITEBACK_INDETERMINATE flag set on its > >>>>> mapping. If the AS_WRITEBACK_INDETERMINATE flag is set on the mapping, the > >>>>> writeback may take an indeterminate amount of time to complete, and > >>>>> waits may get stuck. 
> >>>>> > >>>>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com> > >>>>> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> > >>>>> --- > >>>>> mm/migrate.c | 5 ++++- > >>>>> 1 file changed, 4 insertions(+), 1 deletion(-) > >>>>> > >>>>> diff --git a/mm/migrate.c b/mm/migrate.c > >>>>> index df91248755e4..fe73284e5246 100644 > >>>>> --- a/mm/migrate.c > >>>>> +++ b/mm/migrate.c > >>>>> @@ -1260,7 +1260,10 @@ static int migrate_folio_unmap(new_folio_t get_new_folio, > >>>>> */ > >>>>> switch (mode) { > >>>>> case MIGRATE_SYNC: > >>>>> - break; > >>>>> + if (!src->mapping || > >>>>> + !mapping_writeback_indeterminate(src->mapping)) > >>>>> + break; > >>>>> + fallthrough; > >>>>> default: > >>>>> rc = -EBUSY; > >>>>> goto out; > >>>> > >>>> Ehm, doesn't this mean that any fuse user can essentially completely block > >>>> CMA allocations, memory compaction, memory hotunplug, memory poisoning... ?! > >>>> > >>>> That sounds very bad. > >>> > >>> The page under writeback are already unmovable while they are under > >>> writeback. This patch is only making potentially unrelated tasks to > >>> synchronously wait on writeback completion for such pages which in worst > >>> case can be indefinite. This actually is solving an isolation issue on a > >>> multi-tenant machine. > >>> > >> Are you sure, because I read in the cover letter: > >> > >> "In the current FUSE writeback design (see commit 3be5a52b30aa ("fuse: > >> support writable mmap"))), a temp page is allocated for every dirty > >> page to be written back, the contents of the dirty page are copied over to > >> the temp page, and the temp page gets handed to the server to write back. > >> This is done so that writeback may be immediately cleared on the dirty > >> page," > >> > >> Which to me means that they are immediately movable again? > > > > Oh sorry, my mistake, yes this will become an isolation issue with the > > removal of the temp page in-between which this series is doing. 
I think > > the tradeoff is between extra memory plus slow write performance versus > > temporary unmovable memory. > > No, the tradeoff is slow FUSE performance vs whole system slowdown due to > memory fragmentation. AS_WRITEBACK_INDETERMINATE indicates it is not > temporary.

If you check the code just above this patch, this
mapping_writeback_indeterminate() check only happens for pages under
writeback, which is a temporary state. Anyway, fuse folios should not be
unmovable for their lifetime but only while under writeback, which is the
same for all filesystems.
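The gating described above can be modeled in userspace. The following is a minimal sketch under stated assumptions, not the kernel code: struct folio_model, the MODEL_* mode names and writeback_migrate_decision() are hypothetical stand-ins, and every other condition in the real migrate_folio_unmap() is left out. It only mirrors the switch from the quoted diff, nested inside an "is the folio under writeback" test.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's migrate modes. */
enum migrate_mode_model {
	MODEL_MIGRATE_ASYNC,
	MODEL_MIGRATE_SYNC_LIGHT,
	MODEL_MIGRATE_SYNC,
};

/* Hypothetical model of the folio state the quoted diff looks at. */
struct folio_model {
	bool under_writeback;         /* folio is currently under writeback */
	bool has_mapping;             /* models src->mapping != NULL */
	bool writeback_indeterminate; /* models mapping_writeback_indeterminate() */
};

/*
 * Returns 0 if migration may go ahead, with *must_wait set when the caller
 * would first wait for writeback to finish, or -EBUSY if the folio is
 * skipped.
 */
static int writeback_migrate_decision(const struct folio_model *src,
				      enum migrate_mode_model mode,
				      bool *must_wait)
{
	assert(src && must_wait);
	*must_wait = false;

	/* The mode switch is only reached for folios under writeback. */
	if (!src->under_writeback)
		return 0;

	switch (mode) {
	case MODEL_MIGRATE_SYNC:
		/* Ordinary mappings: leave the switch and wait below. */
		if (!src->has_mapping || !src->writeback_indeterminate)
			break;
		/* fall through */
	default:
		/* Indeterminate writeback, or a non-SYNC mode: skip. */
		return -EBUSY;
	}

	*must_wait = true;
	return 0;
}
```

In this model a folio that is not under writeback is never affected by the flag; the flag only turns "wait for writeback" into "skip with -EBUSY" for synchronous migration.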
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2024-12-19 16:22 ` Shakeel Butt @ 2024-12-19 16:29 ` David Hildenbrand 2024-12-19 16:40 ` Shakeel Butt 0 siblings, 1 reply; 124+ messages in thread From: David Hildenbrand @ 2024-12-19 16:29 UTC (permalink / raw) To: Shakeel Butt, Zi Yan Cc: Joanne Koong, miklos, linux-fsdevel, jefflexu, josef, bernd.schubert, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On 19.12.24 17:22, Shakeel Butt wrote: > On Thu, Dec 19, 2024 at 10:55:10AM -0500, Zi Yan wrote: >> On 19 Dec 2024, at 10:53, Shakeel Butt wrote: >> >>> On Thu, Dec 19, 2024 at 04:47:18PM +0100, David Hildenbrand wrote: >>>> On 19.12.24 16:43, Shakeel Butt wrote: >>>>> On Thu, Dec 19, 2024 at 02:05:04PM +0100, David Hildenbrand wrote: >>>>>> On 23.11.24 00:23, Joanne Koong wrote: >>>>>>> For migrations called in MIGRATE_SYNC mode, skip migrating the folio if >>>>>>> it is under writeback and has the AS_WRITEBACK_INDETERMINATE flag set on its >>>>>>> mapping. If the AS_WRITEBACK_INDETERMINATE flag is set on the mapping, the >>>>>>> writeback may take an indeterminate amount of time to complete, and >>>>>>> waits may get stuck. 
>>>>>>> >>>>>>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com> >>>>>>> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> >>>>>>> --- >>>>>>> mm/migrate.c | 5 ++++- >>>>>>> 1 file changed, 4 insertions(+), 1 deletion(-) >>>>>>> >>>>>>> diff --git a/mm/migrate.c b/mm/migrate.c >>>>>>> index df91248755e4..fe73284e5246 100644 >>>>>>> --- a/mm/migrate.c >>>>>>> +++ b/mm/migrate.c >>>>>>> @@ -1260,7 +1260,10 @@ static int migrate_folio_unmap(new_folio_t get_new_folio, >>>>>>> */ >>>>>>> switch (mode) { >>>>>>> case MIGRATE_SYNC: >>>>>>> - break; >>>>>>> + if (!src->mapping || >>>>>>> + !mapping_writeback_indeterminate(src->mapping)) >>>>>>> + break; >>>>>>> + fallthrough; >>>>>>> default: >>>>>>> rc = -EBUSY; >>>>>>> goto out; >>>>>> >>>>>> Ehm, doesn't this mean that any fuse user can essentially completely block >>>>>> CMA allocations, memory compaction, memory hotunplug, memory poisoning... ?! >>>>>> >>>>>> That sounds very bad. >>>>> >>>>> The page under writeback are already unmovable while they are under >>>>> writeback. This patch is only making potentially unrelated tasks to >>>>> synchronously wait on writeback completion for such pages which in worst >>>>> case can be indefinite. This actually is solving an isolation issue on a >>>>> multi-tenant machine. >>>>> >>>> Are you sure, because I read in the cover letter: >>>> >>>> "In the current FUSE writeback design (see commit 3be5a52b30aa ("fuse: >>>> support writable mmap"))), a temp page is allocated for every dirty >>>> page to be written back, the contents of the dirty page are copied over to >>>> the temp page, and the temp page gets handed to the server to write back. >>>> This is done so that writeback may be immediately cleared on the dirty >>>> page," >>>> >>>> Which to me means that they are immediately movable again? >>> >>> Oh sorry, my mistake, yes this will become an isolation issue with the >>> removal of the temp page in-between which this series is doing. 
I think >>> the tradeoff is between extra memory plus slow write performance versus >>> temporary unmovable memory. >> >> No, the tradeoff is slow FUSE performance vs whole system slowdown due to >> memory fragmentation. AS_WRITEBACK_INDETERMINATE indicates it is not >> temporary. > > If you check the code just above this patch, this > mapping_writeback_indeterminate() check only happen for pages under > writeback which is a temp state. Anyways, fuse folios should not be > unmovable for their lifetime but only while under writeback which is > same for all fs.

But there, writeback is expected to be a temporary thing, not possibly
"AS_WRITEBACK_INDETERMINATE"; that is a BIG difference.

I'll have to NACK anything that violates ZONE_MOVABLE / ALLOC_CMA
guarantees, and unfortunately, it sounds like this is the case here, unless
I am missing something important.

--
Cheers,

David / dhildenb
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
From: Shakeel Butt @ 2024-12-19 16:40 UTC
To: David Hildenbrand
Cc: Zi Yan, Joanne Koong, miklos, linux-fsdevel, jefflexu, josef, bernd.schubert, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko

On Thu, Dec 19, 2024 at 05:29:08PM +0100, David Hildenbrand wrote:
[...]
> >
> > If you check the code just above this patch, this
> > mapping_writeback_indeterminate() check only happen for pages under
> > writeback which is a temp state. Anyways, fuse folios should not be
> > unmovable for their lifetime but only while under writeback which is
> > same for all fs.
>
> But there, writeback is expected to be a temporary thing, not possibly:
> "AS_WRITEBACK_INDETERMINATE", that is a BIG difference.
>
> I'll have to NACK anything that violates ZONE_MOVABLE / ALLOC_CMA
> guarantees, and unfortunately, it sounds like this is the case here, unless
> I am missing something important.
>

It might just be that the name "AS_WRITEBACK_INDETERMINATE" is causing the
confusion. The writeback state is not indefinite. A proper fuse fs, like
any other fs, should handle writeback pages appropriately. These additional
checks and skips are for (I think) untrusted fuse servers. Personally, I
think waiting indefinitely on writeback, particularly for sync compaction,
should be fine, but the fuse maintainers want to avoid scenarios where an
untrusted fuse server can force such stalls in other jobs. Yes, this will
not solve the issue of an untrusted fuse server causing fragmentation, but
that is the risk of running an untrusted fuse server, IMHO.
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
From: David Hildenbrand @ 2024-12-19 16:41 UTC
To: Shakeel Butt
Cc: Zi Yan, Joanne Koong, miklos, linux-fsdevel, jefflexu, josef, bernd.schubert, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko

On 19.12.24 17:40, Shakeel Butt wrote:
> On Thu, Dec 19, 2024 at 05:29:08PM +0100, David Hildenbrand wrote:
> [...]
>>>
>>> If you check the code just above this patch, this
>>> mapping_writeback_indeterminate() check only happen for pages under
>>> writeback which is a temp state. Anyways, fuse folios should not be
>>> unmovable for their lifetime but only while under writeback which is
>>> same for all fs.
>>
>> But there, writeback is expected to be a temporary thing, not possibly:
>> "AS_WRITEBACK_INDETERMINATE", that is a BIG difference.
>>
>> I'll have to NACK anything that violates ZONE_MOVABLE / ALLOC_CMA
>> guarantees, and unfortunately, it sounds like this is the case here, unless
>> I am missing something important.
>>
>
> It might just be the name "AS_WRITEBACK_INDETERMINATE" is causing
> the confusion. The writeback state is not indefinite. A proper fuse fs,
> like anyother fs, should handle writeback pages appropriately. These
> additional checks and skips are for (I think) untrusted fuse servers.

Can unprivileged user space provoke this case?

--
Cheers,

David / dhildenb
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
From: Shakeel Butt @ 2024-12-19 17:14 UTC
To: David Hildenbrand
Cc: Zi Yan, Joanne Koong, miklos, linux-fsdevel, jefflexu, josef, bernd.schubert, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko

On Thu, Dec 19, 2024 at 05:41:36PM +0100, David Hildenbrand wrote:
> On 19.12.24 17:40, Shakeel Butt wrote:
> > On Thu, Dec 19, 2024 at 05:29:08PM +0100, David Hildenbrand wrote:
> > [...]
> > > >
> > > > If you check the code just above this patch, this
> > > > mapping_writeback_indeterminate() check only happen for pages under
> > > > writeback which is a temp state. Anyways, fuse folios should not be
> > > > unmovable for their lifetime but only while under writeback which is
> > > > same for all fs.
> > >
> > > But there, writeback is expected to be a temporary thing, not possibly:
> > > "AS_WRITEBACK_INDETERMINATE", that is a BIG difference.
> > >
> > > I'll have to NACK anything that violates ZONE_MOVABLE / ALLOC_CMA
> > > guarantees, and unfortunately, it sounds like this is the case here, unless
> > > I am missing something important.
> > >
> >
> > It might just be the name "AS_WRITEBACK_INDETERMINATE" is causing
> > the confusion. The writeback state is not indefinite. A proper fuse fs,
> > like anyother fs, should handle writeback pages appropriately. These
> > additional checks and skips are for (I think) untrusted fuse servers.
>
> Can unprivileged user space provoke this case?

Let's ask Joanne and other fuse folks about the above question.

Let's say unprivileged user space can start an untrusted fuse server,
mount fuse, allocate and dirty a lot of fuse folios (within its dirty
and memcg limits) and trigger the writeback. To cause pain (through
fragmentation), it never clears the writeback state. Is this the
scenario you are envisioning?
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2024-12-19 17:14 ` Shakeel Butt @ 2024-12-19 17:26 ` David Hildenbrand 2024-12-19 17:30 ` Bernd Schubert 2024-12-19 17:55 ` Joanne Koong 0 siblings, 2 replies; 124+ messages in thread From: David Hildenbrand @ 2024-12-19 17:26 UTC (permalink / raw) To: Shakeel Butt Cc: Zi Yan, Joanne Koong, miklos, linux-fsdevel, jefflexu, josef, bernd.schubert, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On 19.12.24 18:14, Shakeel Butt wrote: > On Thu, Dec 19, 2024 at 05:41:36PM +0100, David Hildenbrand wrote: >> On 19.12.24 17:40, Shakeel Butt wrote: >>> On Thu, Dec 19, 2024 at 05:29:08PM +0100, David Hildenbrand wrote: >>> [...] >>>>> >>>>> If you check the code just above this patch, this >>>>> mapping_writeback_indeterminate() check only happen for pages under >>>>> writeback which is a temp state. Anyways, fuse folios should not be >>>>> unmovable for their lifetime but only while under writeback which is >>>>> same for all fs. >>>> >>>> But there, writeback is expected to be a temporary thing, not possibly: >>>> "AS_WRITEBACK_INDETERMINATE", that is a BIG difference. >>>> >>>> I'll have to NACK anything that violates ZONE_MOVABLE / ALLOC_CMA >>>> guarantees, and unfortunately, it sounds like this is the case here, unless >>>> I am missing something important. >>>> >>> >>> It might just be the name "AS_WRITEBACK_INDETERMINATE" is causing >>> the confusion. The writeback state is not indefinite. A proper fuse fs, >>> like anyother fs, should handle writeback pages appropriately. These >>> additional checks and skips are for (I think) untrusted fuse servers. >> >> Can unprivileged user space provoke this case? > > Let's ask Joanne and other fuse folks about the above question. 
>
> Let's say unprivileged user space can start a untrusted fuse server,
> mount fuse, allocate and dirty a lot of fuse folios (within its dirty
> and memcg limits) and trigger the writeback. To cause pain (through
> fragmentation), it is not clearing the writeback state. Is this the
> scenario you are envisioning?

Yes, for example causing harm on a shared host (containers, ...).

If it cannot happen, we should make it very clear in documentation and
patch descriptions that it can only cause harm with privileged user
space, and that this harm can make things like CMA allocations, memory
onplug, ... fail, which is rather bad and against concepts like
ZONE_MOVABLE/MIGRATE_CMA.

Although I wonder what would happen if the privileged user space daemon
crashes (e.g., OOM killer?) and simply no longer replies to any messages.

--
Cheers,

David / dhildenb
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2024-12-19 17:26 ` David Hildenbrand @ 2024-12-19 17:30 ` Bernd Schubert 2024-12-19 17:37 ` Shakeel Butt 2024-12-19 17:55 ` Joanne Koong 1 sibling, 1 reply; 124+ messages in thread From: Bernd Schubert @ 2024-12-19 17:30 UTC (permalink / raw) To: David Hildenbrand, Shakeel Butt Cc: Zi Yan, Joanne Koong, miklos, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On 12/19/24 18:26, David Hildenbrand wrote: > On 19.12.24 18:14, Shakeel Butt wrote: >> On Thu, Dec 19, 2024 at 05:41:36PM +0100, David Hildenbrand wrote: >>> On 19.12.24 17:40, Shakeel Butt wrote: >>>> On Thu, Dec 19, 2024 at 05:29:08PM +0100, David Hildenbrand wrote: >>>> [...] >>>>>> >>>>>> If you check the code just above this patch, this >>>>>> mapping_writeback_indeterminate() check only happen for pages under >>>>>> writeback which is a temp state. Anyways, fuse folios should not be >>>>>> unmovable for their lifetime but only while under writeback which is >>>>>> same for all fs. >>>>> >>>>> But there, writeback is expected to be a temporary thing, not >>>>> possibly: >>>>> "AS_WRITEBACK_INDETERMINATE", that is a BIG difference. >>>>> >>>>> I'll have to NACK anything that violates ZONE_MOVABLE / ALLOC_CMA >>>>> guarantees, and unfortunately, it sounds like this is the case >>>>> here, unless >>>>> I am missing something important. >>>>> >>>> >>>> It might just be the name "AS_WRITEBACK_INDETERMINATE" is causing >>>> the confusion. The writeback state is not indefinite. A proper fuse fs, >>>> like anyother fs, should handle writeback pages appropriately. These >>>> additional checks and skips are for (I think) untrusted fuse servers. >>> >>> Can unprivileged user space provoke this case? >> >> Let's ask Joanne and other fuse folks about the above question. 
>>
>> Let's say unprivileged user space can start a untrusted fuse server,
>> mount fuse, allocate and dirty a lot of fuse folios (within its dirty
>> and memcg limits) and trigger the writeback. To cause pain (through
>> fragmentation), it is not clearing the writeback state. Is this the
>> scenario you are envisioning?
>
> Yes, for example causing harm on a shared host (containers, ...).
>
> If it cannot happen, we should make it very clear in documentation and
> patch descriptions that it can only cause harm with privileged user
> space, and that this harm can make things like CMA allocations, memory
> onplug, ... fail, which is rather bad and against concepts like
> ZONE_MOVABLE/MIGRATE_CMA.
>
> Although I wonder what would happen if the privileged user space daemon
> crashes (e.g., OOM killer?) and simply no longer replies to any messages.
>

The request is canceled then; that should clear the page/folio state.

I am starting to wonder if we should introduce really short fuse request
timeouts and just repeat requests when things have cleared up. At least
for write-back requests (in the sense that fuse-over-network might
be slow or interrupted for some time).

Thanks,
Bernd
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2024-12-19 17:30 ` Bernd Schubert @ 2024-12-19 17:37 ` Shakeel Butt 2024-12-19 17:40 ` Bernd Schubert 2024-12-19 17:44 ` Joanne Koong 0 siblings, 2 replies; 124+ messages in thread From: Shakeel Butt @ 2024-12-19 17:37 UTC (permalink / raw) To: Bernd Schubert Cc: David Hildenbrand, Zi Yan, Joanne Koong, miklos, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On Thu, Dec 19, 2024 at 06:30:34PM +0100, Bernd Schubert wrote: > > > On 12/19/24 18:26, David Hildenbrand wrote: > > On 19.12.24 18:14, Shakeel Butt wrote: > >> On Thu, Dec 19, 2024 at 05:41:36PM +0100, David Hildenbrand wrote: > >>> On 19.12.24 17:40, Shakeel Butt wrote: > >>>> On Thu, Dec 19, 2024 at 05:29:08PM +0100, David Hildenbrand wrote: > >>>> [...] > >>>>>> > >>>>>> If you check the code just above this patch, this > >>>>>> mapping_writeback_indeterminate() check only happen for pages under > >>>>>> writeback which is a temp state. Anyways, fuse folios should not be > >>>>>> unmovable for their lifetime but only while under writeback which is > >>>>>> same for all fs. > >>>>> > >>>>> But there, writeback is expected to be a temporary thing, not > >>>>> possibly: > >>>>> "AS_WRITEBACK_INDETERMINATE", that is a BIG difference. > >>>>> > >>>>> I'll have to NACK anything that violates ZONE_MOVABLE / ALLOC_CMA > >>>>> guarantees, and unfortunately, it sounds like this is the case > >>>>> here, unless > >>>>> I am missing something important. > >>>>> > >>>> > >>>> It might just be the name "AS_WRITEBACK_INDETERMINATE" is causing > >>>> the confusion. The writeback state is not indefinite. A proper fuse fs, > >>>> like anyother fs, should handle writeback pages appropriately. These > >>>> additional checks and skips are for (I think) untrusted fuse servers. > >>> > >>> Can unprivileged user space provoke this case? 
> >>
> >> Let's ask Joanne and other fuse folks about the above question.
> >>
> >> Let's say unprivileged user space can start a untrusted fuse server,
> >> mount fuse, allocate and dirty a lot of fuse folios (within its dirty
> >> and memcg limits) and trigger the writeback. To cause pain (through
> >> fragmentation), it is not clearing the writeback state. Is this the
> >> scenario you are envisioning?
> >
> > Yes, for example causing harm on a shared host (containers, ...).
> >
> > If it cannot happen, we should make it very clear in documentation and
> > patch descriptions that it can only cause harm with privileged user
> > space, and that this harm can make things like CMA allocations, memory
> > onplug, ... fail, which is rather bad and against concepts like
> > ZONE_MOVABLE/MIGRATE_CMA.
> >
> > Although I wonder what would happen if the privileged user space daemon
> > crashes (e.g., OOM killer?) and simply no longer replies to any messages.
> >
>
>
> The request is canceled then - that should clear the page/folio state
>
>
> I start to wonder if we should introduce really short fuse request
> timeouts and just repeat requests when things have cleared up. At least
> for write-back requests (in the sense that fuse-over-network might
> be slow or interrupted for some time).
>
>

Thanks, Bernd, for the response. Can you tell a bit more about the request
timeouts? Basically, do they impact/clear the page/folio state as well?
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2024-12-19 17:37 ` Shakeel Butt @ 2024-12-19 17:40 ` Bernd Schubert 2024-12-19 17:44 ` Joanne Koong 1 sibling, 0 replies; 124+ messages in thread From: Bernd Schubert @ 2024-12-19 17:40 UTC (permalink / raw) To: Shakeel Butt Cc: David Hildenbrand, Zi Yan, Joanne Koong, miklos, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On 12/19/24 18:37, Shakeel Butt wrote: > On Thu, Dec 19, 2024 at 06:30:34PM +0100, Bernd Schubert wrote: >> >> >> On 12/19/24 18:26, David Hildenbrand wrote: >>> On 19.12.24 18:14, Shakeel Butt wrote: >>>> On Thu, Dec 19, 2024 at 05:41:36PM +0100, David Hildenbrand wrote: >>>>> On 19.12.24 17:40, Shakeel Butt wrote: >>>>>> On Thu, Dec 19, 2024 at 05:29:08PM +0100, David Hildenbrand wrote: >>>>>> [...] >>>>>>>> >>>>>>>> If you check the code just above this patch, this >>>>>>>> mapping_writeback_indeterminate() check only happen for pages under >>>>>>>> writeback which is a temp state. Anyways, fuse folios should not be >>>>>>>> unmovable for their lifetime but only while under writeback which is >>>>>>>> same for all fs. >>>>>>> >>>>>>> But there, writeback is expected to be a temporary thing, not >>>>>>> possibly: >>>>>>> "AS_WRITEBACK_INDETERMINATE", that is a BIG difference. >>>>>>> >>>>>>> I'll have to NACK anything that violates ZONE_MOVABLE / ALLOC_CMA >>>>>>> guarantees, and unfortunately, it sounds like this is the case >>>>>>> here, unless >>>>>>> I am missing something important. >>>>>>> >>>>>> >>>>>> It might just be the name "AS_WRITEBACK_INDETERMINATE" is causing >>>>>> the confusion. The writeback state is not indefinite. A proper fuse fs, >>>>>> like anyother fs, should handle writeback pages appropriately. These >>>>>> additional checks and skips are for (I think) untrusted fuse servers. >>>>> >>>>> Can unprivileged user space provoke this case? 
>>>> >>>> Let's ask Joanne and other fuse folks about the above question. >>>> >>>> Let's say unprivileged user space can start a untrusted fuse server, >>>> mount fuse, allocate and dirty a lot of fuse folios (within its dirty >>>> and memcg limits) and trigger the writeback. To cause pain (through >>>> fragmentation), it is not clearing the writeback state. Is this the >>>> scenario you are envisioning? >>> >>> Yes, for example causing harm on a shared host (containers, ...). >>> >>> If it cannot happen, we should make it very clear in documentation and >>> patch descriptions that it can only cause harm with privileged user >>> space, and that this harm can make things like CMA allocations, memory >>> onplug, ... fail, which is rather bad and against concepts like >>> ZONE_MOVABLE/MIGRATE_CMA. >>> >>> Although I wonder what would happen if the privileged user space daemon >>> crashes (e.g., OOM killer?) and simply no longer replies to any messages. >>> >> >> The request is canceled then - that should clear the page/folio state >> >> >> I start to wonder if we should introduce really short fuse request >> timeouts and just repeat requests when things have cleared up. At least >> for write-back requests (in the sense that fuse-over-network might >> be slow or interrupted for some time). >> >> > > Thanks Bernd for the response. Can you tell a bit more about the request > timeouts? Basically does it impact/clear the page/folio state as well? That is just an idea, needs more discussion first. Just sent an off list message. ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2024-12-19 17:37 ` Shakeel Butt 2024-12-19 17:40 ` Bernd Schubert @ 2024-12-19 17:44 ` Joanne Koong 2024-12-19 17:54 ` Shakeel Butt 1 sibling, 1 reply; 124+ messages in thread From: Joanne Koong @ 2024-12-19 17:44 UTC (permalink / raw) To: Shakeel Butt Cc: Bernd Schubert, David Hildenbrand, Zi Yan, miklos, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On Thu, Dec 19, 2024 at 9:37 AM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > On Thu, Dec 19, 2024 at 06:30:34PM +0100, Bernd Schubert wrote: > > > > > > On 12/19/24 18:26, David Hildenbrand wrote: > > > On 19.12.24 18:14, Shakeel Butt wrote: > > >> On Thu, Dec 19, 2024 at 05:41:36PM +0100, David Hildenbrand wrote: > > >>> On 19.12.24 17:40, Shakeel Butt wrote: > > >>>> On Thu, Dec 19, 2024 at 05:29:08PM +0100, David Hildenbrand wrote: > > >>>> [...] > > >>>>>> > > >>>>>> If you check the code just above this patch, this > > >>>>>> mapping_writeback_indeterminate() check only happen for pages under > > >>>>>> writeback which is a temp state. Anyways, fuse folios should not be > > >>>>>> unmovable for their lifetime but only while under writeback which is > > >>>>>> same for all fs. > > >>>>> > > >>>>> But there, writeback is expected to be a temporary thing, not > > >>>>> possibly: > > >>>>> "AS_WRITEBACK_INDETERMINATE", that is a BIG difference. > > >>>>> > > >>>>> I'll have to NACK anything that violates ZONE_MOVABLE / ALLOC_CMA > > >>>>> guarantees, and unfortunately, it sounds like this is the case > > >>>>> here, unless > > >>>>> I am missing something important. > > >>>>> > > >>>> > > >>>> It might just be the name "AS_WRITEBACK_INDETERMINATE" is causing > > >>>> the confusion. The writeback state is not indefinite. A proper fuse fs, > > >>>> like anyother fs, should handle writeback pages appropriately. 
These > > >>>> additional checks and skips are for (I think) untrusted fuse servers. > > >>> > > >>> Can unprivileged user space provoke this case? > > >> > > >> Let's ask Joanne and other fuse folks about the above question. > > >> > > >> Let's say unprivileged user space can start a untrusted fuse server, > > >> mount fuse, allocate and dirty a lot of fuse folios (within its dirty > > >> and memcg limits) and trigger the writeback. To cause pain (through > > >> fragmentation), it is not clearing the writeback state. Is this the > > >> scenario you are envisioning? > > > > > > Yes, for example causing harm on a shared host (containers, ...). > > > > > > If it cannot happen, we should make it very clear in documentation and > > > patch descriptions that it can only cause harm with privileged user > > > space, and that this harm can make things like CMA allocations, memory > > > onplug, ... fail, which is rather bad and against concepts like > > > ZONE_MOVABLE/MIGRATE_CMA. > > > > > > Although I wonder what would happen if the privileged user space daemon > > > crashes (e.g., OOM killer?) and simply no longer replies to any messages. > > > > > > > The request is canceled then - that should clear the page/folio state > > > > > > I start to wonder if we should introduce really short fuse request > > timeouts and just repeat requests when things have cleared up. At least > > for write-back requests (in the sense that fuse-over-network might > > be slow or interrupted for some time). > > > > > > Thanks Bernd for the response. Can you tell a bit more about the request > timeouts? Basically does it impact/clear the page/folio state as well? Request timeouts can be set by admins system-wide to protect against malicious/buggy fuse servers that do not reply to requests by a certain amount of time. If the request times out, then the whole connection will be aborted, and pages/folios will be cleaned up accordingly. The corresponding patchset is here [1]. 
This helps mitigate the possibility of unprivileged buggy servers tying up writeback state by not replying. Thanks, Joanne [1] https://lore.kernel.org/linux-fsdevel/20241218222630.99920-1-joannelkoong@gmail.com/T/#t ^ permalink raw reply [flat|nested] 124+ messages in thread
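Editorial sketch of the timeout behavior described above: if any request misses its deadline, the whole connection is aborted and every folio still under writeback is released. This is a minimal userspace model, not the code from the referenced patchset; all `model_*` names, the tick-based deadline, and the flat request array are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
struct model_folio {
    bool writeback;             /* stands in for the folio writeback flag */
};

struct model_request {
    unsigned long deadline;     /* tick by which the server must reply */
    bool completed;             /* true once the request has ended */
    struct model_folio *folio;  /* folio being written back */
};

struct model_conn {
    bool aborted;
    struct model_request *reqs;
    size_t nr_reqs;
};

/* Abort the connection: every pending request ends and releases writeback. */
static void model_abort_conn(struct model_conn *conn)
{
    conn->aborted = true;
    for (size_t i = 0; i < conn->nr_reqs; i++) {
        struct model_request *req = &conn->reqs[i];

        if (!req->completed) {
            req->completed = true;
            req->folio->writeback = false;
        }
    }
}

/*
 * Returns true if the connection is aborted, either by this call
 * (a pending request passed its deadline) or by an earlier one.
 */
static bool model_check_timeouts(struct model_conn *conn, unsigned long now)
{
    assert(conn != NULL);

    if (conn->aborted)
        return true;

    for (size_t i = 0; i < conn->nr_reqs; i++) {
        const struct model_request *req = &conn->reqs[i];

        if (!req->completed && now > req->deadline) {
            model_abort_conn(conn);
            return true;
        }
    }
    return false;
}
```

Note that in this model one slow request takes down every other pending request on the connection, which is why a short timeout can hurt a well-behaved but slow (for example, networked) server.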
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2024-12-19 17:44 ` Joanne Koong @ 2024-12-19 17:54 ` Shakeel Butt 2024-12-20 11:44 ` David Hildenbrand 0 siblings, 1 reply; 124+ messages in thread From: Shakeel Butt @ 2024-12-19 17:54 UTC (permalink / raw) To: Joanne Koong Cc: Bernd Schubert, David Hildenbrand, Zi Yan, miklos, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On Thu, Dec 19, 2024 at 09:44:42AM -0800, Joanne Koong wrote: > On Thu, Dec 19, 2024 at 9:37 AM Shakeel Butt <shakeel.butt@linux.dev> wrote: [...] > > > > > > The request is canceled then - that should clear the page/folio state > > > > > > > > > I start to wonder if we should introduce really short fuse request > > > timeouts and just repeat requests when things have cleared up. At least > > > for write-back requests (in the sense that fuse-over-network might > > > be slow or interrupted for some time). > > > > > > > > > > Thanks Bernd for the response. Can you tell a bit more about the request > > timeouts? Basically does it impact/clear the page/folio state as well? > > Request timeouts can be set by admins system-wide to protect against > malicious/buggy fuse servers that do not reply to requests by a > certain amount of time. If the request times out, then the whole > connection will be aborted, and pages/folios will be cleaned up > accordingly. The corresponding patchset is here [1]. This helps > mitigate the possibility of unprivileged buggy servers tieing up > writeback state by not replying. > Thanks a lot Joanne and Bernd. David, do these timeouts resolve your concerns? ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2024-12-19 17:54 ` Shakeel Butt @ 2024-12-20 11:44 ` David Hildenbrand 2024-12-20 12:15 ` Bernd Schubert 0 siblings, 1 reply; 124+ messages in thread From: David Hildenbrand @ 2024-12-20 11:44 UTC (permalink / raw) To: Shakeel Butt, Joanne Koong Cc: Bernd Schubert, Zi Yan, miklos, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On 19.12.24 18:54, Shakeel Butt wrote: > On Thu, Dec 19, 2024 at 09:44:42AM -0800, Joanne Koong wrote: >> On Thu, Dec 19, 2024 at 9:37 AM Shakeel Butt <shakeel.butt@linux.dev> wrote: > [...] >>>> >>>> The request is canceled then - that should clear the page/folio state >>>> >>>> >>>> I start to wonder if we should introduce really short fuse request >>>> timeouts and just repeat requests when things have cleared up. At least >>>> for write-back requests (in the sense that fuse-over-network might >>>> be slow or interrupted for some time). >>>> >>>> >>> >>> Thanks Bernd for the response. Can you tell a bit more about the request >>> timeouts? Basically does it impact/clear the page/folio state as well? >> >> Request timeouts can be set by admins system-wide to protect against >> malicious/buggy fuse servers that do not reply to requests by a >> certain amount of time. If the request times out, then the whole >> connection will be aborted, and pages/folios will be cleaned up >> accordingly. The corresponding patchset is here [1]. This helps >> mitigate the possibility of unprivileged buggy servers tieing up >> writeback state by not replying. >> > > Thanks a lot Joanne and Bernd. > > David, does these timeouts resolve your concerns? Thanks for that information. Yes and no. :) Bernd wrote: "I start to wonder if we should introduce really short fuse request timeouts and just repeat requests when things have cleared up. 
At least for write-back requests (in the sense that fuse-over-network might be slow or interrupted for some time)." This indicates to me that, while timeouts might be supported soon (will there be a sane default?), even trusted implementations can run into this (network example above), where timeouts might actually be harmful, I suppose? I'm wondering if there would be a way to just "cancel" the writeback and mark the folio dirty again. That way it could be migrated, but not reclaimed. At least we could avoid the whole AS_WRITEBACK_INDETERMINATE thing. -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2024-12-20 11:44 ` David Hildenbrand @ 2024-12-20 12:15 ` Bernd Schubert 2024-12-20 14:49 ` David Hildenbrand 0 siblings, 1 reply; 124+ messages in thread From: Bernd Schubert @ 2024-12-20 12:15 UTC (permalink / raw) To: David Hildenbrand, Shakeel Butt, Joanne Koong Cc: Zi Yan, miklos, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On 12/20/24 12:44, David Hildenbrand wrote: > On 19.12.24 18:54, Shakeel Butt wrote: >> On Thu, Dec 19, 2024 at 09:44:42AM -0800, Joanne Koong wrote: >>> On Thu, Dec 19, 2024 at 9:37 AM Shakeel Butt <shakeel.butt@linux.dev> >>> wrote: >> [...] >>>>> >>>>> The request is canceled then - that should clear the page/folio state >>>>> >>>>> >>>>> I start to wonder if we should introduce really short fuse request >>>>> timeouts and just repeat requests when things have cleared up. At >>>>> least >>>>> for write-back requests (in the sense that fuse-over-network might >>>>> be slow or interrupted for some time). >>>>> >>>>> >>>> >>>> Thanks Bernd for the response. Can you tell a bit more about the >>>> request >>>> timeouts? Basically does it impact/clear the page/folio state as well? >>> >>> Request timeouts can be set by admins system-wide to protect against >>> malicious/buggy fuse servers that do not reply to requests by a >>> certain amount of time. If the request times out, then the whole >>> connection will be aborted, and pages/folios will be cleaned up >>> accordingly. The corresponding patchset is here [1]. This helps >>> mitigate the possibility of unprivileged buggy servers tieing up >>> writeback state by not replying. >>> >> >> Thanks a lot Joanne and Bernd. >> >> David, does these timeouts resolve your concerns? > > Thanks for that information. Yes and no. 
:) > > Bernd wrote: "I start to wonder if we should introduce really short fuse > request timeouts and just repeat requests when things have cleared up. > At least for write-back requests (in the sense that fuse-over-network > might be slow or interrupted for some time). > > Indicating to me that while timeouts might be supported soon (will there > be a sane default?) even trusted implementations can run into this > (network example above) where timeouts might actually be harmful I suppose? Yeah, and that makes it hard to provide a default. In Joanne's timeout patches, the admin can set a system default. https://lore.kernel.org/all/20241218222630.99920-3-joannelkoong@gmail.com/ > > I'm wondering if there would be a way to just "cancel" the writeback and > mark the folio dirty again. That way it could be migrated, but not > reclaimed. At least we could avoid the whole AS_WRITEBACK_INDETERMINATE > thing. > That is what I basically meant with short timeouts. Obviously it is not that simple to cancel the request and to retry - it would add quite some complexity, if all the issues that arise can be solved at all. Thanks, Bernd ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2024-12-20 12:15 ` Bernd Schubert @ 2024-12-20 14:49 ` David Hildenbrand 2024-12-20 15:26 ` Bernd Schubert ` (2 more replies) 0 siblings, 3 replies; 124+ messages in thread From: David Hildenbrand @ 2024-12-20 14:49 UTC (permalink / raw) To: Bernd Schubert, Shakeel Butt, Joanne Koong Cc: Zi Yan, miklos, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko >> I'm wondering if there would be a way to just "cancel" the writeback and >> mark the folio dirty again. That way it could be migrated, but not >> reclaimed. At least we could avoid the whole AS_WRITEBACK_INDETERMINATE >> thing. >> > > That is what I basically meant with short timeouts. Obviously it is not > that simple to cancel the request and to retry - it would add in quite > some complexity, if all the issues that arise can be solved at all. At least it would keep that out of core-mm. AS_WRITEBACK_INDETERMINATE really has a weird smell to it ... we should try to improve such scenarios, not acknowledge and integrate them, then work around them using timeouts that must be manually configured, and can likely not be enabled by default because that could hurt reasonable use cases :( Right now we clear the writeback flag immediately, indicating that data was written back, when in fact it was not written back at all. I suspect fsync() currently handles that manually already, to wait for any of the allocated pages to actually get written back by user space, so we have control over when something was *actually* written back. Similar to your proposal, I wonder if there could be a way to request fuse to "abort" a writeback request (instead of using fixed timeouts per request). Meaning, when we stumble over a folio that is under writeback on some paths, we would tell fuse to "end writeback now", or "end writeback now if it takes longer than X". 
Essentially hidden inside folio_wait_writeback(). When aborting a request, as I said, we would essentially "end writeback" and mark the folio as dirty again. The interesting thing is likely how to handle user space that wants to process this request right now (stuck in fuse_send_writepage() I assume?), correct? Just throwing it out there ... no expert at all on fuse ... -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 124+ messages in thread
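Editorial sketch of the "abort writeback and redirty" idea floated above, written as a tiny userspace state model. It is purely hypothetical: no such abort operation exists in the thread's patches, the `model_*` names are invented, and the reclaim predicate is deliberately simplified (a dirty folio must be written back before it can be reclaimed).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the two folio flags discussed in the thread. */
struct model_folio {
    bool dirty;
    bool writeback;
};

/* Writeback starts: the folio moves from dirty to under-writeback. */
static void model_start_writeback(struct model_folio *f)
{
    assert(f->dirty && !f->writeback);
    f->dirty = false;
    f->writeback = true;
}

/* Normal completion: the server replied, so the folio is clean. */
static void model_end_writeback(struct model_folio *f)
{
    assert(f->writeback);
    f->writeback = false;
}

/* Proposed abort: end writeback early and mark the folio dirty again. */
static void model_abort_writeback(struct model_folio *f)
{
    assert(f->writeback);
    f->writeback = false;
    f->dirty = true;
}

/* Simplified: migration only needs the folio not to be under writeback. */
static bool model_can_migrate(const struct model_folio *f)
{
    return !f->writeback;
}

/* Simplified: only clean folios that are not under writeback are reclaimable. */
static bool model_can_reclaim(const struct model_folio *f)
{
    return !f->writeback && !f->dirty;
}
```

After `model_abort_writeback()` the folio is migratable but not reclaimable, matching the property described in the message. The model says nothing about the hard parts raised in the replies, such as write ordering, splice, or a server that is already processing the aborted request.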
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2024-12-20 14:49 ` David Hildenbrand @ 2024-12-20 15:26 ` Bernd Schubert 2024-12-20 18:01 ` Shakeel Butt 2024-12-20 21:01 ` Joanne Koong 2 siblings, 0 replies; 124+ messages in thread From: Bernd Schubert @ 2024-12-20 15:26 UTC (permalink / raw) To: David Hildenbrand, Shakeel Butt, Joanne Koong Cc: Zi Yan, miklos, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On 12/20/24 15:49, David Hildenbrand wrote: >>> I'm wondering if there would be a way to just "cancel" the writeback and >>> mark the folio dirty again. That way it could be migrated, but not >>> reclaimed. At least we could avoid the whole AS_WRITEBACK_INDETERMINATE >>> thing. >>> >> >> That is what I basically meant with short timeouts. Obviously it is not >> that simple to cancel the request and to retry - it would add in quite >> some complexity, if all the issues that arise can be solved at all. > > At least it would keep that out of core-mm. > > AS_WRITEBACK_INDETERMINATE really has weird smell to it ... we should > try to improve such scenarios, not acknowledge and integrate them, then > work around using timeouts that must be manually configured, and ca > likely no be default enabled because it could hurt reasonable use cases :( > > Right now we clear the writeback flag immediately, indicating that data > was written back, when in fact it was not written back at all. I suspect > fsync() currently handles that manually already, to wait for any of the > allocated pages to actually get written back by user space, so we have > control over when something was *actually* written back. Yeah, fuse_writepage_end() decreases fi->writectr, which gets checked by fsync. 
Knowing when something has been written back is not the issue, but keeping order, handling splice, possible double write to the same range (it should be mostly idempotent, but is that guaranteed by all servers?), etc. > > > Similar to your proposal, I wonder if there could be a way to request > fuse to "abort" a writeback request (instead of using fixed timeouts per > request). Meaning, when we stumble over a folio that is under writeback > on some paths, we would tell fuse to "end writeback now", or "end > writeback now if it takes longer than X". Essentially hidden inside > folio_wait_writeback(). Yeah, that would be a minor improvement to the overall issue ;) Re-queue issue. > > When aborting a request, as I said, we would essentially "end writeback" > and mark the folio as dirty again. The interesting thing is likely how > to handle user space that wants to process this request right now (stuck > in fuse_send_writepage() I assume?), correct? That sends background requests - does not get stuck. Completion happens in fuse_writepage_end(), when the request reply is received. Thanks, Bernd ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2024-12-20 14:49 ` David Hildenbrand 2024-12-20 15:26 ` Bernd Schubert @ 2024-12-20 18:01 ` Shakeel Butt 2024-12-21 2:28 ` Jingbo Xu 2024-12-21 16:18 ` David Hildenbrand 2024-12-20 21:01 ` Joanne Koong 2 siblings, 2 replies; 124+ messages in thread From: Shakeel Butt @ 2024-12-20 18:01 UTC (permalink / raw) To: David Hildenbrand Cc: Bernd Schubert, Joanne Koong, Zi Yan, miklos, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On Fri, Dec 20, 2024 at 03:49:39PM +0100, David Hildenbrand wrote: > > > I'm wondering if there would be a way to just "cancel" the writeback and > > > mark the folio dirty again. That way it could be migrated, but not > > > reclaimed. At least we could avoid the whole AS_WRITEBACK_INDETERMINATE > > > thing. > > > > > > > That is what I basically meant with short timeouts. Obviously it is not > > that simple to cancel the request and to retry - it would add in quite > > some complexity, if all the issues that arise can be solved at all. > > At least it would keep that out of core-mm. > > AS_WRITEBACK_INDETERMINATE really has weird smell to it ... we should try to > improve such scenarios, not acknowledge and integrate them, then work around > using timeouts that must be manually configured, and ca likely no be default > enabled because it could hurt reasonable use cases :( Just to be clear AS_WRITEBACK_INDETERMINATE is being used in two core-mm parts. First is reclaim and second is compaction/migration. For reclaim, it is a must have as explained by Jingbo in [1] i.e. due to potential self deadlock by fuse server. If I understand you correctly, the main concern you have is its usage in the second case. 
The reason for adding AS_WRITEBACK_INDETERMINATE in the second case was to avoid an untrusted fuse server causing pain to unrelated jobs on the machine (fuse folks please correct me if I am wrong here). Now we are discussing how to better handle that scenario. I just wanted to point out that irrespective of that discussion, the reclaim will have to handle the potential recursive deadlock and thus will be using AS_WRITEBACK_INDETERMINATE or something similar. [1] https://lore.kernel.org/all/d48ae58e-500f-4ef1-bc6f-a41a8f5b94bf@linux.alibaba.com/ ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2024-12-20 18:01 ` Shakeel Butt @ 2024-12-21 2:28 ` Jingbo Xu 2024-12-21 16:23 ` David Hildenbrand 2024-12-21 16:18 ` David Hildenbrand 1 sibling, 1 reply; 124+ messages in thread From: Jingbo Xu @ 2024-12-21 2:28 UTC (permalink / raw) To: Shakeel Butt, David Hildenbrand Cc: Bernd Schubert, Joanne Koong, Zi Yan, miklos, linux-fsdevel, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On 12/21/24 2:01 AM, Shakeel Butt wrote: > On Fri, Dec 20, 2024 at 03:49:39PM +0100, David Hildenbrand wrote: >>>> I'm wondering if there would be a way to just "cancel" the writeback and >>>> mark the folio dirty again. That way it could be migrated, but not >>>> reclaimed. At least we could avoid the whole AS_WRITEBACK_INDETERMINATE >>>> thing. >>>> >>> >>> That is what I basically meant with short timeouts. Obviously it is not >>> that simple to cancel the request and to retry - it would add in quite >>> some complexity, if all the issues that arise can be solved at all. >> >> At least it would keep that out of core-mm. >> >> AS_WRITEBACK_INDETERMINATE really has weird smell to it ... we should try to >> improve such scenarios, not acknowledge and integrate them, then work around >> using timeouts that must be manually configured, and ca likely no be default >> enabled because it could hurt reasonable use cases :( > > Just to be clear AS_WRITEBACK_INDETERMINATE is being used in two core-mm > parts. First is reclaim and second is compaction/migration. For reclaim, > it is a must have as explained by Jingbo in [1] i.e. due to potential > self deadlock by fuse server. If I understand you correctly, the main > concern you have is its usage in the second case. 
> > The reason for adding AS_WRITEBACK_INDETERMINATE in the second case was > to avoid untrusted fuse server causing pain to unrelated jobs on the > machine (fuse folks please correct me if I am wrong here). Right, IIUC direct MIGRATE_SYNC migration won't be triggered on the memory allocation path, i.e. the fuse server itself won't stumble into MIGRATE_SYNC migration. -- Thanks, Jingbo ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2024-12-21 2:28 ` Jingbo Xu @ 2024-12-21 16:23 ` David Hildenbrand 2024-12-22 2:47 ` Jingbo Xu 0 siblings, 1 reply; 124+ messages in thread From: David Hildenbrand @ 2024-12-21 16:23 UTC (permalink / raw) To: Jingbo Xu, Shakeel Butt Cc: Bernd Schubert, Joanne Koong, Zi Yan, miklos, linux-fsdevel, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On 21.12.24 03:28, Jingbo Xu wrote: > > > On 12/21/24 2:01 AM, Shakeel Butt wrote: >> On Fri, Dec 20, 2024 at 03:49:39PM +0100, David Hildenbrand wrote: >>>>> I'm wondering if there would be a way to just "cancel" the writeback and >>>>> mark the folio dirty again. That way it could be migrated, but not >>>>> reclaimed. At least we could avoid the whole AS_WRITEBACK_INDETERMINATE >>>>> thing. >>>>> >>>> >>>> That is what I basically meant with short timeouts. Obviously it is not >>>> that simple to cancel the request and to retry - it would add in quite >>>> some complexity, if all the issues that arise can be solved at all. >>> >>> At least it would keep that out of core-mm. >>> >>> AS_WRITEBACK_INDETERMINATE really has weird smell to it ... we should try to >>> improve such scenarios, not acknowledge and integrate them, then work around >>> using timeouts that must be manually configured, and ca likely no be default >>> enabled because it could hurt reasonable use cases :( >> >> Just to be clear AS_WRITEBACK_INDETERMINATE is being used in two core-mm >> parts. First is reclaim and second is compaction/migration. For reclaim, >> it is a must have as explained by Jingbo in [1] i.e. due to potential >> self deadlock by fuse server. If I understand you correctly, the main >> concern you have is its usage in the second case. 
>> >> The reason for adding AS_WRITEBACK_INDETERMINATE in the second case was >> to avoid untrusted fuse server causing pain to unrelated jobs on the >> machine (fuse folks please correct me if I am wrong here). > > Right, IIUC direct MIGRATE_SYNC migration won't be triggered on the > memory allocation path, i.e. the fuse server itself won't stumble into > MIGRATE_SYNC migration. > Maybe memory compaction (on higher-order allocations only) could trigger it? gfp_compaction_allowed() checks __GFP_IO. GFP_KERNEL includes that. -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2024-12-21 16:23 ` David Hildenbrand @ 2024-12-22 2:47 ` Jingbo Xu 2024-12-24 11:32 ` David Hildenbrand 0 siblings, 1 reply; 124+ messages in thread From: Jingbo Xu @ 2024-12-22 2:47 UTC (permalink / raw) To: David Hildenbrand, Shakeel Butt Cc: Bernd Schubert, Joanne Koong, Zi Yan, miklos, linux-fsdevel, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On 12/22/24 12:23 AM, David Hildenbrand wrote: > On 21.12.24 03:28, Jingbo Xu wrote: >> >> >> On 12/21/24 2:01 AM, Shakeel Butt wrote: >>> On Fri, Dec 20, 2024 at 03:49:39PM +0100, David Hildenbrand wrote: >>>>>> I'm wondering if there would be a way to just "cancel" the >>>>>> writeback and >>>>>> mark the folio dirty again. That way it could be migrated, but not >>>>>> reclaimed. At least we could avoid the whole >>>>>> AS_WRITEBACK_INDETERMINATE >>>>>> thing. >>>>>> >>>>> >>>>> That is what I basically meant with short timeouts. Obviously it is >>>>> not >>>>> that simple to cancel the request and to retry - it would add in quite >>>>> some complexity, if all the issues that arise can be solved at all. >>>> >>>> At least it would keep that out of core-mm. >>>> >>>> AS_WRITEBACK_INDETERMINATE really has weird smell to it ... we >>>> should try to >>>> improve such scenarios, not acknowledge and integrate them, then >>>> work around >>>> using timeouts that must be manually configured, and ca likely no be >>>> default >>>> enabled because it could hurt reasonable use cases :( >>> >>> Just to be clear AS_WRITEBACK_INDETERMINATE is being used in two core-mm >>> parts. First is reclaim and second is compaction/migration. For reclaim, >>> it is a must have as explained by Jingbo in [1] i.e. due to potential >>> self deadlock by fuse server. If I understand you correctly, the main >>> concern you have is its usage in the second case. 
>>> >>> The reason for adding AS_WRITEBACK_INDETERMINATE in the second case was >>> to avoid untrusted fuse server causing pain to unrelated jobs on the >>> machine (fuse folks please correct me if I am wrong here). >> >> Right, IIUC direct MIGRATE_SYNC migration won't be triggered on the >> memory allocation path, i.e. the fuse server itself won't stumble into >> MIGRATE_SYNC migration. >> > > Maybe memory compaction (on higher-order allocations only) could trigger > it? > > gfp_compaction_allowed() checks __GFP_IO. GFP_KERNEL includes that. > But that (memory compaction on memory allocation, which can be triggered in the fuse server process context) only triggers MIGRATE_SYNC_LIGHT, which won't wait for writeback. AFAICS, MIGRATE_SYNC can be triggered during cma allocation, memory offline, or node compaction manually through sysctl. -- Thanks, Jingbo ^ permalink raw reply [flat|nested] 124+ messages in thread
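Editorial sketch of the distinction drawn above: only MIGRATE_SYNC waits on a folio under writeback, lighter modes skip the folio, and the patch under discussion additionally skips the wait when the mapping is flagged as indeterminate. This is a simplified userspace decision table, not the kernel's migration code; the `MODEL_*` and `model_*` names are invented, and details such as the `force` argument are left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's migrate modes. */
enum model_migrate_mode {
    MODEL_MIGRATE_ASYNC,
    MODEL_MIGRATE_SYNC_LIGHT,
    MODEL_MIGRATE_SYNC,
};

enum model_action {
    MODEL_PROCEED,          /* folio is not under writeback, migrate it */
    MODEL_SKIP_BUSY,        /* leave the folio alone for now */
    MODEL_WAIT_WRITEBACK,   /* block until writeback completes */
};

/*
 * Decide what to do with a folio that is a migration candidate.
 * mapping_indeterminate models AS_WRITEBACK_INDETERMINATE being set
 * on the folio's mapping.
 */
static enum model_action
model_migrate_decision(enum model_migrate_mode mode, bool under_writeback,
                       bool mapping_indeterminate)
{
    assert(mode <= MODEL_MIGRATE_SYNC);

    if (!under_writeback)
        return MODEL_PROCEED;

    /* ASYNC and SYNC_LIGHT never wait for writeback. */
    if (mode != MODEL_MIGRATE_SYNC)
        return MODEL_SKIP_BUSY;

    /* With the patch: do not wait when completion time is unbounded. */
    if (mapping_indeterminate)
        return MODEL_SKIP_BUSY;

    return MODEL_WAIT_WRITEBACK;
}
```

In this model the allocation-path compaction case mentioned above (MIGRATE_SYNC_LIGHT) never reaches the wait, so the flag only changes the outcome for full MIGRATE_SYNC callers such as CMA allocation or memory offlining.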
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2024-12-22 2:47 ` Jingbo Xu @ 2024-12-24 11:32 ` David Hildenbrand 0 siblings, 0 replies; 124+ messages in thread From: David Hildenbrand @ 2024-12-24 11:32 UTC (permalink / raw) To: Jingbo Xu, Shakeel Butt Cc: Bernd Schubert, Joanne Koong, Zi Yan, miklos, linux-fsdevel, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On 22.12.24 03:47, Jingbo Xu wrote: > > > On 12/22/24 12:23 AM, David Hildenbrand wrote: >> On 21.12.24 03:28, Jingbo Xu wrote: >>> >>> >>> On 12/21/24 2:01 AM, Shakeel Butt wrote: >>>> On Fri, Dec 20, 2024 at 03:49:39PM +0100, David Hildenbrand wrote: >>>>>>> I'm wondering if there would be a way to just "cancel" the >>>>>>> writeback and >>>>>>> mark the folio dirty again. That way it could be migrated, but not >>>>>>> reclaimed. At least we could avoid the whole >>>>>>> AS_WRITEBACK_INDETERMINATE >>>>>>> thing. >>>>>>> >>>>>> >>>>>> That is what I basically meant with short timeouts. Obviously it is >>>>>> not >>>>>> that simple to cancel the request and to retry - it would add in quite >>>>>> some complexity, if all the issues that arise can be solved at all. >>>>> >>>>> At least it would keep that out of core-mm. >>>>> >>>>> AS_WRITEBACK_INDETERMINATE really has weird smell to it ... we >>>>> should try to >>>>> improve such scenarios, not acknowledge and integrate them, then >>>>> work around >>>>> using timeouts that must be manually configured, and ca likely no be >>>>> default >>>>> enabled because it could hurt reasonable use cases :( >>>> >>>> Just to be clear AS_WRITEBACK_INDETERMINATE is being used in two core-mm >>>> parts. First is reclaim and second is compaction/migration. For reclaim, >>>> it is a must have as explained by Jingbo in [1] i.e. due to potential >>>> self deadlock by fuse server. If I understand you correctly, the main >>>> concern you have is its usage in the second case. 
>>>> >>>> The reason for adding AS_WRITEBACK_INDETERMINATE in the second case was >>>> to avoid untrusted fuse server causing pain to unrelated jobs on the >>>> machine (fuse folks please correct me if I am wrong here). >>> >>> Right, IIUC direct MIGRATE_SYNC migration won't be triggered on the >>> memory allocation path, i.e. the fuse server itself won't stumble into >>> MIGRATE_SYNC migration. >>> >> >> Maybe memory compaction (on higher-order allocations only) could trigger >> it? >> >> gfp_compaction_allowed() checks __GFP_IO. GFP_KERNEL includes that. >> > > But that (memory compaction on memory allocation, which can be triggered > in the fuse server process context) only triggers MIGRATE_SYNC_LIGHT, > which won't wait for writeback. > Ah, that makes sense. > AFAICS, MIGRATE_SYNC can be triggered during cma allocation, memory > offline, or node compaction manually through sysctl. Right, non-proactive compaction always uses MIGRATE_SYNC_LIGHT, that won't wait. -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 124+ messages in thread
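The MIGRATE_SYNC versus MIGRATE_SYNC_LIGHT exchange above can be sketched as a small userspace model. This is not kernel code: every name below (`model_migrate_mode`, `model_folio`, `migration_action`) is hypothetical and only mirrors what the thread states, namely that MIGRATE_SYNC_LIGHT won't wait for writeback while MIGRATE_SYNC may, and that the patch under discussion skips folios under writeback whose mapping is AS_WRITEBACK_INDETERMINATE instead of waiting on them.

```c
#include <stdbool.h>

/* Hypothetical userspace model; these are NOT kernel symbols. */
enum model_migrate_mode { MODEL_ASYNC, MODEL_SYNC_LIGHT, MODEL_SYNC };

enum model_action {
    MODEL_MIGRATE_NOW,        /* folio is not under writeback */
    MODEL_WAIT_FOR_WRITEBACK, /* block until writeback ends, then migrate */
    MODEL_SKIP_BUSY           /* give up on this folio for now */
};

struct model_folio {
    bool under_writeback;
    bool mapping_indeterminate; /* stands in for AS_WRITEBACK_INDETERMINATE */
};

/* Decide what a migration attempt does with one folio in this model. */
enum model_action migration_action(enum model_migrate_mode mode,
                                   const struct model_folio *folio)
{
    if (!folio->under_writeback)
        return MODEL_MIGRATE_NOW;

    /* Per the thread: only full MIGRATE_SYNC is willing to wait. */
    if (mode != MODEL_SYNC)
        return MODEL_SKIP_BUSY;

    /* Patch 4/5 as described: never wait on indeterminate writeback. */
    if (folio->mapping_indeterminate)
        return MODEL_SKIP_BUSY;

    return MODEL_WAIT_FOR_WRITEBACK;
}
```

In this model, compaction from an allocation in the fuse server's own context (MIGRATE_SYNC_LIGHT) never reaches the waiting branch, which matches the conclusion above that the server cannot deadlock itself through that path.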
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2024-12-20 18:01 ` Shakeel Butt 2024-12-21 2:28 ` Jingbo Xu @ 2024-12-21 16:18 ` David Hildenbrand 2024-12-23 22:14 ` Shakeel Butt 1 sibling, 1 reply; 124+ messages in thread From: David Hildenbrand @ 2024-12-21 16:18 UTC (permalink / raw) To: Shakeel Butt Cc: Bernd Schubert, Joanne Koong, Zi Yan, miklos, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On 20.12.24 19:01, Shakeel Butt wrote: > On Fri, Dec 20, 2024 at 03:49:39PM +0100, David Hildenbrand wrote: >>>> I'm wondering if there would be a way to just "cancel" the writeback and >>>> mark the folio dirty again. That way it could be migrated, but not >>>> reclaimed. At least we could avoid the whole AS_WRITEBACK_INDETERMINATE >>>> thing. >>>> >>> >>> That is what I basically meant with short timeouts. Obviously it is not >>> that simple to cancel the request and to retry - it would add in quite >>> some complexity, if all the issues that arise can be solved at all. >> >> At least it would keep that out of core-mm. >> >> AS_WRITEBACK_INDETERMINATE really has weird smell to it ... we should try to >> improve such scenarios, not acknowledge and integrate them, then work around >> using timeouts that must be manually configured, and ca likely no be default >> enabled because it could hurt reasonable use cases :( > > Just to be clear AS_WRITEBACK_INDETERMINATE is being used in two core-mm > parts. First is reclaim and second is compaction/migration. For reclaim, > it is a must have as explained by Jingbo in [1] i.e. due to potential > self deadlock by fuse server. If I understand you correctly, the main > concern you have is its usage in the second case. 
Yes, so I can see fuse (1) Breaking memory reclaim (memory cannot get freed up) (2) Breaking page migration (memory cannot be migrated) Due to (1) we might experience bigger memory pressure in the system I guess. A handful of these pages don't really hurt; I have no idea how bad having many of these pages can be. But yes, inherently we cannot throw away the data as long as it is dirty without causing harm. (maybe we could move it to some other cache, like swap/zswap; but that smells like a big and complicated project) Due to (2) we turn pages that are supposed to be movable into unmovable ones, possibly for a long time. Even a *single* such page will mean that CMA allocations / memory unplug can start failing. We have similar situations with page pinning. With things like O_DIRECT, our assumption/experience so far is that it will only take a couple of seconds max, and retry loops are sufficient to handle it. That's why only long-term pinning ("indeterminate", e.g., vfio) migrates these pages out of ZONE_MOVABLE/MIGRATE_CMA areas in order to long-term pin them. The biggest concern I have is that timeouts, while likely reasonable in many scenarios, might not be desirable even for some sane workloads, and the default in all systems will be "no timeout", leaving it to the clueless admin of each and every system out there that might support fuse to make a decision. I might have misunderstood something, in which case I am very sorry, but we also don't want CMA allocations to start failing simply because a network connection is down for a couple of minutes such that a fuse daemon cannot make progress. > > The reason for adding AS_WRITEBACK_INDETERMINATE in the second case was > to avoid untrusted fuse server causing pain to unrelated jobs on the > machine (fuse folks please correct me if I am wrong here). Now we are > discussing how to better handle that scenario.
> > I just wanted to point out that irrespective of that discussion, the > reclaim will have handle the potential recursive deadlock and thus will > be using AS_WRITEBACK_INDETERMINATE or something similar. Yes, I see no way to throw away dirty data without causing harm. Migration was kept working for now, although in a hacky fashion I admit. I do enjoy that "writeback" on the folio actually matches the reality now. I guess an alternative to "aborting writeback" would be to make fuse allow for migrating folios that are under writeback. I would assume that with fuse we have very good control over who is currently reading/writing that folio, and we could swap it out? Again, just an idea ... -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2024-12-21 16:18 ` David Hildenbrand @ 2024-12-23 22:14 ` Shakeel Butt 2024-12-24 12:37 ` David Hildenbrand 0 siblings, 1 reply; 124+ messages in thread From: Shakeel Butt @ 2024-12-23 22:14 UTC (permalink / raw) To: David Hildenbrand Cc: Bernd Schubert, Joanne Koong, Zi Yan, miklos, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On Sat, Dec 21, 2024 at 05:18:20PM +0100, David Hildenbrand wrote: [...] > > Yes, so I can see fuse > > (1) Breaking memory reclaim (memory cannot get freed up) > > (2) Breaking page migration (memory cannot be migrated) > > Due to (1) we might experience bigger memory pressure in the system I guess. > A handful of these pages don't really hurt, I have no idea how bad having > many of these pages can be. But yes, inherently we cannot throw away the > data as long as it is dirty without causing harm. (maybe we could move it to > some other cache, like swap/zswap; but that smells like a big and > complicated project) > > Due to (2) we turn pages that are supposed to be movable possibly for a long > time unmovable. Even a *single* such page will mean that CMA allocations / > memory unplug can start failing. > > We have similar situations with page pinning. With things like O_DIRECT, our > assumption/experience so far is that it will only take a couple of seconds > max, and retry loops are sufficient to handle it. That's why only long-term > pinning ("indeterminate", e.g., vfio) migrate these pages out of > ZONE_MOVABLE/MIGRATE_CMA areas in order to long-term pin them. > > > The biggest concern I have is that timeouts, while likely reasonable it many > scenarios, might not be desirable even for some sane workloads, and the > default in all system will be "no timeout", letting the clueless admin of > each and every system out there that might support fuse to make a decision. 
> > I might have misunderstood something, in which case I am very sorry, but we > also don't want CMA allocations to start failing simply because a network > connection is down for a couple of minutes such that a fuse daemon cannot > make progress. > I think you have valid concerns but these are not new and not unique to fuse. Any filesystem with a potential arbitrary stall can have similar issues. The arbitrary stall can be caused by network issues or some faulty local storage. Regarding the reclaim, I wouldn't say fuse or similar filesystems are breaking memory reclaim, as the kernel has mechanisms to throttle the threads dirtying the file memory to reduce the chance of situations where most of the memory becomes unreclaimable due to being dirty. Please note that such filesystems are mostly used in environments like data centers or hyperscalers and usually have more advanced mechanisms to handle and avoid situations like long delays. For such environments, network unavailability is a larger issue than some cma allocation failure. My point is: let's not assume the disastrous situation is normal and overcomplicate the solution. ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2024-12-23 22:14 ` Shakeel Butt @ 2024-12-24 12:37 ` David Hildenbrand 2024-12-26 15:11 ` Zi Yan 2024-12-26 20:13 ` Shakeel Butt 0 siblings, 2 replies; 124+ messages in thread From: David Hildenbrand @ 2024-12-24 12:37 UTC (permalink / raw) To: Shakeel Butt Cc: Bernd Schubert, Joanne Koong, Zi Yan, miklos, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On 23.12.24 23:14, Shakeel Butt wrote: > On Sat, Dec 21, 2024 at 05:18:20PM +0100, David Hildenbrand wrote: > [...] >> >> Yes, so I can see fuse >> >> (1) Breaking memory reclaim (memory cannot get freed up) >> >> (2) Breaking page migration (memory cannot be migrated) >> >> Due to (1) we might experience bigger memory pressure in the system I guess. >> A handful of these pages don't really hurt, I have no idea how bad having >> many of these pages can be. But yes, inherently we cannot throw away the >> data as long as it is dirty without causing harm. (maybe we could move it to >> some other cache, like swap/zswap; but that smells like a big and >> complicated project) >> >> Due to (2) we turn pages that are supposed to be movable possibly for a long >> time unmovable. Even a *single* such page will mean that CMA allocations / >> memory unplug can start failing. >> >> We have similar situations with page pinning. With things like O_DIRECT, our >> assumption/experience so far is that it will only take a couple of seconds >> max, and retry loops are sufficient to handle it. That's why only long-term >> pinning ("indeterminate", e.g., vfio) migrate these pages out of >> ZONE_MOVABLE/MIGRATE_CMA areas in order to long-term pin them. 
>> >> >> The biggest concern I have is that timeouts, while likely reasonable it many >> scenarios, might not be desirable even for some sane workloads, and the >> default in all system will be "no timeout", letting the clueless admin of >> each and every system out there that might support fuse to make a decision. >> >> I might have misunderstood something, in which case I am very sorry, but we >> also don't want CMA allocations to start failing simply because a network >> connection is down for a couple of minutes such that a fuse daemon cannot >> make progress. >> > > I think you have valid concerns but these are not new and not unique to > fuse. Any filesystem with a potential arbitrary stall can have similar > issues. The arbitrary stall can be caused due to network issues or some > faultly local storage. What concerns me more is that this can be triggered even by unprivileged user space, and that there is no default protection as far as I understand, because timeouts cannot be set universally to sane defaults. Again, please correct me if I got that wrong. BTW, I just looked at NFS out of interest, in particular nfs_page_async_flush(), and I spotted some logic about re-dirtying pages + canceling writeback. IIUC, there are default timeouts for UDP and TCP, whereby the TCP default seems to be around 60s (* retrans?), and the privileged user that mounts it can set higher ones. I guess one could run into similar writeback issues? So I wonder why we never required AS_WRITEBACK_INDETERMINATE for nfs? Not sure if I grasped all the details about NFS and writeback and when it would redirty+end writeback, and if there is some other handling in there. > > Regarding the reclaim, I wouldn't say fuse or similar filesystem are > breaking memory reclaim as the kernel has mechanism to throttle the > threads dirtying the file memory to reduce the chance of situations > where most of memory becomes unreclaimable due to being dirty.
Yes, likely even cgroups can easily limit the amount. > > Please note that such filesystems are mostly used in environments like > data center or hyperscalar and usually have more advanced mechanisms to > handle and avoid situations like long delays. For such environment > network unavailability is a larger issue than some cma allocation > failure. My point is: let's not assume the disastrous situaion is normal > and overcomplicate the solution. Let me summarize my main point: ZONE_MOVABLE/MIGRATE_CMA must only be used for movable allocations. Mechanisms that possibly turn these folios unmovable for a long/indeterminate time must either fail or migrate these folios out of these regions, otherwise we start violating the very semantics for which ZONE_MOVABLE/MIGRATE_CMA were added in the first place. Yes, there are corner cases where we cannot guarantee movability (e.g., OOM when allocating a migration destination), but these are not cases that can be triggered by (unprivileged) user space easily. That's why FOLL_LONGTERM pinning does exactly that: even if user space promised that this is really only "short-term", we would treat it as "possibly forever", because it's under user-space control. Instead of having more subsystems violate these semantics because "performance" ... I would hope we would do better. Maybe it's an issue for NFS as well ("at least" only for privileged user space)? In which case, again, I would hope we would do better. Anyhow, I'm hoping there will be more feedback from other MM folks, but likely right now a lot of people are out (just like I should be ;) ). If I end up being the only one with these concerns, then likely people can feel free to ignore them. ;) -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2024-12-24 12:37 ` David Hildenbrand @ 2024-12-26 15:11 ` Zi Yan 2024-12-26 20:13 ` Shakeel Butt 1 sibling, 0 replies; 124+ messages in thread From: Zi Yan @ 2024-12-26 15:11 UTC (permalink / raw) To: David Hildenbrand Cc: Shakeel Butt, Bernd Schubert, Joanne Koong, miklos, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On 24 Dec 2024, at 7:37, David Hildenbrand wrote: > On 23.12.24 23:14, Shakeel Butt wrote: >> On Sat, Dec 21, 2024 at 05:18:20PM +0100, David Hildenbrand wrote: >> [...] >>> >>> Yes, so I can see fuse >>> >>> (1) Breaking memory reclaim (memory cannot get freed up) >>> >>> (2) Breaking page migration (memory cannot be migrated) >>> >>> Due to (1) we might experience bigger memory pressure in the system I guess. >>> A handful of these pages don't really hurt, I have no idea how bad having >>> many of these pages can be. But yes, inherently we cannot throw away the >>> data as long as it is dirty without causing harm. (maybe we could move it to >>> some other cache, like swap/zswap; but that smells like a big and >>> complicated project) >>> >>> Due to (2) we turn pages that are supposed to be movable possibly for a long >>> time unmovable. Even a *single* such page will mean that CMA allocations / >>> memory unplug can start failing. >>> >>> We have similar situations with page pinning. With things like O_DIRECT, our >>> assumption/experience so far is that it will only take a couple of seconds >>> max, and retry loops are sufficient to handle it. That's why only long-term >>> pinning ("indeterminate", e.g., vfio) migrate these pages out of >>> ZONE_MOVABLE/MIGRATE_CMA areas in order to long-term pin them. 
>>> >>> >>> The biggest concern I have is that timeouts, while likely reasonable it many >>> scenarios, might not be desirable even for some sane workloads, and the >>> default in all system will be "no timeout", letting the clueless admin of >>> each and every system out there that might support fuse to make a decision. >>> >>> I might have misunderstood something, in which case I am very sorry, but we >>> also don't want CMA allocations to start failing simply because a network >>> connection is down for a couple of minutes such that a fuse daemon cannot >>> make progress. >>> >> >> I think you have valid concerns but these are not new and not unique to >> fuse. Any filesystem with a potential arbitrary stall can have similar >> issues. The arbitrary stall can be caused due to network issues or some >> faultly local storage. > > What concerns me more is that this is can be triggered by even unprivileged user space, and that there is no default protection as far as I understood, because timeouts cannot be set universally to a sane defaults. > > Again, please correct me if I got that wrong. > > > BTW, I just looked at NFS out of interest, in particular nfs_page_async_flush(), and I spot some logic about re-dirtying pages + canceling writeback. IIUC, there are default timeouts for UDP and TCP, whereby the TCP default one seems to be around 60s (* retrans?), and the privileged user that mounts it can set higher ones. I guess one could run into similar writeback issues? > > So I wonder why we never required AS_WRITEBACK_INDETERMINATE for nfs? Not sure if I grasped all details about NFS and writeback and when it would redirty+end writeback, and if there is some other handling in there. > >> >> Regarding the reclaim, I wouldn't say fuse or similar filesystem are >> breaking memory reclaim as the kernel has mechanism to throttle the >> threads dirtying the file memory to reduce the chance of situations >> where most of memory becomes unreclaimable due to being dirty. 
> > Yes, likely even cgroups can easily limit the amount. > >> >> Please note that such filesystems are mostly used in environments like >> data center or hyperscalar and usually have more advanced mechanisms to >> handle and avoid situations like long delays. For such environment >> network unavailability is a larger issue than some cma allocation >> failure. My point is: let's not assume the disastrous situaion is normal >> and overcomplicate the solution. > > Let me summarize my main point: ZONE_MOVABLE/MIGRATE_CMA must only be used for movable allocations. Exactly this. > > Mechanisms that possible turn these folios unmovable for a long/indeterminate time must either fail or migrate these folios out of these regions, otherwise we start violating the very semantics why ZONE_MOVABLE/MIGRATE_CMA was added in the first place. Totally agree. > > Yes, there are corner cases where we cannot guarantee movability (e.g., OOM when allocating a migration destination), but these are not cases that can be triggered by (unprivileged) user space easily. > > That's why FOLL_LONGTERM pinning does exactly that: even if user space would promise that this is really only "short-term", we will treat it as "possibly forever", because it's under user-space control. > > > Instead of having more subsystems violate these semantics because "performance" ... I would hope we would do better. Maybe it's an issue for NFS as well ("at least" only for privileged user space)? In which case, again, I would hope we would do better. Another issue with the proposed AS_WRITEBACK_INDETERMINATE approach is that FUSE used to use temp pages from MIGRATE_UNMOVABLE to write back dirty pages, which confines these unmovable pages within certain pageblocks, but now any dirty page can become unmovable due to AS_WRITEBACK_INDETERMINATE and they can spread across the entire physical space. 
This means memory can become fragmented much more easily: previously, 512 dirty pages could all be confined to 1 pageblock, but now in the worst case the same 512 pages can be spread across 512 pageblocks. -- Best Regards, Yan, Zi ^ permalink raw reply [flat|nested] 124+ messages in thread
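The 1-versus-512 pageblock arithmetic in the message above can be worked through with a short sketch. It assumes 512 base pages per pageblock (for example 4 KiB pages with 2 MiB pageblocks), which is configuration-dependent; the function names are hypothetical and exist only for this illustration.

```c
#include <stddef.h>

/* Assumption: 512 base pages per pageblock (e.g. 4 KiB pages, 2 MiB blocks). */
#define PAGES_PER_PAGEBLOCK 512u

/* Fewest pageblocks that can hold n unmovable pages (all packed together). */
size_t pageblocks_best_case(size_t n_pages)
{
    return (n_pages + PAGES_PER_PAGEBLOCK - 1) / PAGES_PER_PAGEBLOCK;
}

/* Most pageblocks n unmovable pages can touch (one page per pageblock). */
size_t pageblocks_worst_case(size_t n_pages, size_t total_pageblocks)
{
    return n_pages < total_pageblocks ? n_pages : total_pageblocks;
}
```

Under that assumption, 512 temp pages packed into MIGRATE_UNMOVABLE space occupy 1 pageblock, whereas 512 scattered folios under writeback can touch 512 pageblocks, i.e. up to 1 GiB of address space containing at least one temporarily unmovable page.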
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2024-12-24 12:37 ` David Hildenbrand 2024-12-26 15:11 ` Zi Yan @ 2024-12-26 20:13 ` Shakeel Butt 2024-12-26 22:02 ` Bernd Schubert ` (2 more replies) 1 sibling, 3 replies; 124+ messages in thread From: Shakeel Butt @ 2024-12-26 20:13 UTC (permalink / raw) To: David Hildenbrand Cc: Bernd Schubert, Joanne Koong, Zi Yan, miklos, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On Tue, Dec 24, 2024 at 01:37:49PM +0100, David Hildenbrand wrote: > On 23.12.24 23:14, Shakeel Butt wrote: > > On Sat, Dec 21, 2024 at 05:18:20PM +0100, David Hildenbrand wrote: > > [...] > > > > > > Yes, so I can see fuse > > > > > > (1) Breaking memory reclaim (memory cannot get freed up) > > > > > > (2) Breaking page migration (memory cannot be migrated) > > > > > > Due to (1) we might experience bigger memory pressure in the system I guess. > > > A handful of these pages don't really hurt, I have no idea how bad having > > > many of these pages can be. But yes, inherently we cannot throw away the > > > data as long as it is dirty without causing harm. (maybe we could move it to > > > some other cache, like swap/zswap; but that smells like a big and > > > complicated project) > > > > > > Due to (2) we turn pages that are supposed to be movable possibly for a long > > > time unmovable. Even a *single* such page will mean that CMA allocations / > > > memory unplug can start failing. > > > > > > We have similar situations with page pinning. With things like O_DIRECT, our > > > assumption/experience so far is that it will only take a couple of seconds > > > max, and retry loops are sufficient to handle it. That's why only long-term > > > pinning ("indeterminate", e.g., vfio) migrate these pages out of > > > ZONE_MOVABLE/MIGRATE_CMA areas in order to long-term pin them. 
> > > > > > > > > The biggest concern I have is that timeouts, while likely reasonable it many > > > scenarios, might not be desirable even for some sane workloads, and the > > > default in all system will be "no timeout", letting the clueless admin of > > > each and every system out there that might support fuse to make a decision. > > > > > > I might have misunderstood something, in which case I am very sorry, but we > > > also don't want CMA allocations to start failing simply because a network > > > connection is down for a couple of minutes such that a fuse daemon cannot > > > make progress. > > > > > > > I think you have valid concerns but these are not new and not unique to > > fuse. Any filesystem with a potential arbitrary stall can have similar > > issues. The arbitrary stall can be caused due to network issues or some > > faultly local storage. > > What concerns me more is that this is can be triggered by even unprivileged > user space, and that there is no default protection as far as I understood, > because timeouts cannot be set universally to a sane defaults. > > Again, please correct me if I got that wrong. > Let's route this question to FUSE folks. More specifically: can an unprivileged process create a mount point backed by itself, create a lot of dirty (bound by cgroup) and writeback pages on it and let the writeback pages in that state forever? > > BTW, I just looked at NFS out of interest, in particular > nfs_page_async_flush(), and I spot some logic about re-dirtying pages + > canceling writeback. IIUC, there are default timeouts for UDP and TCP, > whereby the TCP default one seems to be around 60s (* retrans?), and the > privileged user that mounts it can set higher ones. I guess one could run > into similar writeback issues? Yes, I think so. > > So I wonder why we never required AS_WRITEBACK_INDETERMINATE for nfs? I feel like INDETERMINATE in the name is the main cause of confusion. 
So, let me explain why it is required (but later I will tell you how it can be avoided). The FUSE thread which is actively handling writeback of a given folio can cause a memory allocation, either through a syscall or a page fault. That memory allocation can trigger global reclaim synchronously and, in cgroup-v1, that FUSE thread can then wait on writeback of the very folio whose writeback it is supposed to end, causing a deadlock. So, AS_WRITEBACK_INDETERMINATE is used just to avoid this deadlock. In-kernel filesystems avoid this situation through the use of GFP_NOFS allocations. A userspace filesystem can use a similar approach, namely prctl(PR_SET_IO_FLUSHER, 1), to avoid this situation. However, I have been told that it is hard to use. First, it is a per-thread flag and has to be set for all the threads handling writeback, which can be error prone if the threadpool is dynamic. Second, it is very coarse: all the allocations from those threads (e.g. page faults) become NOFS, which makes userspace very unreliable on a highly utilized machine, as NOFS allocations cannot reclaim potentially a lot of memory and cannot trigger oom-kill. > Not > sure if I grasped all details about NFS and writeback and when it would > redirty+end writeback, and if there is some other handling in there. > [...] > > > > Please note that such filesystems are mostly used in environments like > > data center or hyperscalar and usually have more advanced mechanisms to > > handle and avoid situations like long delays. For such environment > > network unavailability is a larger issue than some cma allocation > > failure. My point is: let's not assume the disastrous situaion is normal > > and overcomplicate the solution. > > Let me summarize my main point: ZONE_MOVABLE/MIGRATE_CMA must only be used > for movable allocations.
> > Mechanisms that possible turn these folios unmovable for a > long/indeterminate time must either fail or migrate these folios out of > these regions, otherwise we start violating the very semantics why > ZONE_MOVABLE/MIGRATE_CMA was added in the first place. > > Yes, there are corner cases where we cannot guarantee movability (e.g., OOM > when allocating a migration destination), but these are not cases that can > be triggered by (unprivileged) user space easily. > > That's why FOLL_LONGTERM pinning does exactly that: even if user space would > promise that this is really only "short-term", we will treat it as "possibly > forever", because it's under user-space control. > > > Instead of having more subsystems violate these semantics because > "performance" ... I would hope we would do better. Maybe it's an issue for > NFS as well ("at least" only for privileged user space)? In which case, > again, I would hope we would do better. > > > Anyhow, I'm hoping there will be more feedback from other MM folks, but > likely right now a lot of people are out (just like I should ;) ). > > If I end up being the only one with these concerns, then likely people can > feel free to ignore them. ;) I agree we should do better but IMHO it should be an iterative process. I think your concerns are valid, so let's push the discussion towards resolving those concerns. I think the concerns can be resolved by better handling of the lifetime of folios under writeback. The amount of such folios is already handled through the existing dirty throttling mechanism. We should start with a baseline, i.e. the distribution of the lifetime of folios under writeback for traditional storage devices (spinning disks and SSDs), as we don't want to set an unrealistic goal for ourselves. I think this data will drive the appropriate timeout values (if we decide a timeout-based approach is the right one). At the moment we have a timeout-based approach to limit the lifetime of folios under writeback. Any other ideas?
^ permalink raw reply [flat|nested] 124+ messages in thread
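For reference, the prctl(PR_SET_IO_FLUSHER, 1) approach mentioned in the message above could look roughly like the following from a FUSE server thread. This is a minimal sketch, not code from the patchset or from libfuse: the helper name `mark_thread_io_flusher` is hypothetical, the fallback define assumes the constant value from linux/prctl.h, and the call needs Linux 5.6 or later plus CAP_SYS_RESOURCE.

```c
#include <errno.h>
#include <sys/prctl.h>

#ifndef PR_SET_IO_FLUSHER
#define PR_SET_IO_FLUSHER 57 /* value from linux/prctl.h (Linux 5.6+) */
#endif

/*
 * Hypothetical helper: mark the *calling thread* as an IO flusher.
 * The flag is per-thread, so every thread that may handle writeback
 * requests has to call this itself (e.g. first thing in its start routine).
 * Returns 0 on success or a negative errno value, for example -EPERM
 * without CAP_SYS_RESOURCE or -EINVAL on kernels lacking the option.
 */
int mark_thread_io_flusher(void)
{
    if (prctl(PR_SET_IO_FLUSHER, 1, 0, 0, 0) == -1)
        return -errno;
    return 0;
}
```

The two drawbacks raised in the message are visible here: nothing marks threads that are added to a dynamic pool later unless each one calls the helper, and once set, the flag affects every allocation made by that thread, not only those on the writeback path.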
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2024-12-26 20:13 ` Shakeel Butt @ 2024-12-26 22:02 ` Bernd Schubert 2024-12-27 20:08 ` Joanne Koong 2024-12-30 10:16 ` David Hildenbrand 2 siblings, 0 replies; 124+ messages in thread From: Bernd Schubert @ 2024-12-26 22:02 UTC (permalink / raw) To: Shakeel Butt, David Hildenbrand Cc: Joanne Koong, Zi Yan, miklos, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On 12/26/24 21:13, Shakeel Butt wrote: > On Tue, Dec 24, 2024 at 01:37:49PM +0100, David Hildenbrand wrote: >> On 23.12.24 23:14, Shakeel Butt wrote: >>> On Sat, Dec 21, 2024 at 05:18:20PM +0100, David Hildenbrand wrote: >>>> >>> >>> I think you have valid concerns but these are not new and not unique to >>> fuse. Any filesystem with a potential arbitrary stall can have similar >>> issues. The arbitrary stall can be caused due to network issues or some >>> faultly local storage. >> >> What concerns me more is that this is can be triggered by even unprivileged >> user space, and that there is no default protection as far as I understood, >> because timeouts cannot be set universally to a sane defaults. >> >> Again, please correct me if I got that wrong. >> > > Let's route this question to FUSE folks. More specifically: can an > unprivileged process create a mount point backed by itself, create a > lot of dirty (bound by cgroup) and writeback pages on it and let the > writeback pages in that state forever? libfuse provides 'fusermount', which has the setuid bit set. I think most distributions carry that over into their libfuse packages. The fuse-server process then continues to run as an arbitrary user. ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2024-12-26 20:13 ` Shakeel Butt 2024-12-26 22:02 ` Bernd Schubert @ 2024-12-27 20:08 ` Joanne Koong 2024-12-27 20:32 ` Bernd Schubert 2024-12-30 10:16 ` David Hildenbrand 2 siblings, 1 reply; 124+ messages in thread From: Joanne Koong @ 2024-12-27 20:08 UTC (permalink / raw) To: Shakeel Butt Cc: David Hildenbrand, Bernd Schubert, Zi Yan, miklos, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On Thu, Dec 26, 2024 at 12:13 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > On Tue, Dec 24, 2024 at 01:37:49PM +0100, David Hildenbrand wrote: > > On 23.12.24 23:14, Shakeel Butt wrote: > > > On Sat, Dec 21, 2024 at 05:18:20PM +0100, David Hildenbrand wrote: > > > [...] > > > > > > > > Yes, so I can see fuse > > > > > > > > (1) Breaking memory reclaim (memory cannot get freed up) > > > > > > > > (2) Breaking page migration (memory cannot be migrated) > > > > > > > > Due to (1) we might experience bigger memory pressure in the system I guess. > > > > A handful of these pages don't really hurt, I have no idea how bad having > > > > many of these pages can be. But yes, inherently we cannot throw away the > > > > data as long as it is dirty without causing harm. (maybe we could move it to > > > > some other cache, like swap/zswap; but that smells like a big and > > > > complicated project) > > > > > > > > Due to (2) we turn pages that are supposed to be movable possibly for a long > > > > time unmovable. Even a *single* such page will mean that CMA allocations / > > > > memory unplug can start failing. > > > > > > > > We have similar situations with page pinning. With things like O_DIRECT, our > > > > assumption/experience so far is that it will only take a couple of seconds > > > > max, and retry loops are sufficient to handle it. 
That's why only long-term > > > > pinning ("indeterminate", e.g., vfio) migrate these pages out of > > > > ZONE_MOVABLE/MIGRATE_CMA areas in order to long-term pin them. > > > > > > > > > > > > The biggest concern I have is that timeouts, while likely reasonable it many > > > > scenarios, might not be desirable even for some sane workloads, and the > > > > default in all system will be "no timeout", letting the clueless admin of > > > > each and every system out there that might support fuse to make a decision. > > > > > > > > I might have misunderstood something, in which case I am very sorry, but we > > > > also don't want CMA allocations to start failing simply because a network > > > > connection is down for a couple of minutes such that a fuse daemon cannot > > > > make progress. > > > > > > > > > > I think you have valid concerns but these are not new and not unique to > > > fuse. Any filesystem with a potential arbitrary stall can have similar > > > issues. The arbitrary stall can be caused due to network issues or some > > > faultly local storage. > > > > What concerns me more is that this is can be triggered by even unprivileged > > user space, and that there is no default protection as far as I understood, > > because timeouts cannot be set universally to a sane defaults. > > > > Again, please correct me if I got that wrong. > > > > Let's route this question to FUSE folks. More specifically: can an > unprivileged process create a mount point backed by itself, create a > lot of dirty (bound by cgroup) and writeback pages on it and let the > writeback pages in that state forever? > > > > > BTW, I just looked at NFS out of interest, in particular > > nfs_page_async_flush(), and I spot some logic about re-dirtying pages + > > canceling writeback. IIUC, there are default timeouts for UDP and TCP, > > whereby the TCP default one seems to be around 60s (* retrans?), and the > > privileged user that mounts it can set higher ones. 
I guess one could run > > into similar writeback issues? > > Yes, I think so. > > > > > So I wonder why we never required AS_WRITEBACK_INDETERMINATE for nfs? > > I feel like INDETERMINATE in the name is the main cause of confusion. > So, let me explain why it is required (but later I will tell you how it > can be avoided). The FUSE thread which is actively handling writeback of > a given folio can cause memory allocation either through syscall or page > fault. That memory allocation can trigger global reclaim synchronously > and in cgroup-v1, that FUSE thread can wait on the writeback on the same > folio whose writeback it is supposed to end and cauing a deadlock. So, > AS_WRITEBACK_INDETERMINATE is used to just avoid this deadlock. > > The in-kernel fs avoid this situation through the use of GFP_NOFS > allocations. The userspace fs can also use a similar approach which is > prctl(PR_SET_IO_FLUSHER, 1) to avoid this situation. However I have been > told that it is hard to use as it is per-thread flag and has to be set > for all the threads handling writeback which can be error prone if the > threadpool is dynamic. Second it is very coarse such that all the > allocations from those threads (e.g. page faults) become NOFS which > makes userspace very unreliable on highly utilized machine as NOFS can > not reclaim potentially a lot of memory and can not trigger oom-kill. > > > Not > > sure if I grasped all details about NFS and writeback and when it would > > redirty+end writeback, and if there is some other handling in there. > > > [...] > > > > > > Please note that such filesystems are mostly used in environments like > > > data center or hyperscalar and usually have more advanced mechanisms to > > > handle and avoid situations like long delays. For such environment > > > network unavailability is a larger issue than some cma allocation > > > failure. My point is: let's not assume the disastrous situaion is normal > > > and overcomplicate the solution. 
> > > > Let me summarize my main point: ZONE_MOVABLE/MIGRATE_CMA must only be used > > for movable allocations. > > > > Mechanisms that possible turn these folios unmovable for a > > long/indeterminate time must either fail or migrate these folios out of > > these regions, otherwise we start violating the very semantics why > > ZONE_MOVABLE/MIGRATE_CMA was added in the first place. > > > > Yes, there are corner cases where we cannot guarantee movability (e.g., OOM > > when allocating a migration destination), but these are not cases that can > > be triggered by (unprivileged) user space easily. > > > > That's why FOLL_LONGTERM pinning does exactly that: even if user space would > > promise that this is really only "short-term", we will treat it as "possibly > > forever", because it's under user-space control. > > > > > > Instead of having more subsystems violate these semantics because > > "performance" ... I would hope we would do better. Maybe it's an issue for > > NFS as well ("at least" only for privileged user space)? In which case, > > again, I would hope we would do better. > > > > > > Anyhow, I'm hoping there will be more feedback from other MM folks, but > > likely right now a lot of people are out (just like I should ;) ). > > > > If I end up being the only one with these concerns, then likely people can > > feel free to ignore them. ;) > > I agree we should do better but IMHO it should be an iterative process. > I think your concerns are valid, so let's push the discussion towards > resolving those concerns. I think the concerns can be resolved by better > handling of lifetime of folios under writeback. The amount of such > folios is already handled through existing dirty throttling mechanism. > > We should start with a baseline i.e. distribution of lifetime of folios > under writeback for traditional storage devices (spinning disk and SSDs) > as we don't want an unrealistic goal for ourself. 
I think this data will
> drive the appropriate timeout values (if we decide timeout based
> approach is the right one).
>
> At the moment we have timeout based approach to limit the lifetime of
> folios under writeback. Any other ideas?

I don't see any other approach that would handle splice, other than
modifying the splice code to prevent the underlying buf->page from
being migrated while it's being copied out, which seems non-viable to
consider. The other alternatives I see are to either a) do the extra
temp page copying for splice and "abort" the writeback if migration is
triggered or b) gate this to only apply to servers running as
privileged. I assume the majority of use cases do use splice, in which
case a) would be pointless and would make the internal logic more
complicated (e.g., we would still need the rb tree and would now need to
check writeback against the folio writeback state or the rb tree,
etc.). I'm not sure how useful this would be either if this is just
gated to privileged servers.

Thanks,
Joanne

^ permalink raw reply [flat|nested] 124+ messages in thread
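For reference, the prctl(PR_SET_IO_FLUSHER, 1) approach mentioned in the quoted text can be sketched as below. This is only a minimal illustration, not code from the patchset or from libfuse: the helper names set_io_flusher() and writeback_worker_init() are hypothetical, and the fallback #defines assume the values from linux/prctl.h. The flag is per-thread, so every thread that handles writeback would have to run the init helper itself, which is the usability problem described above. The call needs CAP_SYS_RESOURCE, so an unprivileged server should expect it to fail.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/prctl.h>

/* Older userspace headers may lack these; values assumed from linux/prctl.h. */
#ifndef PR_SET_IO_FLUSHER
#define PR_SET_IO_FLUSHER 57
#endif
#ifndef PR_GET_IO_FLUSHER
#define PR_GET_IO_FLUSHER 58
#endif

/*
 * Hypothetical helper: set or clear the IO-flusher state of the
 * *calling thread* only. Returns 0 on success, negative errno on failure.
 */
static int set_io_flusher(int enable)
{
	assert(enable == 0 || enable == 1);

	if (prctl(PR_SET_IO_FLUSHER, enable, 0, 0, 0) == -1)
		return -errno;

	/* Read the state back; treat a mismatch as an error. */
	int state = prctl(PR_GET_IO_FLUSHER, 0, 0, 0, 0);
	if (state == -1)
		return -errno;
	return state == enable ? 0 : -EINVAL;
}

/*
 * Hypothetical per-thread init: each writeback worker must call this
 * itself, because the flag set by another thread does not apply to it.
 */
static int writeback_worker_init(void)
{
	int rc = set_io_flusher(1);

	if (rc < 0)
		fprintf(stderr, "PR_SET_IO_FLUSHER failed: %s\n", strerror(-rc));
	return rc;
}
```

With a dynamic thread pool, forgetting this call in a newly spawned worker silently reintroduces the reclaim deadlock for that thread, and once set, all allocations from the thread are restricted, including page faults.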
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2024-12-27 20:08 ` Joanne Koong @ 2024-12-27 20:32 ` Bernd Schubert 2024-12-30 17:52 ` Joanne Koong 0 siblings, 1 reply; 124+ messages in thread From: Bernd Schubert @ 2024-12-27 20:32 UTC (permalink / raw) To: Joanne Koong, Shakeel Butt Cc: David Hildenbrand, Zi Yan, miklos, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On 12/27/24 21:08, Joanne Koong wrote: > On Thu, Dec 26, 2024 at 12:13 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: >> >> On Tue, Dec 24, 2024 at 01:37:49PM +0100, David Hildenbrand wrote: >>> On 23.12.24 23:14, Shakeel Butt wrote: >>>> On Sat, Dec 21, 2024 at 05:18:20PM +0100, David Hildenbrand wrote: >>>> [...] >>>>> >>>>> Yes, so I can see fuse >>>>> >>>>> (1) Breaking memory reclaim (memory cannot get freed up) >>>>> >>>>> (2) Breaking page migration (memory cannot be migrated) >>>>> >>>>> Due to (1) we might experience bigger memory pressure in the system I guess. >>>>> A handful of these pages don't really hurt, I have no idea how bad having >>>>> many of these pages can be. But yes, inherently we cannot throw away the >>>>> data as long as it is dirty without causing harm. (maybe we could move it to >>>>> some other cache, like swap/zswap; but that smells like a big and >>>>> complicated project) >>>>> >>>>> Due to (2) we turn pages that are supposed to be movable possibly for a long >>>>> time unmovable. Even a *single* such page will mean that CMA allocations / >>>>> memory unplug can start failing. >>>>> >>>>> We have similar situations with page pinning. With things like O_DIRECT, our >>>>> assumption/experience so far is that it will only take a couple of seconds >>>>> max, and retry loops are sufficient to handle it. 
That's why only long-term >>>>> pinning ("indeterminate", e.g., vfio) migrate these pages out of >>>>> ZONE_MOVABLE/MIGRATE_CMA areas in order to long-term pin them. >>>>> >>>>> >>>>> The biggest concern I have is that timeouts, while likely reasonable it many >>>>> scenarios, might not be desirable even for some sane workloads, and the >>>>> default in all system will be "no timeout", letting the clueless admin of >>>>> each and every system out there that might support fuse to make a decision. >>>>> >>>>> I might have misunderstood something, in which case I am very sorry, but we >>>>> also don't want CMA allocations to start failing simply because a network >>>>> connection is down for a couple of minutes such that a fuse daemon cannot >>>>> make progress. >>>>> >>>> >>>> I think you have valid concerns but these are not new and not unique to >>>> fuse. Any filesystem with a potential arbitrary stall can have similar >>>> issues. The arbitrary stall can be caused due to network issues or some >>>> faultly local storage. >>> >>> What concerns me more is that this is can be triggered by even unprivileged >>> user space, and that there is no default protection as far as I understood, >>> because timeouts cannot be set universally to a sane defaults. >>> >>> Again, please correct me if I got that wrong. >>> >> >> Let's route this question to FUSE folks. More specifically: can an >> unprivileged process create a mount point backed by itself, create a >> lot of dirty (bound by cgroup) and writeback pages on it and let the >> writeback pages in that state forever? >> >>> >>> BTW, I just looked at NFS out of interest, in particular >>> nfs_page_async_flush(), and I spot some logic about re-dirtying pages + >>> canceling writeback. IIUC, there are default timeouts for UDP and TCP, >>> whereby the TCP default one seems to be around 60s (* retrans?), and the >>> privileged user that mounts it can set higher ones. I guess one could run >>> into similar writeback issues? 
>> >> Yes, I think so. >> >>> >>> So I wonder why we never required AS_WRITEBACK_INDETERMINATE for nfs? >> >> I feel like INDETERMINATE in the name is the main cause of confusion. >> So, let me explain why it is required (but later I will tell you how it >> can be avoided). The FUSE thread which is actively handling writeback of >> a given folio can cause memory allocation either through syscall or page >> fault. That memory allocation can trigger global reclaim synchronously >> and in cgroup-v1, that FUSE thread can wait on the writeback on the same >> folio whose writeback it is supposed to end and cauing a deadlock. So, >> AS_WRITEBACK_INDETERMINATE is used to just avoid this deadlock. >> >> The in-kernel fs avoid this situation through the use of GFP_NOFS >> allocations. The userspace fs can also use a similar approach which is >> prctl(PR_SET_IO_FLUSHER, 1) to avoid this situation. However I have been >> told that it is hard to use as it is per-thread flag and has to be set >> for all the threads handling writeback which can be error prone if the >> threadpool is dynamic. Second it is very coarse such that all the >> allocations from those threads (e.g. page faults) become NOFS which >> makes userspace very unreliable on highly utilized machine as NOFS can >> not reclaim potentially a lot of memory and can not trigger oom-kill. >> >>> Not >>> sure if I grasped all details about NFS and writeback and when it would >>> redirty+end writeback, and if there is some other handling in there. >>> >> [...] >>>> >>>> Please note that such filesystems are mostly used in environments like >>>> data center or hyperscalar and usually have more advanced mechanisms to >>>> handle and avoid situations like long delays. For such environment >>>> network unavailability is a larger issue than some cma allocation >>>> failure. My point is: let's not assume the disastrous situaion is normal >>>> and overcomplicate the solution. 
>>> >>> Let me summarize my main point: ZONE_MOVABLE/MIGRATE_CMA must only be used >>> for movable allocations. >>> >>> Mechanisms that possible turn these folios unmovable for a >>> long/indeterminate time must either fail or migrate these folios out of >>> these regions, otherwise we start violating the very semantics why >>> ZONE_MOVABLE/MIGRATE_CMA was added in the first place. >>> >>> Yes, there are corner cases where we cannot guarantee movability (e.g., OOM >>> when allocating a migration destination), but these are not cases that can >>> be triggered by (unprivileged) user space easily. >>> >>> That's why FOLL_LONGTERM pinning does exactly that: even if user space would >>> promise that this is really only "short-term", we will treat it as "possibly >>> forever", because it's under user-space control. >>> >>> >>> Instead of having more subsystems violate these semantics because >>> "performance" ... I would hope we would do better. Maybe it's an issue for >>> NFS as well ("at least" only for privileged user space)? In which case, >>> again, I would hope we would do better. >>> >>> >>> Anyhow, I'm hoping there will be more feedback from other MM folks, but >>> likely right now a lot of people are out (just like I should ;) ). >>> >>> If I end up being the only one with these concerns, then likely people can >>> feel free to ignore them. ;) >> >> I agree we should do better but IMHO it should be an iterative process. >> I think your concerns are valid, so let's push the discussion towards >> resolving those concerns. I think the concerns can be resolved by better >> handling of lifetime of folios under writeback. The amount of such >> folios is already handled through existing dirty throttling mechanism. >> >> We should start with a baseline i.e. distribution of lifetime of folios >> under writeback for traditional storage devices (spinning disk and SSDs) >> as we don't want an unrealistic goal for ourself. 
I think this data will >> drive the appropriate timeout values (if we decide timeout based >> approach is the right one). >> >> At the moment we have timeout based approach to limit the lifetime of >> folios under writeback. Any other ideas? > > I don't see any other approach that would handle splice, other than > modifying the splice code to prevent the underlying buf->page from > being migrated while it's being copied out, which seems non-viable to > consider. The other alternatives I see are to either a) do the extra > temp page copying for splice and "abort" the writeback if migration is > triggered or b) gate this to only apply to servers running as > privileged. I assume the majority of use cases do use splice, in which > case a) would be pointless and would make the internal logic more > complicated (eg we would still need the rb tree and would now need to > check writeback against the folio writeback state or the rb tree, > etc). I'm not sure how useful this would be either if this is just > gated to privileged servers. I'm not so sure about that majority of unprivileged servers. Try this patch and then run an unprivileged process. 
diff --git a/lib/fuse_lowlevel.c b/lib/fuse_lowlevel.c
index ee0b3b1d0470..adebfbc03d4c 100644
--- a/lib/fuse_lowlevel.c
+++ b/lib/fuse_lowlevel.c
@@ -3588,6 +3588,7 @@ static int _fuse_session_receive_buf(struct fuse_session *se,
 			res = fcntl(llp->pipe[0], F_SETPIPE_SZ, bufsize);
 			if (res == -1) {
 				llp->can_grow = 0;
+				fuse_log(FUSE_LOG_ERR, "cannot grow pipe\n");
 				res = grow_pipe_to_max(llp->pipe[0]);
 				if (res > 0)
 					llp->size = res;
@@ -3678,6 +3679,7 @@ static int _fuse_session_receive_buf(struct fuse_session *se,
 
 	} else {
 		/* Don't overwrite buf->mem, as that would cause a leak */
+		fuse_log(FUSE_LOG_WARNING, "Using splice\n");
 		buf->fd = tmpbuf.fd;
 		buf->flags = tmpbuf.flags;
 	}
@@ -3687,6 +3689,7 @@ static int _fuse_session_receive_buf(struct fuse_session *se,
 
 fallback:
 #endif
+	fuse_log(FUSE_LOG_WARNING, "Splice fallback\n");
 	if (!buf->mem) {
 		buf->mem = buf_alloc(se->bufsize, internal);
 		if (!buf->mem) {


And then run this again after
sudo sysctl -w fs.pipe-max-size=1052672

(Please don't change '/proc/sys/fs/fuse/max_pages_limit'
from default).

And now we would need to know how many users either limit
max-pages + header to fit default pipe-max-size (1MB) or
increase max_pages_limit. Given there is no warning in
libfuse about the fallback from splice to buf copy, I doubt
many people know about that - who would change system
defaults without the knowledge?


And then, I still doubt that copy-to-tmp-page-and-splice
is any faster than no-tmp-page-copy-but-copy-to-lib-fuse-buffer.
Especially as the tmp page copy is single threaded, I think.
But needs to be benchmarked.


Thanks,
Bernd

^ permalink raw reply related [flat|nested] 124+ messages in thread
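The sizes in the message above can be checked with a small sketch. The assumptions here are mine and not stated in the thread: 4 KiB pages, a default max_pages_limit of 256, and a one-page (4096-byte) request header, which is what the figure 1052672 = 1048576 + 4096 suggests. The helper names are hypothetical. The point is only that the required pipe size exceeds the default 1 MiB pipe-max-size, so an unprivileged F_SETPIPE_SZ to that size is expected to fail and the splice path falls back to a buffer copy.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Fallback for headers that hide the Linux-specific fcntl command. */
#ifndef F_SETPIPE_SZ
#define F_SETPIPE_SZ 1031
#endif

/* Bytes the pipe must hold: the data pages plus the request header (assumed sizes). */
static long needed_pipe_size(long max_pages, long page_size, long header_size)
{
	assert(max_pages > 0 && page_size > 0 && header_size >= 0);
	return max_pages * page_size + header_size;
}

/* Read the cap that applies to unprivileged F_SETPIPE_SZ; -1 on error. */
static long read_pipe_max_size(void)
{
	long val = -1;
	FILE *f = fopen("/proc/sys/fs/pipe-max-size", "r");

	if (!f)
		return -1;
	if (fscanf(f, "%ld", &val) != 1)
		val = -1;
	fclose(f);
	return val;
}

/* Try to resize a pipe; returns the resulting capacity or negative errno. */
static long try_grow_pipe(int pipe_fd, long wanted)
{
	int res = fcntl(pipe_fd, F_SETPIPE_SZ, (int)wanted);

	if (res == -1)
		return -errno;
	return res; /* the kernel reports the capacity actually set */
}
```

Under these assumptions needed_pipe_size(256, 4096, 4096) is 1052672, which is 4096 bytes more than a pipe-max-size of 1048576. Comparing that value with read_pipe_max_size() on a given machine shows whether an unprivileged server could use splice there without the sysctl change.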
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2024-12-27 20:32 ` Bernd Schubert @ 2024-12-30 17:52 ` Joanne Koong 0 siblings, 0 replies; 124+ messages in thread From: Joanne Koong @ 2024-12-30 17:52 UTC (permalink / raw) To: Bernd Schubert Cc: Shakeel Butt, David Hildenbrand, Zi Yan, miklos, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On Fri, Dec 27, 2024 at 12:32 PM Bernd Schubert <bernd.schubert@fastmail.fm> wrote: > > On 12/27/24 21:08, Joanne Koong wrote: > > On Thu, Dec 26, 2024 at 12:13 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > >> > >> On Tue, Dec 24, 2024 at 01:37:49PM +0100, David Hildenbrand wrote: > >>> On 23.12.24 23:14, Shakeel Butt wrote: > >>>> On Sat, Dec 21, 2024 at 05:18:20PM +0100, David Hildenbrand wrote: > >>>> [...] > >>>>> > >>>>> Yes, so I can see fuse > >>>>> > >>>>> (1) Breaking memory reclaim (memory cannot get freed up) > >>>>> > >>>>> (2) Breaking page migration (memory cannot be migrated) > >>>>> > >>>>> Due to (1) we might experience bigger memory pressure in the system I guess. > >>>>> A handful of these pages don't really hurt, I have no idea how bad having > >>>>> many of these pages can be. But yes, inherently we cannot throw away the > >>>>> data as long as it is dirty without causing harm. (maybe we could move it to > >>>>> some other cache, like swap/zswap; but that smells like a big and > >>>>> complicated project) > >>>>> > >>>>> Due to (2) we turn pages that are supposed to be movable possibly for a long > >>>>> time unmovable. Even a *single* such page will mean that CMA allocations / > >>>>> memory unplug can start failing. > >>>>> > >>>>> We have similar situations with page pinning. With things like O_DIRECT, our > >>>>> assumption/experience so far is that it will only take a couple of seconds > >>>>> max, and retry loops are sufficient to handle it. 
That's why only long-term > >>>>> pinning ("indeterminate", e.g., vfio) migrate these pages out of > >>>>> ZONE_MOVABLE/MIGRATE_CMA areas in order to long-term pin them. > >>>>> > >>>>> > >>>>> The biggest concern I have is that timeouts, while likely reasonable it many > >>>>> scenarios, might not be desirable even for some sane workloads, and the > >>>>> default in all system will be "no timeout", letting the clueless admin of > >>>>> each and every system out there that might support fuse to make a decision. > >>>>> > >>>>> I might have misunderstood something, in which case I am very sorry, but we > >>>>> also don't want CMA allocations to start failing simply because a network > >>>>> connection is down for a couple of minutes such that a fuse daemon cannot > >>>>> make progress. > >>>>> > >>>> > >>>> I think you have valid concerns but these are not new and not unique to > >>>> fuse. Any filesystem with a potential arbitrary stall can have similar > >>>> issues. The arbitrary stall can be caused due to network issues or some > >>>> faultly local storage. > >>> > >>> What concerns me more is that this is can be triggered by even unprivileged > >>> user space, and that there is no default protection as far as I understood, > >>> because timeouts cannot be set universally to a sane defaults. > >>> > >>> Again, please correct me if I got that wrong. > >>> > >> > >> Let's route this question to FUSE folks. More specifically: can an > >> unprivileged process create a mount point backed by itself, create a > >> lot of dirty (bound by cgroup) and writeback pages on it and let the > >> writeback pages in that state forever? > >> > >>> > >>> BTW, I just looked at NFS out of interest, in particular > >>> nfs_page_async_flush(), and I spot some logic about re-dirtying pages + > >>> canceling writeback. 
IIUC, there are default timeouts for UDP and TCP, > >>> whereby the TCP default one seems to be around 60s (* retrans?), and the > >>> privileged user that mounts it can set higher ones. I guess one could run > >>> into similar writeback issues? > >> > >> Yes, I think so. > >> > >>> > >>> So I wonder why we never required AS_WRITEBACK_INDETERMINATE for nfs? > >> > >> I feel like INDETERMINATE in the name is the main cause of confusion. > >> So, let me explain why it is required (but later I will tell you how it > >> can be avoided). The FUSE thread which is actively handling writeback of > >> a given folio can cause memory allocation either through syscall or page > >> fault. That memory allocation can trigger global reclaim synchronously > >> and in cgroup-v1, that FUSE thread can wait on the writeback on the same > >> folio whose writeback it is supposed to end and cauing a deadlock. So, > >> AS_WRITEBACK_INDETERMINATE is used to just avoid this deadlock. > >> > >> The in-kernel fs avoid this situation through the use of GFP_NOFS > >> allocations. The userspace fs can also use a similar approach which is > >> prctl(PR_SET_IO_FLUSHER, 1) to avoid this situation. However I have been > >> told that it is hard to use as it is per-thread flag and has to be set > >> for all the threads handling writeback which can be error prone if the > >> threadpool is dynamic. Second it is very coarse such that all the > >> allocations from those threads (e.g. page faults) become NOFS which > >> makes userspace very unreliable on highly utilized machine as NOFS can > >> not reclaim potentially a lot of memory and can not trigger oom-kill. > >> > >>> Not > >>> sure if I grasped all details about NFS and writeback and when it would > >>> redirty+end writeback, and if there is some other handling in there. > >>> > >> [...] 
> >>>> > >>>> Please note that such filesystems are mostly used in environments like > >>>> data center or hyperscalar and usually have more advanced mechanisms to > >>>> handle and avoid situations like long delays. For such environment > >>>> network unavailability is a larger issue than some cma allocation > >>>> failure. My point is: let's not assume the disastrous situaion is normal > >>>> and overcomplicate the solution. > >>> > >>> Let me summarize my main point: ZONE_MOVABLE/MIGRATE_CMA must only be used > >>> for movable allocations. > >>> > >>> Mechanisms that possible turn these folios unmovable for a > >>> long/indeterminate time must either fail or migrate these folios out of > >>> these regions, otherwise we start violating the very semantics why > >>> ZONE_MOVABLE/MIGRATE_CMA was added in the first place. > >>> > >>> Yes, there are corner cases where we cannot guarantee movability (e.g., OOM > >>> when allocating a migration destination), but these are not cases that can > >>> be triggered by (unprivileged) user space easily. > >>> > >>> That's why FOLL_LONGTERM pinning does exactly that: even if user space would > >>> promise that this is really only "short-term", we will treat it as "possibly > >>> forever", because it's under user-space control. > >>> > >>> > >>> Instead of having more subsystems violate these semantics because > >>> "performance" ... I would hope we would do better. Maybe it's an issue for > >>> NFS as well ("at least" only for privileged user space)? In which case, > >>> again, I would hope we would do better. > >>> > >>> > >>> Anyhow, I'm hoping there will be more feedback from other MM folks, but > >>> likely right now a lot of people are out (just like I should ;) ). > >>> > >>> If I end up being the only one with these concerns, then likely people can > >>> feel free to ignore them. ;) > >> > >> I agree we should do better but IMHO it should be an iterative process. 
> >> I think your concerns are valid, so let's push the discussion towards > >> resolving those concerns. I think the concerns can be resolved by better > >> handling of lifetime of folios under writeback. The amount of such > >> folios is already handled through existing dirty throttling mechanism. > >> > >> We should start with a baseline i.e. distribution of lifetime of folios > >> under writeback for traditional storage devices (spinning disk and SSDs) > >> as we don't want an unrealistic goal for ourself. I think this data will > >> drive the appropriate timeout values (if we decide timeout based > >> approach is the right one). > >> > >> At the moment we have timeout based approach to limit the lifetime of > >> folios under writeback. Any other ideas? > > > > I don't see any other approach that would handle splice, other than > > modifying the splice code to prevent the underlying buf->page from > > being migrated while it's being copied out, which seems non-viable to > > consider. The other alternatives I see are to either a) do the extra > > temp page copying for splice and "abort" the writeback if migration is > > triggered or b) gate this to only apply to servers running as > > privileged. I assume the majority of use cases do use splice, in which > > case a) would be pointless and would make the internal logic more > > complicated (eg we would still need the rb tree and would now need to > > check writeback against the folio writeback state or the rb tree, > > etc). I'm not sure how useful this would be either if this is just > > gated to privileged servers. > > > I'm not so sure about that majority of unprivileged servers. > Try this patch and then run an unprivileged process. 
> > diff --git a/lib/fuse_lowlevel.c b/lib/fuse_lowlevel.c > index ee0b3b1d0470..adebfbc03d4c 100644 > --- a/lib/fuse_lowlevel.c > +++ b/lib/fuse_lowlevel.c > @@ -3588,6 +3588,7 @@ static int _fuse_session_receive_buf(struct fuse_session *se, > res = fcntl(llp->pipe[0], F_SETPIPE_SZ, bufsize); > if (res == -1) { > llp->can_grow = 0; > + fuse_log(FUSE_LOG_ERR, "cannot grow pipe\n"); > res = grow_pipe_to_max(llp->pipe[0]); > if (res > 0) > llp->size = res; > @@ -3678,6 +3679,7 @@ static int _fuse_session_receive_buf(struct fuse_session *se, > > } else { > /* Don't overwrite buf->mem, as that would cause a leak */ > + fuse_log(FUSE_LOG_WARNING, "Using splice\n"); > buf->fd = tmpbuf.fd; > buf->flags = tmpbuf.flags; > } > @@ -3687,6 +3689,7 @@ static int _fuse_session_receive_buf(struct fuse_session *se, > > fallback: > #endif > + fuse_log(FUSE_LOG_WARNING, "Splice fallback\n"); > if (!buf->mem) { > buf->mem = buf_alloc(se->bufsize, internal); > if (!buf->mem) { > > > And then run this again after > sudo sysctl -w fs.pipe-max-size=1052672 > > (Please don't change '/proc/sys/fs/fuse/max_pages_limit' > from default). > > And now we would need to know how many users either limit > max-pages + header to fit default pipe-max-size (1MB) or > increase max_pages_limit. Given there is no warning in > libfuse about the fallback from splice to buf copy, I doubt > many people know about that - who would change system > defaults without the knowledge? > My concern is that this would break backwards compatibility for the rare subset of users who use their own custom library instead of libfuse, who expect splice to work as-is and might not have this in-built fallback to buffer copies. Thanks, Joanne > > And then, I still doubt that copy-to-tmp-page-and-splice > is any faster than no-tmp-page-copy-but-copy-to-lib-fuse-buffer. > Especially as the tmp page copy is single threaded, I think. > But needs to be benchmarked. 
> > > Thanks, > Bernd > > > ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2024-12-26 20:13 ` Shakeel Butt 2024-12-26 22:02 ` Bernd Schubert 2024-12-27 20:08 ` Joanne Koong @ 2024-12-30 10:16 ` David Hildenbrand 2024-12-30 18:38 ` Joanne Koong 2 siblings, 1 reply; 124+ messages in thread From: David Hildenbrand @ 2024-12-30 10:16 UTC (permalink / raw) To: Shakeel Butt Cc: Bernd Schubert, Joanne Koong, Zi Yan, miklos, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko >> BTW, I just looked at NFS out of interest, in particular >> nfs_page_async_flush(), and I spot some logic about re-dirtying pages + >> canceling writeback. IIUC, there are default timeouts for UDP and TCP, >> whereby the TCP default one seems to be around 60s (* retrans?), and the >> privileged user that mounts it can set higher ones. I guess one could run >> into similar writeback issues? > Hi, sorry for the late reply. > Yes, I think so. > >> >> So I wonder why we never required AS_WRITEBACK_INDETERMINATE for nfs? > > I feel like INDETERMINATE in the name is the main cause of confusion. We are adding logic that says "unconditionally, never wait on writeback for these folios, not even any sync migration". That's the main problem I have. Your explanation below is helpful. Because ... > So, let me explain why it is required (but later I will tell you how it > can be avoided). The FUSE thread which is actively handling writeback of > a given folio can cause memory allocation either through syscall or page > fault. That memory allocation can trigger global reclaim synchronously > and in cgroup-v1, that FUSE thread can wait on the writeback on the same > folio whose writeback it is supposed to end and cauing a deadlock. So, > AS_WRITEBACK_INDETERMINATE is used to just avoid this deadlock. > > The in-kernel fs avoid this situation through the use of GFP_NOFS > allocations. 
The userspace fs can also use a similar approach which is
> prctl(PR_SET_IO_FLUSHER, 1) to avoid this situation. However I have been
> told that it is hard to use as it is per-thread flag and has to be set
> for all the threads handling writeback which can be error prone if the
> threadpool is dynamic. Second it is very coarse such that all the
> allocations from those threads (e.g. page faults) become NOFS which
> makes userspace very unreliable on highly utilized machine as NOFS can
> not reclaim potentially a lot of memory and can not trigger oom-kill.
>

... now I understand that we want to prevent a deadlock in one specific
scenario only?

What sounds plausible to me is:

a) Make this only affect the actual deadlock path: sync migration
   during compaction. Communicate it either using some "context"
   information or with a new MIGRATE_SYNC_COMPACTION.
b) Call it something like AS_WRITEBACK_MIGHT_DEADLOCK_ON_RECLAIM to
   express that very deadlock problem.
c) Leave all other sync migration users alone for now.

Would that prevent the deadlock? Even *better* would be to be able to
ask the fs if starting writeback on a specific folio could deadlock.
Because in most cases, as I understand, we'll not actually run into the
deadlock and would just want to wait for writeback to complete
(esp. compaction).

(I still think having folios under writeback for a long time might be a
problem, but that's indeed something to sort out separately in the
future, because I suspect NFS has similar issues. We'd want to "wait
with timeout" and e.g., cancel writeback during memory
offlining/alloc_cma ...)

>> Not
>> sure if I grasped all details about NFS and writeback and when it would
>> redirty+end writeback, and if there is some other handling in there.
>>
> [...]
>>>
>>> Please note that such filesystems are mostly used in environments like
>>> data center or hyperscaler and usually have more advanced mechanisms to
>>> handle and avoid situations like long delays.
For such environment
>>> network unavailability is a larger issue than some cma allocation
>>> failure. My point is: let's not assume the disastrous situation is normal
>>> and overcomplicate the solution.
>>
>> Let me summarize my main point: ZONE_MOVABLE/MIGRATE_CMA must only be used
>> for movable allocations.
>>
>> Mechanisms that possibly turn these folios unmovable for a
>> long/indeterminate time must either fail or migrate these folios out of
>> these regions, otherwise we start violating the very semantics why
>> ZONE_MOVABLE/MIGRATE_CMA was added in the first place.
>>
>> Yes, there are corner cases where we cannot guarantee movability (e.g., OOM
>> when allocating a migration destination), but these are not cases that can
>> be triggered by (unprivileged) user space easily.
>>
>> That's why FOLL_LONGTERM pinning does exactly that: even if user space would
>> promise that this is really only "short-term", we will treat it as "possibly
>> forever", because it's under user-space control.
>>
>>
>> Instead of having more subsystems violate these semantics because
>> "performance" ... I would hope we would do better. Maybe it's an issue for
>> NFS as well ("at least" only for privileged user space)? In which case,
>> again, I would hope we would do better.
>>
>>
>> Anyhow, I'm hoping there will be more feedback from other MM folks, but
>> likely right now a lot of people are out (just like I should ;) ).
>>
>> If I end up being the only one with these concerns, then likely people can
>> feel free to ignore them. ;)
>
> I agree we should do better but IMHO it should be an iterative process.
> I think your concerns are valid, so let's push the discussion towards
> resolving those concerns. I think the concerns can be resolved by better
> handling of lifetime of folios under writeback. The amount of such
> folios is already handled through existing dirty throttling mechanism.
>
> We should start with a baseline i.e.
distribution of lifetime of folios > under writeback for traditional storage devices (spinning disk and SSDs) > as we don't want an unrealistic goal for ourself. I think this data will > drive the appropriate timeout values (if we decide timeout based > approach is the right one). > > At the moment we have timeout based approach to limit the lifetime of > folios under writeback. Any other ideas? See above, maybe we could limit the deadlock avoidance to the actual deadlock path and sort out the "infinite writeback in some corner cases" problem separately. -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2024-12-30 10:16 ` David Hildenbrand @ 2024-12-30 18:38 ` Joanne Koong 2024-12-30 19:52 ` David Hildenbrand 2024-12-30 20:04 ` Shakeel Butt 0 siblings, 2 replies; 124+ messages in thread From: Joanne Koong @ 2024-12-30 18:38 UTC (permalink / raw) To: David Hildenbrand Cc: Shakeel Butt, Bernd Schubert, Zi Yan, miklos, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On Mon, Dec 30, 2024 at 2:16 AM David Hildenbrand <david@redhat.com> wrote: > > >> BTW, I just looked at NFS out of interest, in particular > >> nfs_page_async_flush(), and I spot some logic about re-dirtying pages + > >> canceling writeback. IIUC, there are default timeouts for UDP and TCP, > >> whereby the TCP default one seems to be around 60s (* retrans?), and the > >> privileged user that mounts it can set higher ones. I guess one could run > >> into similar writeback issues? > > > > Hi, > > sorry for the late reply. > > > Yes, I think so. > > > >> > >> So I wonder why we never required AS_WRITEBACK_INDETERMINATE for nfs? > > > > I feel like INDETERMINATE in the name is the main cause of confusion. > > We are adding logic that says "unconditionally, never wait on writeback > for these folios, not even any sync migration". That's the main problem > I have. > > Your explanation below is helpful. Because ... > > > So, let me explain why it is required (but later I will tell you how it > > can be avoided). The FUSE thread which is actively handling writeback of > > a given folio can cause memory allocation either through syscall or page > > fault. That memory allocation can trigger global reclaim synchronously > > and in cgroup-v1, that FUSE thread can wait on the writeback on the same > > folio whose writeback it is supposed to end and causing a deadlock. So, > > AS_WRITEBACK_INDETERMINATE is used to just avoid this deadlock. 
> > > The in-kernel fs avoids this situation through the use of GFP_NOFS > > allocations. The userspace fs can also use a similar approach which is > > prctl(PR_SET_IO_FLUSHER, 1) to avoid this situation. However I have been > > told that it is hard to use as it is a per-thread flag and has to be set > > for all the threads handling writeback which can be error prone if the > > threadpool is dynamic. Second it is very coarse such that all the > > allocations from those threads (e.g. page faults) become NOFS which > > makes userspace very unreliable on a highly utilized machine as NOFS can > > not reclaim potentially a lot of memory and can not trigger oom-kill. > > > > ... now I understand that we want to prevent a deadlock in one specific > scenario only? > > What sounds plausible to me is: > > a) Make this only affect the actual deadlock path: sync migration > during compaction. Communicate it either using some "context" > information or with a new MIGRATE_SYNC_COMPACTION. > b) Call it sth. like AS_WRITEBACK_MIGHT_DEADLOCK_ON_RECLAIM to express > that very deadlock problem. > c) Leave all other sync migration users alone for now. The deadlock path is separate from sync migration. The deadlock arises from a corner case where cgroupv1 reclaim waits on a folio under writeback where that writeback itself is blocked on reclaim. > > Would that prevent the deadlock? Even *better* would be to be able to > ask the fs if starting writeback on a specific folio could deadlock. > Because in most cases, as I understand, we'll not actually run into the > deadlock and would just want to wait for writeback to complete > (esp. compaction). > > (I still think having folios under writeback for a long time might be a > problem, but that's indeed something to sort out separately in the > future, because I suspect NFS has similar issues. We'd want to "wait > with timeout" and e.g., cancel writeback during memory > offlining/alloc_cma ...) 
I'm looking back at some of the discussions in v2 [1] and I'm still not clear on how memory fragmentation for non-movable pages differs from memory fragmentation from movable pages and whether one is worse than the other. Currently fuse uses movable temp pages (allocated with gfp flags GFP_NOFS | __GFP_HIGHMEM), and these can run into the same issue where a buggy/malicious server may never complete writeback. This has the same effect of fragmenting memory and has a worse memory cost to the system in terms of memory used. With not having temp pages though, now in this scenario, pages allocated in a movable page block can't be compacted and that memory is fragmented. My (basic and maybe incorrect) understanding is that memory gets allocated through a buddy allocator and movable vs nonmovable pages get allocated to corresponding blocks that match their type, but there's no other difference otherwise. Is this understanding correct? Or is there some substantial difference between fragmentation for movable vs nonmovable blocks? Thanks, Joanne [1] https://lore.kernel.org/linux-fsdevel/20241014182228.1941246-1-joannelkoong@gmail.com/T/#m7637e26a559db86348461ebc1104352083085d6d > > >> Not > >> sure if I grasped all details about NFS and writeback and when it would > >> redirty+end writeback, and if there is some other handling in there. > >> > > [...] > >>> > >>> Please note that such filesystems are mostly used in environments like > >>> data center or hyperscaler and usually have more advanced mechanisms to > >>> handle and avoid situations like long delays. For such environments > >>> network unavailability is a larger issue than some cma allocation > >>> failure. My point is: let's not assume the disastrous situation is normal > >>> and overcomplicate the solution. > >> > >> Let me summarize my main point: ZONE_MOVABLE/MIGRATE_CMA must only be used > >> for movable allocations. 
> >> > >> Mechanisms that possibly turn these folios unmovable for a > >> long/indeterminate time must either fail or migrate these folios out of > >> these regions, otherwise we start violating the very semantics why > >> ZONE_MOVABLE/MIGRATE_CMA was added in the first place. > >> > >> Yes, there are corner cases where we cannot guarantee movability (e.g., OOM > >> when allocating a migration destination), but these are not cases that can > >> be triggered by (unprivileged) user space easily. > >> > >> That's why FOLL_LONGTERM pinning does exactly that: even if user space would > >> promise that this is really only "short-term", we will treat it as "possibly > >> forever", because it's under user-space control. > >> > >> > >> Instead of having more subsystems violate these semantics because > >> "performance" ... I would hope we would do better. Maybe it's an issue for > >> NFS as well ("at least" only for privileged user space)? In which case, > >> again, I would hope we would do better. > >> > >> > >> Anyhow, I'm hoping there will be more feedback from other MM folks, but > >> likely right now a lot of people are out (just like I should ;) ). > >> > >> If I end up being the only one with these concerns, then likely people can > >> feel free to ignore them. ;) > > > > I agree we should do better but IMHO it should be an iterative process. > > I think your concerns are valid, so let's push the discussion towards > > resolving those concerns. I think the concerns can be resolved by better > > handling of lifetime of folios under writeback. The amount of such > > folios is already handled through existing dirty throttling mechanism. > > > > We should start with a baseline i.e. distribution of lifetime of folios > > under writeback for traditional storage devices (spinning disk and SSDs) > > as we don't want an unrealistic goal for ourself. I think this data will > > drive the appropriate timeout values (if we decide timeout based > > approach is the right one). 
> > > > At the moment we have timeout based approach to limit the lifetime of > > folios under writeback. Any other ideas? > > See above, maybe we could limit the deadlock avoidance to the actual > deadlock path and sort out the "infinite writeback in some corner cases" > problem separately. > > -- > Cheers, > > David / dhildenb > ^ permalink raw reply [flat|nested] 124+ messages in thread
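For readers following the quoted prctl(PR_SET_IO_FLUSHER, 1) discussion above, here is a minimal userspace sketch of why the flag is awkward for a dynamic thread pool: it only affects the calling thread. This is an illustration under stated assumptions, not code from the patchset or from libfuse. The helper name mark_io_flusher() is hypothetical, the fallback constant is the value from linux/prctl.h, and the call needs Linux 5.6 or later plus CAP_SYS_RESOURCE, so on an unprivileged system it is expected to fail with an error rather than succeed.

```c
#include <errno.h>
#include <sys/prctl.h>

/* Fallback for older userspace headers; 57 is the value in linux/prctl.h. */
#ifndef PR_SET_IO_FLUSHER
#define PR_SET_IO_FLUSHER 57
#endif

/*
 * Hypothetical helper: mark the *calling thread* as an IO flusher.
 * The flag is per-thread, so every thread that may handle writeback
 * requests has to call this itself, including threads that a dynamic
 * pool creates later.
 *
 * Returns 0 on success or a negative errno value (for example -EPERM
 * without CAP_SYS_RESOURCE, or -EINVAL on kernels without the option).
 */
static int mark_io_flusher(void)
{
    if (prctl(PR_SET_IO_FLUSHER, 1, 0, 0, 0) == -1)
        return -errno;
    return 0;
}
```

A server would presumably call mark_io_flusher() at the top of each worker thread's entry function and treat a failure as "no deadlock protection for this thread". As noted in the quoted text, the setting then applies to every allocation made by that thread, not only the ones on the writeback path.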
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2024-12-30 18:38 ` Joanne Koong @ 2024-12-30 19:52 ` David Hildenbrand 2024-12-30 20:11 ` Shakeel Butt 2024-12-30 20:04 ` Shakeel Butt 1 sibling, 1 reply; 124+ messages in thread From: David Hildenbrand @ 2024-12-30 19:52 UTC (permalink / raw) To: Joanne Koong Cc: Shakeel Butt, Bernd Schubert, Zi Yan, miklos, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko >> >> What sounds plausible to me is: >> >> a) Make this only affect the actual deadlock path: sync migration >> during compaction. Communicate it either using some "context" >> information or with a new MIGRATE_SYNC_COMPACTION. >> b) Call it sth. like AS_WRITEBACK_MIGHT_DEADLOCK_ON_RECLAIM to express >> that very deadlock problem. >> c) Leave all other sync migration users alone for now. > > The deadlock path is separate from sync migration. The deadlock arises > from a corner case where cgroupv1 reclaim waits on a folio under > writeback where that writeback itself is blocked on reclaim. Okay, so compaction (IOW this patch) is not relevant at all to resolve the deadlock in any way, correct? For a second I thought I understood how this patch here relates to the deadlock :) > >> >> Would that prevent the deadlock? Even *better* would be to be able to >> ask the fs if starting writeback on a specific folio could deadlock. >> Because in most cases, as I understand, we'll not actually run into the >> deadlock and would just want to wait for writeback to complete >> (esp. compaction). >> >> (I still think having folios under writeback for a long time might be a >> problem, but that's indeed something to sort out separately in the >> future, because I suspect NFS has similar issues. We'd want to "wait >> with timeout" and e.g., cancel writeback during memory >> offlining/alloc_cma ...) 
> > I'm looking back at some of the discussions in v2 [1] and I'm still > not clear on how memory fragmentation for non-movable pages differs > from memory fragmentation from movable pages and whether one is worse > than the other. Currently fuse uses movable temp pages (allocated with > gfp flags GFP_NOFS | __GFP_HIGHMEM), and these can run into the same Why are they movable? Do you also specify __GFP_MOVABLE? If not, they are unmovable and are never allocated from ZONE_MOVABLE/MIGRATE_CMA -- and usually only from MIGRATE_UNMOVABLE, to group these unmovable pages. > issue where a buggy/malicious server may never complete writeback. If the temp pages are not allocated using __GFP_MOVABLE, they are just like any other kernel allocation -- unmovable. Nobody would even try migrating them, ever. And they are allocated from memory regions where that is expected. > This has the same effect of fragmenting memory and has a worse memory > cost to the system in terms of memory used. With not having temp pages > though, now in this scenario, pages allocated in a movable page block > can't be compacted and that memory is fragmented. Yes. With temp pages, they were simply grouped naturally "where they belong". After all, pagecache pages are allocated using __GFP_MOVABLE, which implies "this thing is movable" -- so the buddy can place them in physical memory regions that allow only for movable allocations or minimize fragmentation. > My (basic and maybe > incorrect) understanding is that memory gets allocated through a buddy > allocator and movable vs nonmovable pages get allocated to > corresponding blocks that match their type, but there's no other > difference otherwise. Is this understanding correct? Or is there some > substantial difference between fragmentation for movable vs nonmovable > blocks? I assume not regarding fragmentation. 
In general, I see two main issues: A) We are no longer waiting on writeback, even though we expect in sane environments that writeback will happen and it might be worthwhile to just wait for writeback so we can migrate these folios. B) We allow movable pages to become unmovable, possibly forever or for a long time, and there is no way to make them movable again (e.g., cancel writeback). I'm wondering if A) is actually a new issue introduced by this change. Can folios with busy temp pages (writeback cleared on folio, but temp pages are still around) be migrated? I will look into some details once I'm back from vacation. -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2024-12-30 19:52 ` David Hildenbrand @ 2024-12-30 20:11 ` Shakeel Butt 2025-01-02 18:54 ` Joanne Koong 0 siblings, 1 reply; 124+ messages in thread From: Shakeel Butt @ 2024-12-30 20:11 UTC (permalink / raw) To: David Hildenbrand Cc: Joanne Koong, Bernd Schubert, Zi Yan, miklos, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On Mon, Dec 30, 2024 at 08:52:04PM +0100, David Hildenbrand wrote: > [...] > > I'm looking back at some of the discussions in v2 [1] and I'm still > > not clear on how memory fragmentation for non-movable pages differs > > from memory fragmentation from movable pages and whether one is worse > > than the other. Currently fuse uses movable temp pages (allocated with > > gfp flags GFP_NOFS | __GFP_HIGHMEM), and these can run into the same > > Why are they movable? Do you also specify __GFP_MOVABLE? > > If not, they are unmovable and are never allocated from > ZONE_MOVABLE/MIGRATE_CMA -- and usually only from MIGRATE_UNMOVABLE, to > group these unmovable pages. > Yes, these temp pages are non-movable. (Must be a typo in Joanne's email). [...] > > I assume not regarding fragmentation. > > > In general, I see two main issues: > > A) We are no longer waiting on writeback, even though we expect in sane > environments that writeback will happen and it might be worthwhile to > just wait for writeback so we can migrate these folios. > > B) We allow movable pages to become unmovable, possibly forever or for a long > time, and there is no way to make them movable again (e.g., cancel > writeback). > > > I'm wondering if A) is actually a new issue introduced by this change. Can > folios with busy temp pages (writeback cleared on folio, but temp pages are > still around) be migrated? I will look into some details once I'm back from > vacation. 
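To make issues A) and B) above easier to compare, the decision being debated can be written down as a small userspace model. This is a sketch under stated assumptions, not the kernel's migration code: every name below (migrate_mode_model, folio_model, migrate_decision, honor_flag) is hypothetical, the two modes are a simplification of the real migrate modes, and the only behavior taken from the thread is that sync migration would normally wait on writeback, while the patch makes it skip folios whose mapping has AS_WRITEBACK_INDETERMINATE set.

```c
#include <stdbool.h>

/* Simplified stand-ins for the migration modes (model only). */
enum migrate_mode_model { MODEL_ASYNC, MODEL_SYNC };

enum migrate_decision_model {
    DECIDE_MIGRATE,           /* folio can be migrated right away */
    DECIDE_WAIT_THEN_MIGRATE, /* wait for writeback, then migrate */
    DECIDE_SKIP               /* leave the folio where it is */
};

struct folio_model {
    bool under_writeback;
    /* Stands in for AS_WRITEBACK_INDETERMINATE on the folio's mapping. */
    bool writeback_indeterminate;
};

/*
 * honor_flag == true models the behavior with the patch applied,
 * honor_flag == false models the behavior without it.
 */
static enum migrate_decision_model
migrate_decision(const struct folio_model *folio,
                 enum migrate_mode_model mode, bool honor_flag)
{
    if (!folio->under_writeback)
        return DECIDE_MIGRATE;

    /* Assumption of the model: non-sync migration never waits. */
    if (mode == MODEL_ASYNC)
        return DECIDE_SKIP;

    /* Issue A): with the patch, sync migration no longer waits. */
    if (honor_flag && folio->writeback_indeterminate)
        return DECIDE_SKIP;

    /* Without the patch the wait may be unbounded, which is issue B). */
    return DECIDE_WAIT_THEN_MIGRATE;
}
```

In this model the flag only changes the outcome for sync migration of a folio that is under writeback: with the patch the folio is skipped (and stays where it is for as long as writeback lasts), without it the caller waits for however long the server takes.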
> My suggestion is to just drop the patch related to A as it is not required for deadlock avoidance. For B, I think we need a long term solution which is usable by other filesystems as well. ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2024-12-30 20:11 ` Shakeel Butt @ 2025-01-02 18:54 ` Joanne Koong 2025-01-03 20:31 ` David Hildenbrand 0 siblings, 1 reply; 124+ messages in thread From: Joanne Koong @ 2025-01-02 18:54 UTC (permalink / raw) To: Shakeel Butt Cc: David Hildenbrand, Bernd Schubert, Zi Yan, miklos, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On Mon, Dec 30, 2024 at 12:11 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > On Mon, Dec 30, 2024 at 08:52:04PM +0100, David Hildenbrand wrote: > > > [...] > > > I'm looking back at some of the discussions in v2 [1] and I'm still > > > not clear on how memory fragmentation for non-movable pages differs > > > from memory fragmentation from movable pages and whether one is worse > > > than the other. Currently fuse uses movable temp pages (allocated with > > > gfp flags GFP_NOFS | __GFP_HIGHMEM), and these can run into the same > > > > Why are they movable? Do you also specify __GFP_MOVABLE? > > > > If not, they are unmovable and are never allocated from > > ZONE_MOVABLE/MIGRATE_CMA -- and usually only from MIGRATE_UNMOVABLE, to > > group these unmovable pages. > > > > Yes, these temp pages are non-movable. (Must be a typo in Joanne's > email). Sorry for the confusion, that should have been "non-movable temp pages". > > [...] > > > > I assume not regarding fragmentation. > > > > > > In general, I see two main issues: > > > > A) We are no longer waiting on writeback, even though we expect in sane > > environments that writeback will happen and it might be worthwhile to > > just wait for writeback so we can migrate these folios. > > > > B) We allow movable pages to become unmovable, possibly forever or for a long > > time, and there is no way to make them movable again (e.g., cancel > > writeback). > > > > > > I'm wondering if A) is actually a new issue introduced by this change. 
Can > > folios with busy temp pages (writeback cleared on folio, but temp pages are > > still around) be migrated? I will look into some details once I'm back from > > vacation. > > Folios with busy temp pages can be migrated since fuse will clear writeback on the folio immediately once it's copied to the temp page. To me, these two issues seem like one and the same. No longer waiting on writeback renders it unmovable, which prevents compaction/migration. > > My suggestion is to just drop the patch related to A as it is not > required for deadlock avoidance. For B, I think we need a long term > solution which is usable by other filesystems as well. Sounds good. With that, we need to take this patchset out of mm-unstable or this could lead to migration infinitely waiting on folio writeback without the migrate patch there. Thanks, Joanne ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2025-01-02 18:54 ` Joanne Koong @ 2025-01-03 20:31 ` David Hildenbrand 2025-01-06 10:19 ` Miklos Szeredi 0 siblings, 1 reply; 124+ messages in thread From: David Hildenbrand @ 2025-01-03 20:31 UTC (permalink / raw) To: Joanne Koong, Shakeel Butt Cc: Bernd Schubert, Zi Yan, miklos, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On 02.01.25 19:54, Joanne Koong wrote: > On Mon, Dec 30, 2024 at 12:11 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: >> >> On Mon, Dec 30, 2024 at 08:52:04PM +0100, David Hildenbrand wrote: >>> >> [...] >>>> I'm looking back at some of the discussions in v2 [1] and I'm still >>>> not clear on how memory fragmentation for non-movable pages differs >>>> from memory fragmentation from movable pages and whether one is worse >>>> than the other. Currently fuse uses movable temp pages (allocated with >>>> gfp flags GFP_NOFS | __GFP_HIGHMEM), and these can run into the same >>> >>> Why are they movable? Do you also specify __GFP_MOVABLE? >>> >>> If not, they are unmovable and are never allocated from >>> ZONE_MOVABLE/MIGRATE_CMA -- and usually only from MIGRATE_UNMOVABLE, to >>> group these unmovable pages. >>> >> >> Yes, these temp pages are non-movable. (Must be a typo in Joanne's >> email). > > Sorry for the confusion, that should have been "non-movable temp pages". > >> >> [...] >>> >>> I assume not regarding fragmentation. >>> >>> >>> In general, I see two main issues: >>> >>> A) We are no longer waiting on writeback, even though we expect in sane >>> environments that writeback will happen and it might be worthwhile to >>> just wait for writeback so we can migrate these folios. >>> >>> B) We allow movable pages to become unmovable, possibly forever or for a long >>> time, and there is no way to make them movable again (e.g., cancel >>> writeback). 
>>> >>> >>> I'm wondering if A) is actually a new issue introduced by this change. Can >>> folios with busy temp pages (writeback cleared on folio, but temp pages are >>> still around) be migrated? I will look into some details once I'm back from >>> vacation. >>> > > Folios with busy temp pages can be migrated since fuse will clear > writeback on the folio immediately once it's copied to the temp page. I was rather wondering if there is something else that prevents migrating these folios: for example, if there is a raised refcount on the folio while the temp pages exist. If that is not the case, then it should indeed just work. > > To me, these two issues seem like one and the same. No longer waiting > on writeback renders it unmovable, which prevents > compaction/migration. > >> >> My suggestion is to just drop the patch related to A as it is not >> required for deadlock avoidance. For B, I think we need a long term >> solution which is usable by other filesystems as well. > > Sounds good. With that, we need to take this patchset out of > mm-unstable or this could lead to migration infinitely waiting on > folio writeback without the migrate patch there. I want to try triggering it with NFS next week when I am back from PTO, to see if it is indeed a problem there as well on connection loss. In any case, having movable pages be turned unmovable due to persistent writeback is something that must be fixed, not worked around. Likely a good topic for LSF/MM. -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2025-01-03 20:31 ` David Hildenbrand @ 2025-01-06 10:19 ` Miklos Szeredi 2025-01-06 18:17 ` Shakeel Butt 0 siblings, 1 reply; 124+ messages in thread From: Miklos Szeredi @ 2025-01-06 10:19 UTC (permalink / raw) To: David Hildenbrand Cc: Joanne Koong, Shakeel Butt, Bernd Schubert, Zi Yan, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On Fri, 3 Jan 2025 at 21:31, David Hildenbrand <david@redhat.com> wrote: > In any case, having movable pages be turned unmovable due to persistent > writeback is something that must be fixed, not worked around. Likely a > good topic for LSF/MM. Yes, this seems a good cross fs-mm topic. So the issue discussed here is that movable pages used for fuse page-cache cause problems when memory needs to be compacted. The problem is either that - the page is skipped, leaving the physical memory block unmovable - the compaction is blocked for an unbounded time While the new AS_WRITEBACK_INDETERMINATE could potentially make things worse, the same thing happens on readahead, since the new page can be locked for an indeterminate amount of time, which can also block compaction, right? What about explicitly opting fuse cache pages out of compaction by allocating them from ZONE_UNMOVABLE? Thanks, Miklos ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2025-01-06 10:19 ` Miklos Szeredi @ 2025-01-06 18:17 ` Shakeel Butt 2025-01-07 8:34 ` David Hildenbrand 2025-01-07 16:15 ` Miklos Szeredi 0 siblings, 2 replies; 124+ messages in thread From: Shakeel Butt @ 2025-01-06 18:17 UTC (permalink / raw) To: Miklos Szeredi Cc: David Hildenbrand, Joanne Koong, Bernd Schubert, Zi Yan, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On Mon, Jan 06, 2025 at 11:19:42AM +0100, Miklos Szeredi wrote: > On Fri, 3 Jan 2025 at 21:31, David Hildenbrand <david@redhat.com> wrote: > > In any case, having movable pages be turned unmovable due to persistent > > writeback is something that must be fixed, not worked around. Likely a > > good topic for LSF/MM. > > Yes, this seems a good cross fs-mm topic. > > So the issue discussed here is that movable pages used for fuse > page-cache cause problems when memory needs to be compacted. The > problem is either that > > - the page is skipped, leaving the physical memory block unmovable > > - the compaction is blocked for an unbounded time > > While the new AS_WRITEBACK_INDETERMINATE could potentially make things > worse, the same thing happens on readahead, since the new page can be > locked for an indeterminate amount of time, which can also block > compaction, right? Yes locked pages are unmovable. How many of these locked pages/folios can be caused by an untrusted fuse server? > > What about explicitly opting fuse cache pages out of compaction by > allocating them from ZONE_UNMOVABLE? This can be done but it will change the memory condition of the users/workloads/systems where page cache is the majority of the memory (i.e. majority of memory will be unmovable) and when such systems are overcommitted, weird corner cases will arise (failing high order allocations, long term fragmentation etc). 
In addition the memory behind CXL will become unusable for fuse folios. IMHO the transient unmovable state of fuse folios due to writeback is not an issue if we can show that untrusted fuse server can not cause unlimited folios under writeback for arbitrary long time. ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2025-01-06 18:17 ` Shakeel Butt @ 2025-01-07 8:34 ` David Hildenbrand 2025-01-07 18:07 ` Shakeel Butt 2025-01-10 20:16 ` Jeff Layton 2025-01-07 16:15 ` Miklos Szeredi 1 sibling, 2 replies; 124+ messages in thread From: David Hildenbrand @ 2025-01-07 8:34 UTC (permalink / raw) To: Shakeel Butt, Miklos Szeredi Cc: Joanne Koong, Bernd Schubert, Zi Yan, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On 06.01.25 19:17, Shakeel Butt wrote: > On Mon, Jan 06, 2025 at 11:19:42AM +0100, Miklos Szeredi wrote: >> On Fri, 3 Jan 2025 at 21:31, David Hildenbrand <david@redhat.com> wrote: >>> In any case, having movable pages be turned unmovable due to persistent >>> writaback is something that must be fixed, not worked around. Likely a >>> good topic for LSF/MM. >> >> Yes, this seems a good cross fs-mm topic. >> >> So the issue discussed here is that movable pages used for fuse >> page-cache cause a problems when memory needs to be compacted. The >> problem is either that >> >> - the page is skipped, leaving the physical memory block unmovable >> >> - the compaction is blocked for an unbounded time >> >> While the new AS_WRITEBACK_INDETERMINATE could potentially make things >> worse, the same thing happens on readahead, since the new page can be >> locked for an indeterminate amount of time, which can also block >> compaction, right? Yes, as memory hotplug + virtio-mem maintainer my bigger concern is these pages residing in ZONE_MOVABLE / MIGRATE_CMA areas where there *must not be unmovable pages ever*. Not triggered by an untrusted source, not triggered by an trusted source. It's a violation of core-mm principles. Even if we have a timeout of 60s, making things like alloc_contig_page() wait for that long on writeback is broken and needs to be fixed. And the fix is not to skip these pages, that's a workaround. 
I'm hoping I can find an easy way to trigger this also with NFS. > > Yes locked pages are unmovable. How much of these locked pages/folios > can be caused by untrusted fuse server? > >> >> What about explicitly opting fuse cache pages out of compaction by >> allocating them form ZONE_UNMOVABLE? > > This can be done but it will change the memory condition of the > users/workloads/systems where page cache is the majority of the memory > (i.e. majority of memory will be unmovable) and when such systems are > overcommitted, weird corner cases will arise (failing high order > allocations, long term fragmentation etc). In addition the memory > behind CXL will become unusable for fuse folios. Yes. > > IMHO the transient unmovable state of fuse folios due to writeback is > not an issue if we can show that untrusted fuse server can not cause > unlimited folios under writeback for arbitrary long time. See above, I disagree. -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2025-01-07 8:34 ` David Hildenbrand @ 2025-01-07 18:07 ` Shakeel Butt 2025-01-09 11:22 ` David Hildenbrand 2025-01-10 20:16 ` Jeff Layton 1 sibling, 1 reply; 124+ messages in thread From: Shakeel Butt @ 2025-01-07 18:07 UTC (permalink / raw) To: David Hildenbrand Cc: Miklos Szeredi, Joanne Koong, Bernd Schubert, Zi Yan, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On Tue, Jan 07, 2025 at 09:34:49AM +0100, David Hildenbrand wrote: > On 06.01.25 19:17, Shakeel Butt wrote: > > On Mon, Jan 06, 2025 at 11:19:42AM +0100, Miklos Szeredi wrote: > > > On Fri, 3 Jan 2025 at 21:31, David Hildenbrand <david@redhat.com> wrote: > > > > In any case, having movable pages be turned unmovable due to persistent > > > > writaback is something that must be fixed, not worked around. Likely a > > > > good topic for LSF/MM. > > > > > > Yes, this seems a good cross fs-mm topic. > > > > > > So the issue discussed here is that movable pages used for fuse > > > page-cache cause a problems when memory needs to be compacted. The > > > problem is either that > > > > > > - the page is skipped, leaving the physical memory block unmovable > > > > > > - the compaction is blocked for an unbounded time > > > > > > While the new AS_WRITEBACK_INDETERMINATE could potentially make things > > > worse, the same thing happens on readahead, since the new page can be > > > locked for an indeterminate amount of time, which can also block > > > compaction, right? > > Yes, as memory hotplug + virtio-mem maintainer my bigger concern is these > pages residing in ZONE_MOVABLE / MIGRATE_CMA areas where there *must not be > unmovable pages ever*. Not triggered by an untrusted source, not triggered > by an trusted source. > > It's a violation of core-mm principles. 
The "must not be unmovable pages ever" is a very strong statement and we are violating it today and will keep violating it in future. Any page/folio under lock or writeback or have reference taken or have been isolated from their LRU is unmovable (most of the time for small period of time). These operations are being done all over the place in kernel. Miklos gave an example of readahead. The per-CPU LRU caches are another case where folios can get stuck for long period of time. Reclaim and compaction can isolate a lot of folios that they need to have too_many_isolated() checks. So, "must not be unmovable pages ever" is impractical. The point is that, yes we should aim to improve things but in iterations and "must not be unmovable pages ever" is not something we can achieve in one step. Though I doubt that state is practically achievable and to me something like a bound (time or amount) on the transient unmovable folios is more practical. ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2025-01-07 18:07 ` Shakeel Butt @ 2025-01-09 11:22 ` David Hildenbrand 2025-01-10 20:28 ` Jeff Layton 0 siblings, 1 reply; 124+ messages in thread From: David Hildenbrand @ 2025-01-09 11:22 UTC (permalink / raw) To: Shakeel Butt Cc: Miklos Szeredi, Joanne Koong, Bernd Schubert, Zi Yan, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On 07.01.25 19:07, Shakeel Butt wrote: > On Tue, Jan 07, 2025 at 09:34:49AM +0100, David Hildenbrand wrote: >> On 06.01.25 19:17, Shakeel Butt wrote: >>> On Mon, Jan 06, 2025 at 11:19:42AM +0100, Miklos Szeredi wrote: >>>> On Fri, 3 Jan 2025 at 21:31, David Hildenbrand <david@redhat.com> wrote: >>>>> In any case, having movable pages be turned unmovable due to persistent >>>>> writaback is something that must be fixed, not worked around. Likely a >>>>> good topic for LSF/MM. >>>> >>>> Yes, this seems a good cross fs-mm topic. >>>> >>>> So the issue discussed here is that movable pages used for fuse >>>> page-cache cause a problems when memory needs to be compacted. The >>>> problem is either that >>>> >>>> - the page is skipped, leaving the physical memory block unmovable >>>> >>>> - the compaction is blocked for an unbounded time >>>> >>>> While the new AS_WRITEBACK_INDETERMINATE could potentially make things >>>> worse, the same thing happens on readahead, since the new page can be >>>> locked for an indeterminate amount of time, which can also block >>>> compaction, right? >> >> Yes, as memory hotplug + virtio-mem maintainer my bigger concern is these >> pages residing in ZONE_MOVABLE / MIGRATE_CMA areas where there *must not be >> unmovable pages ever*. Not triggered by an untrusted source, not triggered >> by an trusted source. >> >> It's a violation of core-mm principles. 
> > The "must not be unmovable pages ever" is a very strong statement and we > are violating it today and will keep violating it in future. Any > page/folio under lock or writeback or have reference taken or have been > isolated from their LRU is unmovable (most of the time for small period > of time). ^ this: "small period of time" is what I meant. Most of these things are known to not be problematic: retrying a couple of times makes it work, that's why migration keeps retrying. Again, as an example, we allow short-term O_DIRECT but disallow long-term page pinning. I think there were concerns at some point if O_DIRECT might also be problematic (I/O might take a while), but so far it was not a problem in practice that would make CMA allocations easily fail. vmsplice() is a known problem, because it behaves like O_DIRECT but actually triggers long-term pinning; IIRC David Howells has this on his todo list to fix. [I recall that seccomp disallows vmsplice by default right now] These operations are being done all over the place in kernel. > Miklos gave an example of readahead. I assume you mean "unmovable for a short time", correct, or can you point me at that specific example; I think I missed that. > The per-CPU LRU caches are another > case where folios can get stuck for long period of time. Which is why memory offlining disables the lru cache. See lru_cache_disable(). Other users that care about that drain the LRU on all cpus. > Reclaim and > compaction can isolate a lot of folios that they need to have > too_many_isolated() checks. So, "must not be unmovable pages ever" is > impractical. "must only be short-term unmovable", better? > > The point is that, yes we should aim to improve things but in iterations > and "must not be unmovable pages ever" is not something we can achieve > in one step. I agree with the "improve things in iterations", but as AS_WRITEBACK_INDETERMINATE has the FOLL_LONGTERM smell to it, I think we are making things worse. 
And as this discussion has been going on for too long, to summarize my point: there exist conditions where pages are short-term unmovable, and possibly some to be fixed that turn pages long-term unmovable (e.g., vmsplice); that does not mean that we can freely add new conditions that turn movable pages unmovable long-term or even forever. Again, this might be a good LSF/MM topic. If I had the capacity, I would suggest a topic around which things are known to cause pages to be short-term or long-term unmovable/unsplittable, which of those can be handled, and which cannot. Maybe I'll find the time to propose that as a topic. -- Cheers, David / dhildenb
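The short-term versus long-term distinction drawn in this message ("retrying a couple of times makes it work, that's why migration keeps retrying") can be sketched with a toy userspace model. This is only an illustration under stated assumptions: `struct toy_folio`, `toy_try_migrate()` and `toy_migrate_with_retries()` are hypothetical names invented here, and nothing below is the kernel's actual migration code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical toy model, not kernel code.
 * busy_passes > 0  : folio stays unmovable for that many more attempts
 *                    (short-term: lock, short pin, ordinary writeback).
 * busy_passes == 0 : folio is movable right now.
 * busy_passes < 0  : folio never becomes movable (long-term case).
 */
struct toy_folio {
	int busy_passes;
};

/* One migration attempt: succeeds only if nothing holds the folio. */
bool toy_try_migrate(struct toy_folio *f)
{
	assert(f != NULL);
	if (f->busy_passes == 0)
		return true;
	if (f->busy_passes > 0)
		f->busy_passes--;	/* the holder makes progress */
	return false;
}

/*
 * Retry a bounded number of times. Returns the 1-based attempt that
 * succeeded, or -1 if the caller gave up.
 */
int toy_migrate_with_retries(struct toy_folio *f, int max_attempts)
{
	for (int attempt = 1; attempt <= max_attempts; attempt++) {
		if (toy_try_migrate(f))
			return attempt;
	}
	return -1;
}
```

In this model a folio that is busy for a few passes is migrated once the holder lets go, while a folio that never becomes movable exhausts every retry budget, which is the situation the thread is worried about for ZONE_MOVABLE and CMA.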
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2025-01-09 11:22 ` David Hildenbrand @ 2025-01-10 20:28 ` Jeff Layton 2025-01-10 21:13 ` David Hildenbrand 0 siblings, 1 reply; 124+ messages in thread From: Jeff Layton @ 2025-01-10 20:28 UTC (permalink / raw) To: David Hildenbrand, Shakeel Butt Cc: Miklos Szeredi, Joanne Koong, Bernd Schubert, Zi Yan, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On Thu, 2025-01-09 at 12:22 +0100, David Hildenbrand wrote: > On 07.01.25 19:07, Shakeel Butt wrote: > > On Tue, Jan 07, 2025 at 09:34:49AM +0100, David Hildenbrand wrote: > > > On 06.01.25 19:17, Shakeel Butt wrote: > > > > On Mon, Jan 06, 2025 at 11:19:42AM +0100, Miklos Szeredi wrote: > > > > > On Fri, 3 Jan 2025 at 21:31, David Hildenbrand <david@redhat.com> wrote: > > > > > > In any case, having movable pages be turned unmovable due to persistent > > > > > > writaback is something that must be fixed, not worked around. Likely a > > > > > > good topic for LSF/MM. > > > > > > > > > > Yes, this seems a good cross fs-mm topic. > > > > > > > > > > So the issue discussed here is that movable pages used for fuse > > > > > page-cache cause a problems when memory needs to be compacted. The > > > > > problem is either that > > > > > > > > > > - the page is skipped, leaving the physical memory block unmovable > > > > > > > > > > - the compaction is blocked for an unbounded time > > > > > > > > > > While the new AS_WRITEBACK_INDETERMINATE could potentially make things > > > > > worse, the same thing happens on readahead, since the new page can be > > > > > locked for an indeterminate amount of time, which can also block > > > > > compaction, right? > > > > > > Yes, as memory hotplug + virtio-mem maintainer my bigger concern is these > > > pages residing in ZONE_MOVABLE / MIGRATE_CMA areas where there *must not be > > > unmovable pages ever*. 
Not triggered by an untrusted source, not triggered > > > by an trusted source. > > > > > > It's a violation of core-mm principles. > > > > The "must not be unmovable pages ever" is a very strong statement and we > > are violating it today and will keep violating it in future. Any > > page/folio under lock or writeback or have reference taken or have been > > isolated from their LRU is unmovable (most of the time for small period > > of time). > > ^ this: "small period of time" is what I meant. > > Most of these things are known to not be problematic: retrying a couple > of times makes it work, that's why migration keeps retrying. > > Again, as an example, we allow short-term O_DIRECT but disallow > long-term page pinning. I think there were concerns at some point if > O_DIRECT might also be problematic (I/O might take a while), but so far > it was not a problem in practice that would make CMA allocations easily > fail. > > vmsplice() is a known problem, because it behaves like O_DIRECT but > actually triggers long-term pinning; IIRC David Howells has this on his > todo list to fix. [I recall that seccomp disallows vmsplice by default > right now] > > These operations are being done all over the place in kernel. > > Miklos gave an example of readahead. > > I assume you mean "unmovable for a short time", correct, or can you > point me at that specific example; I think I missed that. > > > The per-CPU LRU caches are another > > case where folios can get stuck for long period of time. > > Which is why memory offlining disables the lru cache. See > lru_cache_disable(). Other users that care about that drain the LRU on > all cpus. > > > Reclaim and > > compaction can isolate a lot of folios that they need to have > > too_many_isolated() checks. So, "must not be unmovable pages ever" is > > impractical. > > "must only be short-term unmovable", better? > Still a little ambiguous. How short is "short-term"? Are we talking milliseconds or minutes? 
Imposing a hard timeout on writeback requests to unprivileged FUSE servers might give us a better guarantee of forward-progress, but it would probably have to be on the order of at least a minute or so to be workable. > > > > The point is that, yes we should aim to improve things but in iterations > > and "must not be unmovable pages ever" is not something we can achieve > > in one step. > > I agree with the "improve things in iterations", but as > AS_WRITEBACK_INDETERMINATE has the FOLL_LONGTERM smell to it, I think we > are making things worse. > > And as this discussion has been going on for too long, to summarize my > point: there exist conditions where pages are short-term unmovable, and > possibly some to be fixed that turn pages long-term unmovable (e.g., > vmsplice); that does not mean that we can freely add new conditions that > turn movable pages unmovable long-term or even forever. > > Again, this might be a good LSF/MM topic. If I would have the capacity I > would suggest a topic around which things are know to cause pages to be > short-term or long-term unmovable/unsplittable, and which can be > handled, which not. Maybe I'll find the time to propose that as a topic. > This does sound like great LSF/MM fodder! I predict that this session will run long! ;) -- Jeff Layton <jlayton@kernel.org> ^ permalink raw reply [flat|nested] 124+ messages in thread
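The hard-timeout idea floated in this message can be sketched as a simple deadline check. This is a minimal userspace sketch, assuming a hypothetical `struct toy_wb_request` and `toy_wb_check()` invented for illustration; the thread does not specify where such a check would live in FUSE or what should happen to a request once it expires, so the code only classifies the request.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bookkeeping for one writeback request (not a FUSE structure). */
struct toy_wb_request {
	uint64_t start_ms;	/* when the request was handed to the server */
	bool completed;		/* set once the server has replied */
};

enum toy_wb_state {
	TOY_WB_DONE,		/* server finished the writeback */
	TOY_WB_PENDING,		/* still within the allowed time */
	TOY_WB_EXPIRED,		/* hard timeout exceeded */
};

/*
 * Classify a request against a hard timeout.
 * Assumes now_ms >= req->start_ms (monotonic clock).
 */
enum toy_wb_state toy_wb_check(const struct toy_wb_request *req,
			       uint64_t now_ms, uint64_t timeout_ms)
{
	assert(req != NULL);
	if (req->completed)
		return TOY_WB_DONE;
	if (now_ms - req->start_ms >= timeout_ms)
		return TOY_WB_EXPIRED;
	return TOY_WB_PENDING;
}
```

With a timeout "on the order of at least a minute" (say `timeout_ms = 60000`, an illustrative value only), a folio could stay under writeback for that whole window before the check fires, which is the gap between a workable server timeout and the much shorter patience of migration callers.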
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2025-01-10 20:28 ` Jeff Layton @ 2025-01-10 21:13 ` David Hildenbrand 2025-01-10 22:00 ` Shakeel Butt 2025-01-10 23:11 ` Jeff Layton 0 siblings, 2 replies; 124+ messages in thread From: David Hildenbrand @ 2025-01-10 21:13 UTC (permalink / raw) To: Jeff Layton, Shakeel Butt Cc: Miklos Szeredi, Joanne Koong, Bernd Schubert, Zi Yan, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On 10.01.25 21:28, Jeff Layton wrote: > On Thu, 2025-01-09 at 12:22 +0100, David Hildenbrand wrote: >> On 07.01.25 19:07, Shakeel Butt wrote: >>> On Tue, Jan 07, 2025 at 09:34:49AM +0100, David Hildenbrand wrote: >>>> On 06.01.25 19:17, Shakeel Butt wrote: >>>>> On Mon, Jan 06, 2025 at 11:19:42AM +0100, Miklos Szeredi wrote: >>>>>> On Fri, 3 Jan 2025 at 21:31, David Hildenbrand <david@redhat.com> wrote: >>>>>>> In any case, having movable pages be turned unmovable due to persistent >>>>>>> writaback is something that must be fixed, not worked around. Likely a >>>>>>> good topic for LSF/MM. >>>>>> >>>>>> Yes, this seems a good cross fs-mm topic. >>>>>> >>>>>> So the issue discussed here is that movable pages used for fuse >>>>>> page-cache cause a problems when memory needs to be compacted. The >>>>>> problem is either that >>>>>> >>>>>> - the page is skipped, leaving the physical memory block unmovable >>>>>> >>>>>> - the compaction is blocked for an unbounded time >>>>>> >>>>>> While the new AS_WRITEBACK_INDETERMINATE could potentially make things >>>>>> worse, the same thing happens on readahead, since the new page can be >>>>>> locked for an indeterminate amount of time, which can also block >>>>>> compaction, right? >>>> >>>> Yes, as memory hotplug + virtio-mem maintainer my bigger concern is these >>>> pages residing in ZONE_MOVABLE / MIGRATE_CMA areas where there *must not be >>>> unmovable pages ever*. 
Not triggered by an untrusted source, not triggered >>>> by an trusted source. >>>> >>>> It's a violation of core-mm principles. >>> >>> The "must not be unmovable pages ever" is a very strong statement and we >>> are violating it today and will keep violating it in future. Any >>> page/folio under lock or writeback or have reference taken or have been >>> isolated from their LRU is unmovable (most of the time for small period >>> of time). >> >> ^ this: "small period of time" is what I meant. >> >> Most of these things are known to not be problematic: retrying a couple >> of times makes it work, that's why migration keeps retrying. >> >> Again, as an example, we allow short-term O_DIRECT but disallow >> long-term page pinning. I think there were concerns at some point if >> O_DIRECT might also be problematic (I/O might take a while), but so far >> it was not a problem in practice that would make CMA allocations easily >> fail. >> >> vmsplice() is a known problem, because it behaves like O_DIRECT but >> actually triggers long-term pinning; IIRC David Howells has this on his >> todo list to fix. [I recall that seccomp disallows vmsplice by default >> right now] >> >> These operations are being done all over the place in kernel. >>> Miklos gave an example of readahead. >> >> I assume you mean "unmovable for a short time", correct, or can you >> point me at that specific example; I think I missed that. >> >>> The per-CPU LRU caches are another >>> case where folios can get stuck for long period of time. >> >> Which is why memory offlining disables the lru cache. See >> lru_cache_disable(). Other users that care about that drain the LRU on >> all cpus. >> >>> Reclaim and >>> compaction can isolate a lot of folios that they need to have >>> too_many_isolated() checks. So, "must not be unmovable pages ever" is >>> impractical. >> >> "must only be short-term unmovable", better? >> > > Still a little ambiguous. > > How short is "short-term"? 
Are we talking milliseconds or minutes? Usually a couple of seconds, max. For memory offlining, slightly longer times are acceptable; other things (in particular compaction or CMA allocations) will give up much faster. > > Imposing a hard timeout on writeback requests to unprivileged FUSE > servers might give us a better guarantee of forward-progress, but it > would probably have to be on the order of at least a minute or so to be > workable. Yes, and that might already be a bit too much, especially if we are stuck waiting for folio writeback ... so ideally we could find a way to migrate these folios when they are under writeback and the backend is not an ordinary disk driver that responds rather quickly. Right now we do it via these temp pages, and I can see how that's undesirable. For NFS etc. we probably never ran into this, because it's all used in fairly well managed environments and, well, I assume NFS easily predates CMA and ZONE_MOVABLE :) > >>> >>> The point is that, yes we should aim to improve things but in iterations >>> and "must not be unmovable pages ever" is not something we can achieve >>> in one step. >> >> I agree with the "improve things in iterations", but as >> AS_WRITEBACK_INDETERMINATE has the FOLL_LONGTERM smell to it, I think we >> are making things worse. >> >> And as this discussion has been going on for too long, to summarize my >> point: there exist conditions where pages are short-term unmovable, and >> possibly some to be fixed that turn pages long-term unmovable (e.g., >> vmsplice); that does not mean that we can freely add new conditions that >> turn movable pages unmovable long-term or even forever. >> >> Again, this might be a good LSF/MM topic. If I would have the capacity I >> would suggest a topic around which things are know to cause pages to be >> short-term or long-term unmovable/unsplittable, and which can be >> handled, which not. Maybe I'll find the time to propose that as a topic. >> > > > This does sound like great LSF/MM fodder!
I predict that this session > will run long! ;) Heh, fully agreed! :) -- Cheers, David / dhildenb
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2025-01-10 21:13 ` David Hildenbrand @ 2025-01-10 22:00 ` Shakeel Butt 2025-01-13 15:27 ` David Hildenbrand 2025-01-10 23:11 ` Jeff Layton 1 sibling, 1 reply; 124+ messages in thread From: Shakeel Butt @ 2025-01-10 22:00 UTC (permalink / raw) To: David Hildenbrand Cc: Jeff Layton, Miklos Szeredi, Joanne Koong, Bernd Schubert, Zi Yan, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On Fri, Jan 10, 2025 at 10:13:17PM +0100, David Hildenbrand wrote: > On 10.01.25 21:28, Jeff Layton wrote: > > On Thu, 2025-01-09 at 12:22 +0100, David Hildenbrand wrote: > > > On 07.01.25 19:07, Shakeel Butt wrote: > > > > On Tue, Jan 07, 2025 at 09:34:49AM +0100, David Hildenbrand wrote: > > > > > On 06.01.25 19:17, Shakeel Butt wrote: > > > > > > On Mon, Jan 06, 2025 at 11:19:42AM +0100, Miklos Szeredi wrote: > > > > > > > On Fri, 3 Jan 2025 at 21:31, David Hildenbrand <david@redhat.com> wrote: > > > > > > > > In any case, having movable pages be turned unmovable due to persistent > > > > > > > > writaback is something that must be fixed, not worked around. Likely a > > > > > > > > good topic for LSF/MM. > > > > > > > > > > > > > > Yes, this seems a good cross fs-mm topic. > > > > > > > > > > > > > > So the issue discussed here is that movable pages used for fuse > > > > > > > page-cache cause a problems when memory needs to be compacted. 
The > > > > > > > problem is either that > > > > > > > > > > > > > > - the page is skipped, leaving the physical memory block unmovable > > > > > > > > > > > > > > - the compaction is blocked for an unbounded time > > > > > > > > > > > > > > While the new AS_WRITEBACK_INDETERMINATE could potentially make things > > > > > > > worse, the same thing happens on readahead, since the new page can be > > > > > > > locked for an indeterminate amount of time, which can also block > > > > > > > compaction, right? > > > > > > > > > > Yes, as memory hotplug + virtio-mem maintainer my bigger concern is these > > > > > pages residing in ZONE_MOVABLE / MIGRATE_CMA areas where there *must not be > > > > > unmovable pages ever*. Not triggered by an untrusted source, not triggered > > > > > by an trusted source. > > > > > > > > > > It's a violation of core-mm principles. > > > > > > > > The "must not be unmovable pages ever" is a very strong statement and we > > > > are violating it today and will keep violating it in future. Any > > > > page/folio under lock or writeback or have reference taken or have been > > > > isolated from their LRU is unmovable (most of the time for small period > > > > of time). > > > > > > ^ this: "small period of time" is what I meant. > > > > > > Most of these things are known to not be problematic: retrying a couple > > > of times makes it work, that's why migration keeps retrying. > > > > > > Again, as an example, we allow short-term O_DIRECT but disallow > > > long-term page pinning. I think there were concerns at some point if > > > O_DIRECT might also be problematic (I/O might take a while), but so far > > > it was not a problem in practice that would make CMA allocations easily > > > fail. > > > > > > vmsplice() is a known problem, because it behaves like O_DIRECT but > > > actually triggers long-term pinning; IIRC David Howells has this on his > > > todo list to fix. 
[I recall that seccomp disallows vmsplice by default > > > right now] > > > > > > These operations are being done all over the place in kernel. > > > > Miklos gave an example of readahead. > > > > > > I assume you mean "unmovable for a short time", correct, or can you > > > point me at that specific example; I think I missed that. Please see https://lore.kernel.org/all/CAJfpegthP2enc9o1hV-izyAG9nHcD_tT8dKFxxzhdQws6pcyhQ@mail.gmail.com/ > > > > > > > The per-CPU LRU caches are another > > > > case where folios can get stuck for long period of time. > > > > > > Which is why memory offlining disables the lru cache. See > > > lru_cache_disable(). Other users that care about that drain the LRU on > > > all cpus. > > > > > > > Reclaim and > > > > compaction can isolate a lot of folios that they need to have > > > > too_many_isolated() checks. So, "must not be unmovable pages ever" is > > > > impractical. > > > > > > "must only be short-term unmovable", better? Yes and you have clarified further below of the actual amount. > > > > > > > Still a little ambiguous. > > > > How short is "short-term"? Are we talking milliseconds or minutes? > > Usually a couple of seconds, max. For memory offlining, slightly longer > times are acceptable; other things (in particular compaction or CMA > allocations) will give up much faster. > > > > > Imposing a hard timeout on writeback requests to unprivileged FUSE > > servers might give us a better guarantee of forward-progress, but it > > would probably have to be on the order of at least a minute or so to be > > workable. > > Yes, and that might already be a bit too much, especially if stuck on > waiting for folio writeback ... so ideally we could find a way to migrate > these folios that are under writeback and it's not your ordinary disk driver > that responds rather quickly. > > Right now we do it via these temp pages, and I can see how that's > undesirable. > > For NFS etc. 
we probably never ran into this, because it's all used in > fairly well managed environments and, well, I assume NFS easily outdates CMA > and ZONE_MOVABLE :) > > > >>> > > > > The point is that, yes we should aim to improve things but in iterations > > > > and "must not be unmovable pages ever" is not something we can achieve > > > > in one step. > > > > > > I agree with the "improve things in iterations", but as > > > AS_WRITEBACK_INDETERMINATE has the FOLL_LONGTERM smell to it, I think we > > > are making things worse. AS_WRITEBACK_INDETERMINATE is really a bad name; we picked it and it is still causing confusion. It is a simple flag to avoid deadlock in the reclaim code path and does not say anything about movability. > > > > > > And as this discussion has been going on for too long, to summarize my > > > point: there exist conditions where pages are short-term unmovable, and > > > possibly some to be fixed that turn pages long-term unmovable (e.g., > > > vmsplice); that does not mean that we can freely add new conditions that > > > turn movable pages unmovable long-term or even forever. > > > > > > Again, this might be a good LSF/MM topic. If I would have the capacity I > > > would suggest a topic around which things are know to cause pages to be > > > short-term or long-term unmovable/unsplittable, and which can be > > > handled, which not. Maybe I'll find the time to propose that as a topic. > > > > > > > > > This does sound like great LSF/MM fodder! I predict that this session > > will run long! ;) > > Heh, fully agreed! :) I would like a more targeted topic, and for that I want us to at least agree on where we disagree. Let me write down two statements and please tell me where you disagree: 1. For a normally running FUSE server (without tmp pages), the lifetime of the writeback state of fuse folios falls under the "short-term unmovable" bucket, as it does not differ in any way from other filesystems handling writeback folios. 2.
For a buggy or untrusted FUSE server (without tmp pages), the lifetime of the writeback state of fuse folios can be arbitrarily long, and we need some mechanism to limit it.
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2025-01-10 22:00 ` Shakeel Butt @ 2025-01-13 15:27 ` David Hildenbrand 2025-01-13 21:44 ` Jeff Layton 0 siblings, 1 reply; 124+ messages in thread From: David Hildenbrand @ 2025-01-13 15:27 UTC (permalink / raw) To: Shakeel Butt Cc: Jeff Layton, Miklos Szeredi, Joanne Koong, Bernd Schubert, Zi Yan, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On 10.01.25 23:00, Shakeel Butt wrote: > On Fri, Jan 10, 2025 at 10:13:17PM +0100, David Hildenbrand wrote: >> On 10.01.25 21:28, Jeff Layton wrote: >>> On Thu, 2025-01-09 at 12:22 +0100, David Hildenbrand wrote: >>>> On 07.01.25 19:07, Shakeel Butt wrote: >>>>> On Tue, Jan 07, 2025 at 09:34:49AM +0100, David Hildenbrand wrote: >>>>>> On 06.01.25 19:17, Shakeel Butt wrote: >>>>>>> On Mon, Jan 06, 2025 at 11:19:42AM +0100, Miklos Szeredi wrote: >>>>>>>> On Fri, 3 Jan 2025 at 21:31, David Hildenbrand <david@redhat.com> wrote: >>>>>>>>> In any case, having movable pages be turned unmovable due to persistent >>>>>>>>> writaback is something that must be fixed, not worked around. Likely a >>>>>>>>> good topic for LSF/MM. >>>>>>>> >>>>>>>> Yes, this seems a good cross fs-mm topic. >>>>>>>> >>>>>>>> So the issue discussed here is that movable pages used for fuse >>>>>>>> page-cache cause a problems when memory needs to be compacted. The >>>>>>>> problem is either that >>>>>>>> >>>>>>>> - the page is skipped, leaving the physical memory block unmovable >>>>>>>> >>>>>>>> - the compaction is blocked for an unbounded time >>>>>>>> >>>>>>>> While the new AS_WRITEBACK_INDETERMINATE could potentially make things >>>>>>>> worse, the same thing happens on readahead, since the new page can be >>>>>>>> locked for an indeterminate amount of time, which can also block >>>>>>>> compaction, right? 
>>>>>> >>>>>> Yes, as memory hotplug + virtio-mem maintainer my bigger concern is these >>>>>> pages residing in ZONE_MOVABLE / MIGRATE_CMA areas where there *must not be >>>>>> unmovable pages ever*. Not triggered by an untrusted source, not triggered >>>>>> by an trusted source. >>>>>> >>>>>> It's a violation of core-mm principles. >>>>> >>>>> The "must not be unmovable pages ever" is a very strong statement and we >>>>> are violating it today and will keep violating it in future. Any >>>>> page/folio under lock or writeback or have reference taken or have been >>>>> isolated from their LRU is unmovable (most of the time for small period >>>>> of time). >>>> >>>> ^ this: "small period of time" is what I meant. >>>> >>>> Most of these things are known to not be problematic: retrying a couple >>>> of times makes it work, that's why migration keeps retrying. >>>> >>>> Again, as an example, we allow short-term O_DIRECT but disallow >>>> long-term page pinning. I think there were concerns at some point if >>>> O_DIRECT might also be problematic (I/O might take a while), but so far >>>> it was not a problem in practice that would make CMA allocations easily >>>> fail. >>>> >>>> vmsplice() is a known problem, because it behaves like O_DIRECT but >>>> actually triggers long-term pinning; IIRC David Howells has this on his >>>> todo list to fix. [I recall that seccomp disallows vmsplice by default >>>> right now] >>>> >>>> These operations are being done all over the place in kernel. >>>>> Miklos gave an example of readahead. >>>> >>>> I assume you mean "unmovable for a short time", correct, or can you >>>> point me at that specific example; I think I missed that. > > Please see https://lore.kernel.org/all/CAJfpegthP2enc9o1hV-izyAG9nHcD_tT8dKFxxzhdQws6pcyhQ@mail.gmail.com/ > >>>> >>>>> The per-CPU LRU caches are another >>>>> case where folios can get stuck for long period of time. >>>> >>>> Which is why memory offlining disables the lru cache. 
See >>>> lru_cache_disable(). Other users that care about that drain the LRU on >>>> all cpus. >>>> >>>>> Reclaim and >>>>> compaction can isolate a lot of folios that they need to have >>>>> too_many_isolated() checks. So, "must not be unmovable pages ever" is >>>>> impractical. >>>> >>>> "must only be short-term unmovable", better? > > Yes and you have clarified further below of the actual amount. > >>>> >>> >>> Still a little ambiguous. >>> >>> How short is "short-term"? Are we talking milliseconds or minutes? >> >> Usually a couple of seconds, max. For memory offlining, slightly longer >> times are acceptable; other things (in particular compaction or CMA >> allocations) will give up much faster. >> >>> >>> Imposing a hard timeout on writeback requests to unprivileged FUSE >>> servers might give us a better guarantee of forward-progress, but it >>> would probably have to be on the order of at least a minute or so to be >>> workable. >> >> Yes, and that might already be a bit too much, especially if stuck on >> waiting for folio writeback ... so ideally we could find a way to migrate >> these folios that are under writeback and it's not your ordinary disk driver >> that responds rather quickly. >> >> Right now we do it via these temp pages, and I can see how that's >> undesirable. >> >> For NFS etc. we probably never ran into this, because it's all used in >> fairly well managed environments and, well, I assume NFS easily outdates CMA >> and ZONE_MOVABLE :) >> >>>>>> >>>>> The point is that, yes we should aim to improve things but in iterations >>>>> and "must not be unmovable pages ever" is not something we can achieve >>>>> in one step. >>>> >>>> I agree with the "improve things in iterations", but as >>>> AS_WRITEBACK_INDETERMINATE has the FOLL_LONGTERM smell to it, I think we >>>> are making things worse. > > AS_WRITEBACK_INDETERMINATE is really a bad name we picked as it is still > causing confusion. 
It is a simple flag to avoid deadlock in the reclaim > code path and does not say anything about movability. > >>>> >>>> And as this discussion has been going on for too long, to summarize my >>>> point: there exist conditions where pages are short-term unmovable, and >>>> possibly some to be fixed that turn pages long-term unmovable (e.g., >>>> vmsplice); that does not mean that we can freely add new conditions that >>>> turn movable pages unmovable long-term or even forever. >>>> >>>> Again, this might be a good LSF/MM topic. If I would have the capacity I >>>> would suggest a topic around which things are know to cause pages to be >>>> short-term or long-term unmovable/unsplittable, and which can be >>>> handled, which not. Maybe I'll find the time to propose that as a topic. >>>> >>> >>> >>> This does sound like great LSF/MM fodder! I predict that this session >>> will run long! ;) >> >> Heh, fully agreed! :) > > I would like more targeted topic and for that I want us to at least > agree where we are disagring. Let me write down two statements and > please tell me where you disagree: I think we're mostly in agreement! > > 1. For a normal running FUSE server (without tmp pages), the lifetime of > writeback state of fuse folios falls under "short-term unmovable" bucket > as it does not differ in anyway from anyother filesystems handling > writeback folios. That's the expectation, yes. As long as the FUSE server is able to make progress, the expectation is that it's just like NFS etc. If it isn't able to make progress (i.e., crash), the expectation is that everything will get cleaned up either way. I wonder if there could be valid scenario where the FUSE server is no longer able to make progress (ignoring network outages), or the progress might start being extremely slow such that it becomes a problem. In contrast to in-kernel FSs, one can do some fancy stuff with fuse where writing a page could possibly consume a lot of memory in user-space. 
Likely, in this case we might just blame it on the admin who agreed to run this (trusted) fuse server. > > 2. For a buggy or untrusted FUSE server (without tmp pages), the > lifetime of writeback state of fuse folios can be arbitrarily long and > we need some mechanism to limit it. Yes. Especially in 1), we really want to wait for writeback to finish, just like for any other filesystem. For 2), we want a way to keep writeback from getting stuck for a long time, so that we can still make progress and migrate these pages. -- Cheers, David / dhildenb
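To make the two cases above concrete, here is a toy decision function in the spirit of the patch subject ("skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings"). It is a hedged userspace sketch: `struct toy_mapping`, `struct toy_folio_state` and `toy_migration_action()` are hypothetical names, and the real check in mm/migrate.c may be structured differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* What a migration caller does with one folio in this toy model. */
enum toy_action {
	TOY_MIGRATE_NOW,	/* not under writeback: migrate directly */
	TOY_WAIT_FOR_WRITEBACK,	/* ordinary filesystem: wait, then migrate */
	TOY_SKIP_FOLIO,		/* flagged mapping: do not wait, skip */
};

/* Stand-in for an address_space carrying the proposed flag. */
struct toy_mapping {
	bool writeback_indeterminate;
};

/* Stand-in for the folio state the decision depends on. */
struct toy_folio_state {
	bool under_writeback;
	const struct toy_mapping *mapping;	/* may be NULL */
};

enum toy_action toy_migration_action(const struct toy_folio_state *f)
{
	assert(f != NULL);
	if (!f->under_writeback)
		return TOY_MIGRATE_NOW;
	if (f->mapping && f->mapping->writeback_indeterminate)
		return TOY_SKIP_FOLIO;
	return TOY_WAIT_FOR_WRITEBACK;
}
```

Because the cover letter says FUSE sets the flag on its mappings, a check of this shape treats a well-behaved server (case 1) and a stuck one (case 2) alike: both are skipped rather than waited on. Telling the two apart would need extra information, such as a bound on writeback time.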
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2025-01-13 15:27 ` David Hildenbrand @ 2025-01-13 21:44 ` Jeff Layton 2025-01-14 8:38 ` Miklos Szeredi 0 siblings, 1 reply; 124+ messages in thread From: Jeff Layton @ 2025-01-13 21:44 UTC (permalink / raw) To: David Hildenbrand, Shakeel Butt Cc: Miklos Szeredi, Joanne Koong, Bernd Schubert, Zi Yan, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On Mon, 2025-01-13 at 16:27 +0100, David Hildenbrand wrote: > On 10.01.25 23:00, Shakeel Butt wrote: > > On Fri, Jan 10, 2025 at 10:13:17PM +0100, David Hildenbrand wrote: > > > On 10.01.25 21:28, Jeff Layton wrote: > > > > On Thu, 2025-01-09 at 12:22 +0100, David Hildenbrand wrote: > > > > > On 07.01.25 19:07, Shakeel Butt wrote: > > > > > > On Tue, Jan 07, 2025 at 09:34:49AM +0100, David Hildenbrand wrote: > > > > > > > On 06.01.25 19:17, Shakeel Butt wrote: > > > > > > > > On Mon, Jan 06, 2025 at 11:19:42AM +0100, Miklos Szeredi wrote: > > > > > > > > > On Fri, 3 Jan 2025 at 21:31, David Hildenbrand <david@redhat.com> wrote: > > > > > > > > > > In any case, having movable pages be turned unmovable due to persistent > > > > > > > > > > writaback is something that must be fixed, not worked around. Likely a > > > > > > > > > > good topic for LSF/MM. > > > > > > > > > > > > > > > > > > Yes, this seems a good cross fs-mm topic. > > > > > > > > > > > > > > > > > > So the issue discussed here is that movable pages used for fuse > > > > > > > > > page-cache cause a problems when memory needs to be compacted. 
The > > > > > > > > > problem is either that > > > > > > > > > > > > > > > > > > - the page is skipped, leaving the physical memory block unmovable > > > > > > > > > > > > > > > > > > - the compaction is blocked for an unbounded time > > > > > > > > > > > > > > > > > > While the new AS_WRITEBACK_INDETERMINATE could potentially make things > > > > > > > > > worse, the same thing happens on readahead, since the new page can be > > > > > > > > > locked for an indeterminate amount of time, which can also block > > > > > > > > > compaction, right? > > > > > > > > > > > > > > Yes, as memory hotplug + virtio-mem maintainer my bigger concern is these > > > > > > > pages residing in ZONE_MOVABLE / MIGRATE_CMA areas where there *must not be > > > > > > > unmovable pages ever*. Not triggered by an untrusted source, not triggered > > > > > > > by an trusted source. > > > > > > > > > > > > > > It's a violation of core-mm principles. > > > > > > > > > > > > The "must not be unmovable pages ever" is a very strong statement and we > > > > > > are violating it today and will keep violating it in future. Any > > > > > > page/folio under lock or writeback or have reference taken or have been > > > > > > isolated from their LRU is unmovable (most of the time for small period > > > > > > of time). > > > > > > > > > > ^ this: "small period of time" is what I meant. > > > > > > > > > > Most of these things are known to not be problematic: retrying a couple > > > > > of times makes it work, that's why migration keeps retrying. > > > > > > > > > > Again, as an example, we allow short-term O_DIRECT but disallow > > > > > long-term page pinning. I think there were concerns at some point if > > > > > O_DIRECT might also be problematic (I/O might take a while), but so far > > > > > it was not a problem in practice that would make CMA allocations easily > > > > > fail. 
> > > > > > > > > > vmsplice() is a known problem, because it behaves like O_DIRECT but > > > > > actually triggers long-term pinning; IIRC David Howells has this on his > > > > > todo list to fix. [I recall that seccomp disallows vmsplice by default > > > > > right now] > > > > > > > > > > These operations are being done all over the place in kernel. > > > > > > Miklos gave an example of readahead. > > > > > > > > > > I assume you mean "unmovable for a short time", correct, or can you > > > > > point me at that specific example; I think I missed that. > > > > Please see https://lore.kernel.org/all/CAJfpegthP2enc9o1hV-izyAG9nHcD_tT8dKFxxzhdQws6pcyhQ@mail.gmail.com/ > > > > > > > > > > > > > The per-CPU LRU caches are another > > > > > > case where folios can get stuck for long period of time. > > > > > > > > > > Which is why memory offlining disables the lru cache. See > > > > > lru_cache_disable(). Other users that care about that drain the LRU on > > > > > all cpus. > > > > > > > > > > > Reclaim and > > > > > > compaction can isolate a lot of folios that they need to have > > > > > > too_many_isolated() checks. So, "must not be unmovable pages ever" is > > > > > > impractical. > > > > > > > > > > "must only be short-term unmovable", better? > > > > Yes and you have clarified further below of the actual amount. > > > > > > > > > > > > > > > Still a little ambiguous. > > > > > > > > How short is "short-term"? Are we talking milliseconds or minutes? > > > > > > Usually a couple of seconds, max. For memory offlining, slightly longer > > > times are acceptable; other things (in particular compaction or CMA > > > allocations) will give up much faster. > > > > > > > > > > > Imposing a hard timeout on writeback requests to unprivileged FUSE > > > > servers might give us a better guarantee of forward-progress, but it > > > > would probably have to be on the order of at least a minute or so to be > > > > workable. 
> > > > > > Yes, and that might already be a bit too much, especially if stuck on > > > waiting for folio writeback ... so ideally we could find a way to migrate > > > these folios that are under writeback and it's not your ordinary disk driver > > > that responds rather quickly. > > > > > > Right now we do it via these temp pages, and I can see how that's > > > undesirable. > > > > > > For NFS etc. we probably never ran into this, because it's all used in > > > fairly well managed environments and, well, I assume NFS easily outdates CMA > > > and ZONE_MOVABLE :) > > > > > > > > > > > > > > > > The point is that, yes we should aim to improve things but in iterations > > > > > > and "must not be unmovable pages ever" is not something we can achieve > > > > > > in one step. > > > > > > > > > > I agree with the "improve things in iterations", but as > > > > > AS_WRITEBACK_INDETERMINATE has the FOLL_LONGTERM smell to it, I think we > > > > > are making things worse. > > > > AS_WRITEBACK_INDETERMINATE is really a bad name we picked as it is still > > causing confusion. It is a simple flag to avoid deadlock in the reclaim > > code path and does not say anything about movability. > > > > > > > > > > > > And as this discussion has been going on for too long, to summarize my > > > > > point: there exist conditions where pages are short-term unmovable, and > > > > > possibly some to be fixed that turn pages long-term unmovable (e.g., > > > > > vmsplice); that does not mean that we can freely add new conditions that > > > > > turn movable pages unmovable long-term or even forever. > > > > > > > > > > Again, this might be a good LSF/MM topic. If I would have the capacity I > > > > > would suggest a topic around which things are know to cause pages to be > > > > > short-term or long-term unmovable/unsplittable, and which can be > > > > > handled, which not. Maybe I'll find the time to propose that as a topic. 
> > > >
> > > > This does sound like great LSF/MM fodder! I predict that this session
> > > > will run long! ;)
> > >
> > > Heh, fully agreed! :)
> >
> > I would like a more targeted topic, and for that I want us to at least
> > agree where we are disagreeing. Let me write down two statements and
> > please tell me where you disagree:
>
> I think we're mostly in agreement!
>
> >
> > 1. For a normal running FUSE server (without tmp pages), the lifetime of
> > writeback state of fuse folios falls under "short-term unmovable" bucket
> > as it does not differ in any way from any other filesystems handling
> > writeback folios.
>
> That's the expectation, yes. As long as the FUSE server is able to make
> progress, the expectation is that it's just like NFS etc. If it isn't
> able to make progress (i.e., crash), the expectation is that everything
> will get cleaned up either way.
>
> I wonder if there could be a valid scenario where the FUSE server is no
> longer able to make progress (ignoring network outages), or the progress
> might start being extremely slow such that it becomes a problem. In
> contrast to in-kernel FSs, one can do some fancy stuff with fuse where
> writing a page could possibly consume a lot of memory in user-space.
> Likely, in this case we might just blame it on the admin that agreed to
> running this (trusted) fuse server.
>
> >
> > 2. For a buggy or untrusted FUSE server (without tmp pages), the
> > lifetime of writeback state of fuse folios can be arbitrarily long and
> > we need some mechanism to limit it.
>
> Yes.
>
>
> Especially in 1), we really want to wait for writeback to finish, just
> like for any other filesystem. For 2), we want a way so writeback will
> not get stuck for a long time, but are able to make progress and migrate
> these pages.
> What if we were to allow the kernel to kill off an unprivileged FUSE server that was "misbehaving" [1], clean any dirty pagecache pages that it has, and set writeback errors on the corresponding FUSE inodes [2]? We'd still need a rather long timeout (on the order of at least a minute or so, by default). Would that be enough to assuage concerns about unprivileged servers pinning pages indefinitely? Buggy servers are still a problem, but there's not much we can do about that. There are a lot of details we'd have to sort out, so I'm also interested in whether anyone (Miklos? Bernd?) would find this basic approach objectionable. [1]: for some definition of misbehavior. Probably a writeback timeout of some sort but maybe there would be other criteria too. [2]: or maybe just make them eligible to be cleaned without talking to the server, should the VM wish it. -- Jeff Layton <jlayton@kernel.org> ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2025-01-13 21:44 ` Jeff Layton @ 2025-01-14 8:38 ` Miklos Szeredi 2025-01-14 9:40 ` Miklos Szeredi 2025-01-14 15:44 ` Jeff Layton 0 siblings, 2 replies; 124+ messages in thread From: Miklos Szeredi @ 2025-01-14 8:38 UTC (permalink / raw) To: Jeff Layton Cc: David Hildenbrand, Shakeel Butt, Joanne Koong, Bernd Schubert, Zi Yan, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On Mon, 13 Jan 2025 at 22:44, Jeff Layton <jlayton@kernel.org> wrote: > What if we were to allow the kernel to kill off an unprivileged FUSE > server that was "misbehaving" [1], clean any dirty pagecache pages that > it has, and set writeback errors on the corresponding FUSE inodes [2]? > We'd still need a rather long timeout (on the order of at least a > minute or so, by default). How would this be different from Joanne's current request timeout patch? I think it makes sense, but it *has* to be opt in, for the same reason that NFS soft timeout is opt in, so it can't really solve the page migration issue generally. Also page reading has exactly the same issues, so fixing writeback is not enough. Maybe an explicit callback from the migration code to the filesystem would work. I.e. move the complexity of dealing with migration for problematic filesystems (netfs/fuse) to the filesystem itself. I'm not sure how this would actually look, as I'm unfamiliar with the details of page migration, but I guess it shouldn't be too difficult to implement for fuse at least. Thanks, Miklos > > Would that be enough to assuage concerns about unprivileged servers > pinning pages indefinitely? Buggy servers are still a problem, but > there's not much we can do about that. > > There are a lot of details we'd have to sort out, so I'm also > interested in whether anyone (Miklos? Bernd?) would find this basic > approach objectionable. 
> > [1]: for some definition of misbehavior. Probably a writeback > timeout of some sort but maybe there would be other criteria too. > > [2]: or maybe just make them eligible to be cleaned without talking to > the server, should the VM wish it. > -- > Jeff Layton <jlayton@kernel.org> ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2025-01-14 8:38 ` Miklos Szeredi @ 2025-01-14 9:40 ` Miklos Szeredi 2025-01-14 9:55 ` Bernd Schubert 2025-01-14 15:49 ` Jeff Layton 2025-01-14 15:44 ` Jeff Layton 1 sibling, 2 replies; 124+ messages in thread From: Miklos Szeredi @ 2025-01-14 9:40 UTC (permalink / raw) To: Jeff Layton Cc: David Hildenbrand, Shakeel Butt, Joanne Koong, Bernd Schubert, Zi Yan, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko

On Tue, 14 Jan 2025 at 09:38, Miklos Szeredi <miklos@szeredi.hu> wrote:

> Maybe an explicit callback from the migration code to the filesystem
> would work. I.e. move the complexity of dealing with migration for
> problematic filesystems (netfs/fuse) to the filesystem itself. I'm
> not sure how this would actually look, as I'm unfamiliar with the
> details of page migration, but I guess it shouldn't be too difficult
> to implement for fuse at least.

Thinking a bit...

1) reading pages

Pages are allocated (PG_locked set, PG_uptodate cleared) and passed to
->readpages(), which may make the pages uptodate asynchronously. If a
page is unlocked but not set uptodate, then the caller is supposed to
retry the reading, at least that's how I interpret
filemap_get_pages(). This means that it's fine to migrate the page
before it's actually filled with data, since the caller will retry.

It also means that it would be sufficient to allocate the page itself
just before filling it in, if there was a mechanism to keep track of
these "not yet filled" pages. But that's probably off topic.

2) writing pages

When the page isn't actually being copied, the writeback could be
cancelled and the page redirtied. At which point it's fine to migrate
it. The problem is with pages that are spliced from /dev/fuse and
control over when it's being accessed is lost.
Note: this is not actually done right now on cached pages, since writeback always copies to temp pages. So we can continue to do that when doing a splice and not risk any performance regressions. Am I missing something? Thanks, Miklos ^ permalink raw reply [flat|nested] 124+ messages in thread
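The read-side reasoning in the message above (locked means in flight, unlocked but not uptodate means retry) can be written down as a tiny state check. This is a toy model only: struct toy_page and toy_read_next_action are hypothetical names, the two booleans merely stand in for PG_locked and PG_uptodate, and nothing here is taken from filemap_get_pages() itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for the PG_locked / PG_uptodate page flags. */
struct toy_page {
    bool locked;
    bool uptodate;
};

/* What a reader does next, given the state of the page it found. */
enum toy_read_action {
    TOY_READ_WAIT,   /* page still locked: the read is in flight */
    TOY_READ_RETRY,  /* unlocked but never filled: issue the read again */
    TOY_READ_DONE,   /* unlocked and uptodate: data can be used */
};

enum toy_read_action toy_read_next_action(const struct toy_page *page)
{
    assert(page != NULL);

    if (page->locked)
        return TOY_READ_WAIT;
    if (!page->uptodate)
        return TOY_READ_RETRY;
    return TOY_READ_DONE;
}
```

In this model, a page that was replaced before being filled simply shows up as unlocked and not uptodate, so the reader falls into the retry branch, which is the property the message relies on when it says migrating such a page is fine.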
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2025-01-14 9:40 ` Miklos Szeredi @ 2025-01-14 9:55 ` Bernd Schubert 2025-01-14 10:07 ` Miklos Szeredi 2025-01-14 15:49 ` Jeff Layton 1 sibling, 1 reply; 124+ messages in thread From: Bernd Schubert @ 2025-01-14 9:55 UTC (permalink / raw) To: Miklos Szeredi, Jeff Layton Cc: David Hildenbrand, Shakeel Butt, Joanne Koong, Zi Yan, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On 1/14/25 10:40, Miklos Szeredi wrote: > On Tue, 14 Jan 2025 at 09:38, Miklos Szeredi <miklos@szeredi.hu> wrote: > >> Maybe an explicit callback from the migration code to the filesystem >> would work. I.e. move the complexity of dealing with migration for >> problematic filesystems (netfs/fuse) to the filesystem itself. I'm >> not sure how this would actually look, as I'm unfamiliar with the >> details of page migration, but I guess it shouldn't be too difficult >> to implement for fuse at least. > > Thinking a bit... > > 1) reading pages > > Pages are allocated (PG_locked set, PG_uptodate cleared) and passed to > ->readpages(), which may make the pages uptodate asynchronously. If a > page is unlocked but not set uptodate, then caller is supposed to > retry the reading, at least that's how I interpret > filemap_get_pages(). This means that it's fine to migrate the page > before it's actually filled with data, since the caller will retry. > > It also means that it would be sufficient to allocate the page itself > just before filling it in, if there was a mechanism to keep track of > these "not yet filled" pages. But that probably off topic. With /dev/fuse buffer copies should be easy - just allocate the page on buffer copy, control is in libfuse. With splice you really need a page state. > > 2) writing pages > > When the page isn't actually being copied, the writeback could be > cancelled and the page redirtied. 
At which point it's fine to migrate > it. The problem is with pages that are spliced from /dev/fuse and > control over when it's being accessed is lost. Note: this is not > actually done right now on cached pages, since writeback always copies > to temp pages. So we can continue to do that when doing a splice and > not risk any performance regressions. > I wrote this before already - what is the advantage of a tmp page copy over /dev/fuse buffer copy? I.e. I wonder if we need splice at all here. Thanks, Bernd ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2025-01-14 9:55 ` Bernd Schubert @ 2025-01-14 10:07 ` Miklos Szeredi 2025-01-14 18:07 ` Joanne Koong 2025-01-14 20:51 ` Joanne Koong 0 siblings, 2 replies; 124+ messages in thread From: Miklos Szeredi @ 2025-01-14 10:07 UTC (permalink / raw) To: Bernd Schubert Cc: Jeff Layton, David Hildenbrand, Shakeel Butt, Joanne Koong, Zi Yan, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On Tue, 14 Jan 2025 at 10:55, Bernd Schubert <bernd.schubert@fastmail.fm> wrote: > > > > On 1/14/25 10:40, Miklos Szeredi wrote: > > On Tue, 14 Jan 2025 at 09:38, Miklos Szeredi <miklos@szeredi.hu> wrote: > > > >> Maybe an explicit callback from the migration code to the filesystem > >> would work. I.e. move the complexity of dealing with migration for > >> problematic filesystems (netfs/fuse) to the filesystem itself. I'm > >> not sure how this would actually look, as I'm unfamiliar with the > >> details of page migration, but I guess it shouldn't be too difficult > >> to implement for fuse at least. > > > > Thinking a bit... > > > > 1) reading pages > > > > Pages are allocated (PG_locked set, PG_uptodate cleared) and passed to > > ->readpages(), which may make the pages uptodate asynchronously. If a > > page is unlocked but not set uptodate, then caller is supposed to > > retry the reading, at least that's how I interpret > > filemap_get_pages(). This means that it's fine to migrate the page > > before it's actually filled with data, since the caller will retry. > > > > It also means that it would be sufficient to allocate the page itself > > just before filling it in, if there was a mechanism to keep track of > > these "not yet filled" pages. But that probably off topic. > > With /dev/fuse buffer copies should be easy - just allocate the page > on buffer copy, control is in libfuse. 

I think the issue is with the generic page cache code, which currently
relies on the PG_locked flag on the allocated but not yet filled page.
If the generic code were able to keep track of "under
construction" ranges without relying on an allocated page, then the
filesystem could allocate the page just before copying the data,
insert the page into the cache, and mark the relevant portion of the
file uptodate.

> With splice you really need
> a page state.

It's not possible to splice a not-uptodate page.

> I wrote this before already - what is the advantage of a tmp page copy
> over /dev/fuse buffer copy? I.e. I wonder if we need splice at all here.

Splice seems a dead end, but we probably need to continue supporting
it for a while for backward compatibility.

Thanks,
Miklos

^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2025-01-14 10:07 ` Miklos Szeredi @ 2025-01-14 18:07 ` Joanne Koong 2025-01-14 18:58 ` Miklos Szeredi 2025-01-14 20:51 ` Joanne Koong 1 sibling, 1 reply; 124+ messages in thread From: Joanne Koong @ 2025-01-14 18:07 UTC (permalink / raw) To: Miklos Szeredi Cc: Bernd Schubert, Jeff Layton, David Hildenbrand, Shakeel Butt, Zi Yan, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On Tue, Jan 14, 2025 at 2:07 AM Miklos Szeredi <miklos@szeredi.hu> wrote: > > On Tue, 14 Jan 2025 at 10:55, Bernd Schubert <bernd.schubert@fastmail.fm> wrote: > > > > > > > > On 1/14/25 10:40, Miklos Szeredi wrote: > > > On Tue, 14 Jan 2025 at 09:38, Miklos Szeredi <miklos@szeredi.hu> wrote: > > > > > >> Maybe an explicit callback from the migration code to the filesystem > > >> would work. I.e. move the complexity of dealing with migration for > > >> problematic filesystems (netfs/fuse) to the filesystem itself. I'm > > >> not sure how this would actually look, as I'm unfamiliar with the > > >> details of page migration, but I guess it shouldn't be too difficult > > >> to implement for fuse at least. > > > > > > Thinking a bit... > > > > > > 1) reading pages > > > > > > Pages are allocated (PG_locked set, PG_uptodate cleared) and passed to > > > ->readpages(), which may make the pages uptodate asynchronously. If a > > > page is unlocked but not set uptodate, then caller is supposed to > > > retry the reading, at least that's how I interpret > > > filemap_get_pages(). This means that it's fine to migrate the page > > > before it's actually filled with data, since the caller will retry. > > > > > > It also means that it would be sufficient to allocate the page itself > > > just before filling it in, if there was a mechanism to keep track of > > > these "not yet filled" pages. But that probably off topic. 
> > > > With /dev/fuse buffer copies should be easy - just allocate the page > > on buffer copy, control is in libfuse. > > I think the issue is with generic page cache code, which currently > relies on the PG_locked flag on the allocated but not yet filled page. > If the generic code would be able to keep track of "under > construction" ranges without relying on an allocated page, then the > filesystem could allocate the page just before copying the data, > insert the page into the cache mark the relevant portion of the file > uptodate. > > > With splice you really need > > a page state. > > It's not possible to splice a not-uptodate page. > > > I wrote this before already - what is the advantage of a tmp page copy > > over /dev/fuse buffer copy? I.e. I wonder if we need splice at all here. > > Splice seems a dead end, but we probably need to continue supporting > it for a while for backward compatibility. > There was a previous discussion about splice and tmp pages here [1], I see the following issues with having splice default to using tmp pages as a workaround: - my understanding is that the majority of use cases do use splice (eg iirc, libfuse does as well), in which case there's no point to this patchset then - codewise, imo this gets messy (eg we would still need the rb tree and would now need to check writeback against folio writeback state and against the rb tree) - for the large folios work in [2], the implementation imo is pretty clean because it's rebased on top of this patchset that removes the tmp pages and rb tree. If we still have tmp pages, then this gets very gnarly. 
There's not a good way I see to handle large folios in the rb tree
given this scenario:

a) writeback on a large folio is issued
b) we copy it to a tmp folio and clear writeback on it since it's
being spliced, we add this writeback request to the rb tree
c) the folio in the pagecache is evicted
d) another write occurs on a larger range that encompasses the range
in the writeback in a) or on a subset of it

Maybe this is doable with some other data structure instead of the rb
tree (eg an xarray with refcounts maybe?), but it'd be ideal if we
could find a solution (my guess is this would have to come from the
mm layer?) that obviates tmp pages altogether.

Thanks,
Joanne

[1] https://lore.kernel.org/linux-fsdevel/CAJnrk1YwNw7C=EMfKQzN88Zq_2Qih5Te_bfkeaOf=tG+L3u9eA@mail.gmail.com/
[2] https://lore.kernel.org/linux-fsdevel/20241213221818.322371-1-joannelkoong@gmail.com/

> Thanks,
> Miklos

^ permalink raw reply [flat|nested] 124+ messages in thread
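Step d) of the scenario above comes down to comparing a new write range against a range that is still tracked as in-flight writeback. A minimal sketch of those two comparisons follows; struct toy_wb_range and both helpers are hypothetical names for illustration, ranges are half-open byte ranges, and none of this reflects how the FUSE rb tree is actually keyed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Half-open byte range [start, end) within a file. */
struct toy_wb_range {
    uint64_t start;
    uint64_t end;
};

/* True if the two ranges share at least one byte. */
bool toy_ranges_overlap(const struct toy_wb_range *a,
                        const struct toy_wb_range *b)
{
    assert(a != NULL && b != NULL);
    return a->start < b->end && b->start < a->end;
}

/* True if 'inner' lies entirely within 'outer'. */
bool toy_range_contains(const struct toy_wb_range *outer,
                        const struct toy_wb_range *inner)
{
    assert(outer != NULL && inner != NULL);
    return outer->start <= inner->start && inner->end <= outer->end;
}
```

Both cases in d) show up as an overlap: a larger write contains the tracked range, and a subset write is contained by it. The hard part described in the message is not the comparison but keeping one tracked entry correct when the new folio's size differs from the one under writeback.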
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2025-01-14 18:07 ` Joanne Koong @ 2025-01-14 18:58 ` Miklos Szeredi 2025-01-14 19:12 ` Joanne Koong 0 siblings, 1 reply; 124+ messages in thread From: Miklos Szeredi @ 2025-01-14 18:58 UTC (permalink / raw) To: Joanne Koong Cc: Bernd Schubert, Jeff Layton, David Hildenbrand, Shakeel Butt, Zi Yan, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On Tue, 14 Jan 2025 at 19:08, Joanne Koong <joannelkoong@gmail.com> wrote: > - my understanding is that the majority of use cases do use splice (eg > iirc, libfuse does as well), in which case there's no point to this > patchset then If it turns out that non-splice writes are more performant, then libfuse can be fixed to use non-splice by default. It's not as clear cut though, since write through (which is also the default in libfuse, AFAIK) should not be affected by all this, since that never used tmp pages. > - codewise, imo this gets messy (eg we would still need the rb tree > and would now need to check writeback against folio writeback state > and against the rb tree) I'm thinking of something slightly different: remove the current tmp page mess, but instead of duplicating a page ref on splice, fall back to copying the cache page (see the user_pages case in fuse_copy_page()). This should have very similar performance to what we have today, but allows us to deal with page accesses the same way for both regular and splice I/O on /dev/fuse. Thanks, Miklos ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2025-01-14 18:58 ` Miklos Szeredi @ 2025-01-14 19:12 ` Joanne Koong 2025-01-14 20:00 ` Miklos Szeredi 2025-01-14 20:29 ` Jeff Layton 0 siblings, 2 replies; 124+ messages in thread From: Joanne Koong @ 2025-01-14 19:12 UTC (permalink / raw) To: Miklos Szeredi Cc: Bernd Schubert, Jeff Layton, David Hildenbrand, Shakeel Butt, Zi Yan, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On Tue, Jan 14, 2025 at 10:58 AM Miklos Szeredi <miklos@szeredi.hu> wrote: > > On Tue, 14 Jan 2025 at 19:08, Joanne Koong <joannelkoong@gmail.com> wrote: > > > - my understanding is that the majority of use cases do use splice (eg > > iirc, libfuse does as well), in which case there's no point to this > > patchset then > > If it turns out that non-splice writes are more performant, then > libfuse can be fixed to use non-splice by default. It's not as clear > cut though, since write through (which is also the default in libfuse, > AFAIK) should not be affected by all this, since that never used tmp > pages. My thinking was that spliced writes without tmp pages would be fastest, then non-splice writes w/out tmp pages and spliced writes w/ would be roughly the same. But i'd need to benchmark and verify this assumption. > > > - codewise, imo this gets messy (eg we would still need the rb tree > > and would now need to check writeback against folio writeback state > > and against the rb tree) > > I'm thinking of something slightly different: remove the current tmp > page mess, but instead of duplicating a page ref on splice, fall back > to copying the cache page (see the user_pages case in > fuse_copy_page()). This should have very similar performance to what > we have today, but allows us to deal with page accesses the same way > for both regular and splice I/O on /dev/fuse. 
If we copy the cache page, do we not have the same issue with needing an rb tree to track writeback state since writeback on the original folio would be immediately cleared? Thanks, Joanne > > Thanks, > Miklos ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2025-01-14 19:12 ` Joanne Koong @ 2025-01-14 20:00 ` Miklos Szeredi 2025-01-14 20:29 ` Jeff Layton 1 sibling, 0 replies; 124+ messages in thread From: Miklos Szeredi @ 2025-01-14 20:00 UTC (permalink / raw) To: Joanne Koong Cc: Bernd Schubert, Jeff Layton, David Hildenbrand, Shakeel Butt, Zi Yan, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On Tue, 14 Jan 2025 at 20:12, Joanne Koong <joannelkoong@gmail.com> wrote: > If we copy the cache page, do we not have the same issue with needing > an rb tree to track writeback state since writeback on the original > folio would be immediately cleared? Writeback would not be cleared in that case. The copy would be to guarantee that the page can be migrated. Starting migration for an under-writeback page would need some new mechanism, because currently that's not possible. But I realize now that even though write-through does not involve PG_writeback, doing splice will result in those cache pages being referenced for an indefinite amount of time, which can deny migration. Ugh. Same as page reading, this exists today. Thanks, Miklos ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2025-01-14 19:12 ` Joanne Koong 2025-01-14 20:00 ` Miklos Szeredi @ 2025-01-14 20:29 ` Jeff Layton 2025-01-14 21:40 ` Bernd Schubert 1 sibling, 1 reply; 124+ messages in thread From: Jeff Layton @ 2025-01-14 20:29 UTC (permalink / raw) To: Joanne Koong, Miklos Szeredi Cc: Bernd Schubert, David Hildenbrand, Shakeel Butt, Zi Yan, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On Tue, 2025-01-14 at 11:12 -0800, Joanne Koong wrote: > On Tue, Jan 14, 2025 at 10:58 AM Miklos Szeredi <miklos@szeredi.hu> wrote: > > > > On Tue, 14 Jan 2025 at 19:08, Joanne Koong <joannelkoong@gmail.com> wrote: > > > > > - my understanding is that the majority of use cases do use splice (eg > > > iirc, libfuse does as well), in which case there's no point to this > > > patchset then > > > > If it turns out that non-splice writes are more performant, then > > libfuse can be fixed to use non-splice by default. It's not as clear > > cut though, since write through (which is also the default in libfuse, > > AFAIK) should not be affected by all this, since that never used tmp > > pages. > > My thinking was that spliced writes without tmp pages would be > fastest, then non-splice writes w/out tmp pages and spliced writes w/ > would be roughly the same. But i'd need to benchmark and verify this > assumption. > A somewhat related question: is Bernd's io_uring patchset susceptible to the same problem as splice() in this situation? IOW, does the kernel inline pagecache pages into the io_uring buffers? If it doesn't have the same issue, then maybe we should think about using that to make a clean behavior break. Gate large folios and not using bounce pages behind io_uring. That would mean dealing with multiple IO paths, but that might still be simpler than trying to deal with multiple folio sizes in the writeback rbtree tracking. 
> > > > > - codewise, imo this gets messy (eg we would still need the rb tree > > > and would now need to check writeback against folio writeback state > > > and against the rb tree) > > > > I'm thinking of something slightly different: remove the current tmp > > page mess, but instead of duplicating a page ref on splice, fall back > > to copying the cache page (see the user_pages case in > > fuse_copy_page()). This should have very similar performance to what > > we have today, but allows us to deal with page accesses the same way > > for both regular and splice I/O on /dev/fuse. > > If we copy the cache page, do we not have the same issue with needing > an rb tree to track writeback state since writeback on the original > folio would be immediately cleared? > -- Jeff Layton <jlayton@kernel.org> ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2025-01-14 20:29 ` Jeff Layton @ 2025-01-14 21:40 ` Bernd Schubert 2025-01-23 16:06 ` Pavel Begunkov 0 siblings, 1 reply; 124+ messages in thread From: Bernd Schubert @ 2025-01-14 21:40 UTC (permalink / raw) To: Jeff Layton, Joanne Koong, Miklos Szeredi Cc: David Hildenbrand, Shakeel Butt, Zi Yan, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko, David Wei, Ming Lei, Pavel Begunkov, Jens Axboe On 1/14/25 21:29, Jeff Layton wrote: > On Tue, 2025-01-14 at 11:12 -0800, Joanne Koong wrote: >> On Tue, Jan 14, 2025 at 10:58 AM Miklos Szeredi <miklos@szeredi.hu> wrote: >>> >>> On Tue, 14 Jan 2025 at 19:08, Joanne Koong <joannelkoong@gmail.com> wrote: >>> >>>> - my understanding is that the majority of use cases do use splice (eg >>>> iirc, libfuse does as well), in which case there's no point to this >>>> patchset then >>> >>> If it turns out that non-splice writes are more performant, then >>> libfuse can be fixed to use non-splice by default. It's not as clear >>> cut though, since write through (which is also the default in libfuse, >>> AFAIK) should not be affected by all this, since that never used tmp >>> pages. >> >> My thinking was that spliced writes without tmp pages would be >> fastest, then non-splice writes w/out tmp pages and spliced writes w/ >> would be roughly the same. But i'd need to benchmark and verify this >> assumption. >> > > A somewhat related question: is Bernd's io_uring patchset susceptible > to the same problem as splice() in this situation? IOW, does the kernel > inline pagecache pages into the io_uring buffers? Right now it does a full copy, similar as non-splice /dev/fuse read/write. I.e. it doesn't have zero copy either yet. > > If it doesn't have the same issue, then maybe we should think about > using that to make a clean behavior break. 
Gate large folios and not
> using bounce pages behind io_uring.
>
> That would mean dealing with multiple IO paths, but that might still be
> simpler than trying to deal with multiple folio sizes in the writeback
> rbtree tracking.

My personal thinking regarding ZC was to hook into Ming's work. I
didn't go into deep details, but from an interface point of view it
sounded nice, like

- Application write
- fuse-client/kernel request/CQEs with write attempts
- fuse server prepares group SQE, group leader prepares
  the write buffer, other group members are consumers
  using their buffer part for the final destination
- release of leader buffer when other group members
  are done

Though, Pavel and Jens have concerns and have a different suggestion
and at least the example Pavel gave looks like splice

https://lore.kernel.org/all/f3a83b6a-c4b9-4933-998d-ebd1d09e3405@gmail.com/

I think David is looking into a different ZC solution, but I
don't have details on that.

Maybe the fuse-io-uring and ublk splice approach should be another LSFMM
topic.

Thanks,
Bernd

^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2025-01-14 21:40 ` Bernd Schubert
@ 2025-01-23 16:06 ` Pavel Begunkov
  0 siblings, 0 replies; 124+ messages in thread
From: Pavel Begunkov @ 2025-01-23 16:06 UTC (permalink / raw)
  To: Bernd Schubert, Jeff Layton, Joanne Koong, Miklos Szeredi
  Cc: David Hildenbrand, Shakeel Butt, Zi Yan, linux-fsdevel, jefflexu,
	josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador,
	Michal Hocko, David Wei, Ming Lei, Jens Axboe

On 1/14/25 21:40, Bernd Schubert wrote:
...
> My personal thinking regarding ZC was to hook into Ming's work. I
> didn't go into deep details, but from an interface point of view it
> sounded nice, like
>
> - Application write
> - fuse-client/kernel request/CQEs with write attempts
> - fuse server prepares group SQE, group leader prepares
>   the write buffer, other group members are consumers
>   using their buffer part for the final destination
> - release of leader buffer when other group members
>   are done
>
>
> Though, Pavel and Jens have concerns and have a different suggestion
> and at least the example Pavel gave looks like splice

That's the same approach but with an adjusted API, i.e. instead of
caging into groups it uses an io_uring private table, but in both
cases one request provides a buffer and subsequent requests do IO with
that buffer. And fwiw, it has nothing to do with pipes.

> https://lore.kernel.org/all/f3a83b6a-c4b9-4933-998d-ebd1d09e3405@gmail.com/

That one is simple and easy to maintain, we can trivially pick it
up if needed.

> I think David is looking into a different ZC solution, but I
> don't have details on that.
> Maybe fuse-io-uring and ublk splice approach should be another LSFMM
> topic.

Unfortunately, I won't make it, but maybe Jens is planning to go.

--
Pavel Begunkov
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2025-01-14 10:07 ` Miklos Szeredi 2025-01-14 18:07 ` Joanne Koong @ 2025-01-14 20:51 ` Joanne Koong 2025-01-24 12:25 ` David Hildenbrand 1 sibling, 1 reply; 124+ messages in thread From: Joanne Koong @ 2025-01-14 20:51 UTC (permalink / raw) To: Miklos Szeredi Cc: Bernd Schubert, Jeff Layton, David Hildenbrand, Shakeel Butt, Zi Yan, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On Tue, Jan 14, 2025 at 2:07 AM Miklos Szeredi <miklos@szeredi.hu> wrote: > > On Tue, 14 Jan 2025 at 10:55, Bernd Schubert <bernd.schubert@fastmail.fm> wrote: > > > > > > > > On 1/14/25 10:40, Miklos Szeredi wrote: > > > On Tue, 14 Jan 2025 at 09:38, Miklos Szeredi <miklos@szeredi.hu> wrote: > > > > > >> Maybe an explicit callback from the migration code to the filesystem > > >> would work. I.e. move the complexity of dealing with migration for > > >> problematic filesystems (netfs/fuse) to the filesystem itself. I'm > > >> not sure how this would actually look, as I'm unfamiliar with the > > >> details of page migration, but I guess it shouldn't be too difficult > > >> to implement for fuse at least. > > > > > > Thinking a bit... > > > > > > 1) reading pages > > > > > > Pages are allocated (PG_locked set, PG_uptodate cleared) and passed to > > > ->readpages(), which may make the pages uptodate asynchronously. If a > > > page is unlocked but not set uptodate, then caller is supposed to > > > retry the reading, at least that's how I interpret > > > filemap_get_pages(). This means that it's fine to migrate the page > > > before it's actually filled with data, since the caller will retry. > > > > > > It also means that it would be sufficient to allocate the page itself > > > just before filling it in, if there was a mechanism to keep track of > > > these "not yet filled" pages. But that probably off topic. 
> > > > With /dev/fuse buffer copies should be easy - just allocate the page > > on buffer copy, control is in libfuse. > > I think the issue is with generic page cache code, which currently > relies on the PG_locked flag on the allocated but not yet filled page. > If the generic code would be able to keep track of "under > construction" ranges without relying on an allocated page, then the > filesystem could allocate the page just before copying the data, > insert the page into the cache mark the relevant portion of the file > uptodate. > > > With splice you really need > > a page state. > > It's not possible to splice a not-uptodate page. > > > I wrote this before already - what is the advantage of a tmp page copy > > over /dev/fuse buffer copy? I.e. I wonder if we need splice at all here. > > Splice seems a dead end, but we probably need to continue supporting > it for a while for backward compatibility. For the splice case, could we do something like this or is this too invasive?: * in mm, add a flag that marks a page as either being in migration or temporarily blocking migration * in splice, when we have to access the page in the pipe buffer, check if that flag is set and wait for the migration to complete before proceeding * in splice, set that flag while it's accessing the page, which will only temporarily block migration (eg for the duration of the memcpy) I guess this is basically what the page lock is for, but with less overhead? I need to look more at the splice code to see how it works, but something like this would allow us to cancel writeback on spliced pages that have already been sent to userspace if the request is taking too long, and migration would never get stalled. Though I guess the flag would be pretty specific only to the migration use case, which might be a waste of a bit. Thanks, Joanne > > Thanks, > Miklos ^ permalink raw reply [flat|nested] 124+ messages in thread
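The flag idea proposed above can be modelled in userspace as a tiny try-lock shared by migration and splice. This is only a sketch under stated assumptions: `page_model`, `page_try_claim()`, `splice_copy()` and `migration_begin()`/`migration_end()` are hypothetical names, not kernel APIs, and a real implementation would sleep on the flag rather than return false to the caller.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model: one bit meaning "in migration OR temporarily
 * blocking migration", as in the proposal. */
struct page_model {
	atomic_bool busy;
};

/* Returns true if we now own the flag, false if the other side holds it. */
static bool page_try_claim(struct page_model *p)
{
	return !atomic_exchange_explicit(&p->busy, true, memory_order_acquire);
}

static void page_release(struct page_model *p)
{
	atomic_store_explicit(&p->busy, false, memory_order_release);
}

/* Splice side: block migration only for the duration of the memcpy. */
static bool splice_copy(struct page_model *p, char *dst, const char *src,
			size_t len)
{
	assert(dst && src);
	if (!page_try_claim(p))
		return false;	/* page is migrating: caller waits and retries */
	memcpy(dst, src, len);
	page_release(p);
	return true;
}

/* Migration side: skip or retry later if splice is touching the page. */
static bool migration_begin(struct page_model *p)
{
	return page_try_claim(p);
}

static void migration_end(struct page_model *p)
{
	page_release(p);
}
```

As the proposal itself notes, this is essentially what the page lock already provides; the model only shows that neither side can hold the page for longer than its own short critical section.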
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2025-01-14 20:51 ` Joanne Koong @ 2025-01-24 12:25 ` David Hildenbrand 0 siblings, 0 replies; 124+ messages in thread From: David Hildenbrand @ 2025-01-24 12:25 UTC (permalink / raw) To: Joanne Koong, Miklos Szeredi Cc: Bernd Schubert, Jeff Layton, Shakeel Butt, Zi Yan, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On 14.01.25 21:51, Joanne Koong wrote: > On Tue, Jan 14, 2025 at 2:07 AM Miklos Szeredi <miklos@szeredi.hu> wrote: >> >> On Tue, 14 Jan 2025 at 10:55, Bernd Schubert <bernd.schubert@fastmail.fm> wrote: >>> >>> >>> >>> On 1/14/25 10:40, Miklos Szeredi wrote: >>>> On Tue, 14 Jan 2025 at 09:38, Miklos Szeredi <miklos@szeredi.hu> wrote: >>>> >>>>> Maybe an explicit callback from the migration code to the filesystem >>>>> would work. I.e. move the complexity of dealing with migration for >>>>> problematic filesystems (netfs/fuse) to the filesystem itself. I'm >>>>> not sure how this would actually look, as I'm unfamiliar with the >>>>> details of page migration, but I guess it shouldn't be too difficult >>>>> to implement for fuse at least. >>>> >>>> Thinking a bit... >>>> >>>> 1) reading pages >>>> >>>> Pages are allocated (PG_locked set, PG_uptodate cleared) and passed to >>>> ->readpages(), which may make the pages uptodate asynchronously. If a >>>> page is unlocked but not set uptodate, then caller is supposed to >>>> retry the reading, at least that's how I interpret >>>> filemap_get_pages(). This means that it's fine to migrate the page >>>> before it's actually filled with data, since the caller will retry. >>>> >>>> It also means that it would be sufficient to allocate the page itself >>>> just before filling it in, if there was a mechanism to keep track of >>>> these "not yet filled" pages. But that probably off topic. 
>>> >>> With /dev/fuse buffer copies should be easy - just allocate the page >>> on buffer copy, control is in libfuse. >> >> I think the issue is with generic page cache code, which currently >> relies on the PG_locked flag on the allocated but not yet filled page. >> If the generic code would be able to keep track of "under >> construction" ranges without relying on an allocated page, then the >> filesystem could allocate the page just before copying the data, >> insert the page into the cache mark the relevant portion of the file >> uptodate. >> >>> With splice you really need >>> a page state. >> >> It's not possible to splice a not-uptodate page. >> >>> I wrote this before already - what is the advantage of a tmp page copy >>> over /dev/fuse buffer copy? I.e. I wonder if we need splice at all here. >> >> Splice seems a dead end, but we probably need to continue supporting >> it for a while for backward compatibility. > > For the splice case, could we do something like this or is this too invasive?: > * in mm, add a flag that marks a page as either being in migration or > temporarily blocking migration > * in splice, when we have to access the page in the pipe buffer, check > if that flag is set and wait for the migration to complete before > proceeding > * in splice, set that flag while it's accessing the page, which will > only temporarily block migration (eg for the duration of the memcpy) > > I guess this is basically what the page lock is for, but with less overhead? Yes, the folio lock kind-of behaves that way. One problem might be, that while the page is spliced that there is a raised refcount on the page: migration cannot make progress if there are unknown references. -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 124+ messages in thread
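The refcount concern raised above can be illustrated with a small userspace model. It assumes, as a simplification, that migration proceeds only when the reference count matches the count it can account for and otherwise asks the caller to retry; `folio_model`, `splice_get()`, `splice_put()` and `try_migrate()` are hypothetical names for this sketch.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical model of a folio's reference accounting. */
struct folio_model {
	int refcount;		/* all references currently held */
	int expected_refs;	/* references migration knows how to account for */
};

/* A spliced page sitting in a pipe buffer holds an extra reference. */
static void splice_get(struct folio_model *f)
{
	f->refcount++;
}

static void splice_put(struct folio_model *f)
{
	assert(f->refcount > f->expected_refs);
	f->refcount--;
}

/* Migration cannot make progress while unknown references exist. */
static int try_migrate(const struct folio_model *f)
{
	return f->refcount == f->expected_refs ? 0 : -EAGAIN;
}
```

In this model a flag alone does not help: as long as the pipe buffer holds its reference, `try_migrate()` keeps returning `-EAGAIN`, however short the actual copy is.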
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2025-01-14 9:40 ` Miklos Szeredi 2025-01-14 9:55 ` Bernd Schubert @ 2025-01-14 15:49 ` Jeff Layton 2025-01-24 12:29 ` David Hildenbrand 1 sibling, 1 reply; 124+ messages in thread From: Jeff Layton @ 2025-01-14 15:49 UTC (permalink / raw) To: Miklos Szeredi Cc: David Hildenbrand, Shakeel Butt, Joanne Koong, Bernd Schubert, Zi Yan, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On Tue, 2025-01-14 at 10:40 +0100, Miklos Szeredi wrote: > On Tue, 14 Jan 2025 at 09:38, Miklos Szeredi <miklos@szeredi.hu> wrote: > > > Maybe an explicit callback from the migration code to the filesystem > > would work. I.e. move the complexity of dealing with migration for > > problematic filesystems (netfs/fuse) to the filesystem itself. I'm > > not sure how this would actually look, as I'm unfamiliar with the > > details of page migration, but I guess it shouldn't be too difficult > > to implement for fuse at least. > > Thinking a bit... > > 1) reading pages > > Pages are allocated (PG_locked set, PG_uptodate cleared) and passed to > ->readpages(), which may make the pages uptodate asynchronously. If a > page is unlocked but not set uptodate, then caller is supposed to > retry the reading, at least that's how I interpret > filemap_get_pages(). This means that it's fine to migrate the page > before it's actually filled with data, since the caller will retry. > > It also means that it would be sufficient to allocate the page itself > just before filling it in, if there was a mechanism to keep track of > these "not yet filled" pages. But that probably off topic. > Sounds plausible. > 2) writing pages > > When the page isn't actually being copied, the writeback could be > cancelled and the page redirtied. At which point it's fine to migrate > it. 
The problem is with pages that are spliced from /dev/fuse and
> control over when it's being accessed is lost.  Note: this is not
> actually done right now on cached pages, since writeback always copies
> to temp pages.  So we can continue to do that when doing a splice and
> not risk any performance regressions.
>

Can we just cancel and redirty the page like that when doing a
WB_SYNC_ALL flush? I think we'd need to ensure that it gets a new
writeback attempt as soon as the migration is done if that's in
progress, no?
--
Jeff Layton <jlayton@kernel.org>
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2025-01-14 15:49 ` Jeff Layton @ 2025-01-24 12:29 ` David Hildenbrand 2025-01-28 10:16 ` Miklos Szeredi 0 siblings, 1 reply; 124+ messages in thread From: David Hildenbrand @ 2025-01-24 12:29 UTC (permalink / raw) To: Jeff Layton, Miklos Szeredi Cc: Shakeel Butt, Joanne Koong, Bernd Schubert, Zi Yan, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On 14.01.25 16:49, Jeff Layton wrote: > On Tue, 2025-01-14 at 10:40 +0100, Miklos Szeredi wrote: >> On Tue, 14 Jan 2025 at 09:38, Miklos Szeredi <miklos@szeredi.hu> wrote: >> >>> Maybe an explicit callback from the migration code to the filesystem >>> would work. I.e. move the complexity of dealing with migration for >>> problematic filesystems (netfs/fuse) to the filesystem itself. I'm >>> not sure how this would actually look, as I'm unfamiliar with the >>> details of page migration, but I guess it shouldn't be too difficult >>> to implement for fuse at least. >> >> Thinking a bit... >> >> 1) reading pages >> >> Pages are allocated (PG_locked set, PG_uptodate cleared) and passed to >> ->readpages(), which may make the pages uptodate asynchronously. If a >> page is unlocked but not set uptodate, then caller is supposed to >> retry the reading, at least that's how I interpret >> filemap_get_pages(). This means that it's fine to migrate the page >> before it's actually filled with data, since the caller will retry. >> >> It also means that it would be sufficient to allocate the page itself >> just before filling it in, if there was a mechanism to keep track of >> these "not yet filled" pages. But that probably off topic. >> > > Sounds plausible. > >> 2) writing pages >> >> When the page isn't actually being copied, the writeback could be >> cancelled and the page redirtied. At which point it's fine to migrate >> it. 
The problem is with pages that are spliced from /dev/fuse and
>> control over when it's being accessed is lost.  Note: this is not
>> actually done right now on cached pages, since writeback always copies
>> to temp pages.  So we can continue to do that when doing a splice and
>> not risk any performance regressions.
>>
>
> Can we just cancel and redirty the page like that when doing a
> WB_SYNC_ALL flush? I think we'd need to ensure that it gets a new
> writeback attempt as soon as the migration is done if that's in
> progress, no?

Yeah, that was one of my initial questions as well: could one
"transparently" (to user space) handle canceling writeback and simply
re-dirty the page.

--
Cheers,

David / dhildenb
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2025-01-24 12:29 ` David Hildenbrand
@ 2025-01-28 10:16 ` Miklos Szeredi
  0 siblings, 0 replies; 124+ messages in thread
From: Miklos Szeredi @ 2025-01-28 10:16 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Jeff Layton, Shakeel Butt, Joanne Koong, Bernd Schubert, Zi Yan,
	linux-fsdevel, jefflexu, josef, linux-mm, kernel-team,
	Matthew Wilcox, Oscar Salvador, Michal Hocko

On Fri, 24 Jan 2025 at 13:29, David Hildenbrand <david@redhat.com> wrote:

> Yeah, that was one of my initial questions as well: could one
> "transparently" (to user space) handle canceling writeback and simply
> re-dirty the page.

1) WRITE request is not yet dequeued by userspace: the writeback can be
cancelled.

2/a) WRITE request is dequeued (copied) to userspace: the page can be
reused, but the writeback isn't yet complete.  Calling
folio_end_writeback() is lying in the same sense that it's lying with
temp pages.

2/b) WRITE request is dequeued (spliced) to userspace: the page is
referenced indefinitely (could even be after the writeback completes).
A temp page could be allocated at splice time, which means performance
will be no better than with current temp page writeback, but at least
it will be less complex.

3) WRITE request is currently being copied to userspace: this should
normally be short, but userspace can be nasty and have the buffer be
an mmap of another fuse file, and make the copy hang in the middle by
triggering a page fault.  The request cannot be cancelled at this
point.  In such a case the "echo 1 > /sys/fs/fuse/connections/##/abort"
mechanism or the upcoming server timeout can be used to shut down the
filesystem.

So this is definitely more complicated than I'd like.

Thanks,
Miklos
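The four cases in the message above can be written down as a small decision table. This is a sketch only: the enum names and `wb_option_for()` are made up for illustration and do not correspond to existing FUSE code; they just encode which option is described as available in each request state.

```c
#include <assert.h>

/* Where a WRITE request is in its lifetime (cases 1, 3, 2/a, 2/b). */
enum write_req_state {
	REQ_QUEUED,	/* 1)   not yet dequeued by userspace */
	REQ_COPYING,	/* 3)   currently being copied to userspace */
	REQ_COPIED,	/* 2/a) dequeued, data copied to userspace */
	REQ_SPLICED,	/* 2/b) dequeued, page spliced to userspace */
};

/* What can be done about the page's writeback state. */
enum wb_option {
	WB_CANCEL,		/* writeback can be cancelled */
	WB_WAIT_OR_ABORT,	/* cannot cancel: wait, or abort/timeout */
	WB_END_EARLY,		/* page reusable, but ending writeback "lies" */
	WB_NEEDS_TEMP_PAGE,	/* page referenced indefinitely: temp page */
};

static enum wb_option wb_option_for(enum write_req_state s)
{
	switch (s) {
	case REQ_QUEUED:
		return WB_CANCEL;
	case REQ_COPYING:
		return WB_WAIT_OR_ABORT;
	case REQ_COPIED:
		return WB_END_EARLY;
	case REQ_SPLICED:
		return WB_NEEDS_TEMP_PAGE;
	}
	assert(!"unknown request state");
	return WB_WAIT_OR_ABORT;
}
```

Only the first state allows a clean cancel; every other state either misreports completion, needs a temp page, or depends on the abort or timeout mechanisms.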
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2025-01-14 8:38 ` Miklos Szeredi 2025-01-14 9:40 ` Miklos Szeredi @ 2025-01-14 15:44 ` Jeff Layton 2025-01-14 18:58 ` Joanne Koong 1 sibling, 1 reply; 124+ messages in thread From: Jeff Layton @ 2025-01-14 15:44 UTC (permalink / raw) To: Miklos Szeredi Cc: David Hildenbrand, Shakeel Butt, Joanne Koong, Bernd Schubert, Zi Yan, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On Tue, 2025-01-14 at 09:38 +0100, Miklos Szeredi wrote: > On Mon, 13 Jan 2025 at 22:44, Jeff Layton <jlayton@kernel.org> wrote: > > > What if we were to allow the kernel to kill off an unprivileged FUSE > > server that was "misbehaving" [1], clean any dirty pagecache pages that > > it has, and set writeback errors on the corresponding FUSE inodes [2]? > > We'd still need a rather long timeout (on the order of at least a > > minute or so, by default). > > How would this be different from Joanne's current request timeout patch? > When the timeout pops with Joanne's set, the pages still remain dirty (IIUC). The idea here would be that after a call times out and we've decided the server is "misbehaving", we'd want to clean the pages and mark the inode with a writeback error. That frees up the page to be migrated, but a later msync or fsync should return an error. This is the standard behavior for writeback errors on filesystems. > I think it makes sense, but it *has* to be opt in, for the same reason > that NFS soft timeout is opt in, so it can't really solve the page > migration issue generally. > Does it really need to be though? We're talking unprivileged mounts here. Imposing a hard timeout on reads or writes as a mechanism to limit resource consumption by an unprivileged user seems like a reasonable thing to do. Writeback errors suck, but what other recourse do we have in this situation? 
We could also consider only enforcing this when memory gets low, or a
migration has failed.

> Also page reading has exactly the same issues, so fixing writeback is
> not enough.
>

Reads are synchronous, so we could just return an error directly on
those.

> Maybe an explicit callback from the migration code to the filesystem
> would work.  I.e. move the complexity of dealing with migration for
> problematic filesystems (netfs/fuse) to the filesystem itself.  I'm
> not sure how this would actually look, as I'm unfamiliar with the
> details of page migration, but I guess it shouldn't be too difficult
> to implement for fuse at least.
>

We already have a ->migrate_folio operation. Maybe we could consider
pushing down the PG_writeback check into the ->migrate_folio ops? As an
initial step, we could just make them all return -EBUSY, and then allow
some (like FUSE) to handle the situation properly.
--
Jeff Layton <jlayton@kernel.org>
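The ->migrate_folio idea above could look roughly like the following userspace model. Everything here is an assumption for illustration: the callback signature is simplified (the real operation takes more arguments), and `folio_model`, `generic_migrate()` and `fuse_like_migrate()` are hypothetical names. The `cancellable` field stands in for whatever the filesystem knows about the in-flight request.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical folio state as seen by a migrate callback. */
struct folio_model {
	bool writeback;
	bool dirty;
	bool cancellable;	/* filesystem can still cancel the request */
};

typedef int (*migrate_fn)(struct folio_model *);

/* Initial step: every filesystem refuses folios under writeback. */
static int generic_migrate(struct folio_model *f)
{
	return f->writeback ? -EBUSY : 0;
}

/* A filesystem that can do better: cancel writeback and redirty. */
static int fuse_like_migrate(struct folio_model *f)
{
	if (f->writeback) {
		if (!f->cancellable)
			return -EBUSY;
		f->writeback = false;
		f->dirty = true;	/* gets a new writeback attempt later */
	}
	return 0;
}

/* Migration core: the writeback decision is left to the callback. */
static int migrate_folio_model(migrate_fn op, struct folio_model *f)
{
	assert(op && f);
	return op(f);
}
```

The point of the split is that the migration core no longer needs to know why a folio is under writeback; only the filesystem decides whether `-EBUSY` is avoidable.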
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2025-01-14 15:44 ` Jeff Layton @ 2025-01-14 18:58 ` Joanne Koong 0 siblings, 0 replies; 124+ messages in thread From: Joanne Koong @ 2025-01-14 18:58 UTC (permalink / raw) To: Jeff Layton Cc: Miklos Szeredi, David Hildenbrand, Shakeel Butt, Bernd Schubert, Zi Yan, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On Tue, Jan 14, 2025 at 7:44 AM Jeff Layton <jlayton@kernel.org> wrote: > > On Tue, 2025-01-14 at 09:38 +0100, Miklos Szeredi wrote: > > On Mon, 13 Jan 2025 at 22:44, Jeff Layton <jlayton@kernel.org> wrote: > > > > > What if we were to allow the kernel to kill off an unprivileged FUSE > > > server that was "misbehaving" [1], clean any dirty pagecache pages that > > > it has, and set writeback errors on the corresponding FUSE inodes [2]? > > > We'd still need a rather long timeout (on the order of at least a > > > minute or so, by default). > > > > How would this be different from Joanne's current request timeout patch? > > > > When the timeout pops with Joanne's set, the pages still remain dirty > (IIUC). The idea here would be that after a call times out and we've > decided the server is "misbehaving", we'd want to clean the pages and > mark the inode with a writeback error. That frees up the page to be > migrated, but a later msync or fsync should return an error. This is > the standard behavior for writeback errors on filesystems. I think the pages already get cleaned and the inode marked with an error in the case of a timeout. The timeout calls into the abort path, so the abort path should already be doing this. When the connection is aborted, fuse_request_end() will get invoked, which will call the req->args->end() callback which for writebacks will be fuse_writepage_end(). 
In fuse_writepage_end(), the inode->i_mapping gets set to the error code and the writeback state will be cleared on the folio as well (in fuse_writepage_finish()). > > > I think it makes sense, but it *has* to be opt in, for the same reason > > that NFS soft timeout is opt in, so it can't really solve the page > > migration issue generally. > > > > Does it really need to be though? We're talking unprivileged mounts > here. Imposing a hard timeout on reads or writes as a mechanism to > limit resource consumption by an unprivileged user seems like a > reasonable thing to do. Writeback errors suck, but what other recourse > do we have in this situation? > > We could also consider only enforcing this when memory gets low, or a > migration has failed. > I think there's a case to be made here that this "resource checking" of unprivileged mounts should be behavior that already exists (eg automatically enforcing timeouts instead of only by opt-in). The only issue with this I see is that it might potentially break backwards-compatibility, but I think it could be argued that protecting memory resources outweighs that. Though the timeout would have to be somewhat large, and I don't know if that would be acceptable for migration. Thanks, Joanne > > Also page reading has exactly the same issues, so fixing writeback is > > not enough. > > > > Reads are synchronous, so we could just return an error directly on > those. > > > Maybe an explicit callback from the migration code to the filesystem > > would work. I.e. move the complexity of dealing with migration for > > problematic filesystems (netfs/fuse) to the filesystem itself. I'm > > not sure how this would actually look, as I'm unfamiliar with the > > details of page migration, but I guess it shouldn't be too difficult > > to implement for fuse at least. > > > > We already have a ->migrate_folio operation. Maybe we could consider > pushing down the PG_writeback check into the ->migrate_folio ops? 
As an
> initial step, we could just make them all return -EBUSY, and then allow
> some (like FUSE) to handle the situation properly.
> --
> Jeff Layton <jlayton@kernel.org>
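The abort path described earlier in this message (timeout, then abort, then the per-request end callback, which records the error and clears writeback) can be modelled in a few lines of userspace C. The `*_model` names are hypothetical stand-ins for the functions mentioned in the message, and the use of -ECONNABORTED as the abort error is an assumption of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct mapping_model {
	int wb_err;		/* writeback error seen by a later fsync/msync */
};

struct folio_model {
	bool writeback;
	struct mapping_model *mapping;
};

struct req_model {
	struct folio_model *folio;
	void (*end)(struct req_model *req, int error);
};

/* Stand-in for the writeback end callback: record error, clear writeback. */
static void writepage_end_model(struct req_model *req, int error)
{
	assert(req->folio && req->folio->mapping);
	if (error)
		req->folio->mapping->wb_err = error;
	req->folio->writeback = false;
}

/* Stand-in for request completion: invoke the request's end callback. */
static void request_end_model(struct req_model *req, int error)
{
	if (req->end)
		req->end(req, error);
}

/* Stand-in for connection abort: fail every outstanding request. */
static void abort_conn_model(struct req_model *reqs, size_t n)
{
	for (size_t i = 0; i < n; i++)
		request_end_model(&reqs[i], -ECONNABORTED);
}
```

After the abort, the folio is no longer under writeback and the mapping carries an error, which matches the behavior asked for earlier in the thread.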
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2025-01-10 21:13 ` David Hildenbrand 2025-01-10 22:00 ` Shakeel Butt @ 2025-01-10 23:11 ` Jeff Layton 1 sibling, 0 replies; 124+ messages in thread From: Jeff Layton @ 2025-01-10 23:11 UTC (permalink / raw) To: David Hildenbrand, Shakeel Butt Cc: Miklos Szeredi, Joanne Koong, Bernd Schubert, Zi Yan, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On Fri, 2025-01-10 at 22:13 +0100, David Hildenbrand wrote: > On 10.01.25 21:28, Jeff Layton wrote: > > On Thu, 2025-01-09 at 12:22 +0100, David Hildenbrand wrote: > > > On 07.01.25 19:07, Shakeel Butt wrote: > > > > On Tue, Jan 07, 2025 at 09:34:49AM +0100, David Hildenbrand wrote: > > > > > On 06.01.25 19:17, Shakeel Butt wrote: > > > > > > On Mon, Jan 06, 2025 at 11:19:42AM +0100, Miklos Szeredi wrote: > > > > > > > On Fri, 3 Jan 2025 at 21:31, David Hildenbrand <david@redhat.com> wrote: > > > > > > > > In any case, having movable pages be turned unmovable due to persistent > > > > > > > > writaback is something that must be fixed, not worked around. Likely a > > > > > > > > good topic for LSF/MM. > > > > > > > > > > > > > > Yes, this seems a good cross fs-mm topic. > > > > > > > > > > > > > > So the issue discussed here is that movable pages used for fuse > > > > > > > page-cache cause a problems when memory needs to be compacted. The > > > > > > > problem is either that > > > > > > > > > > > > > > - the page is skipped, leaving the physical memory block unmovable > > > > > > > > > > > > > > - the compaction is blocked for an unbounded time > > > > > > > > > > > > > > While the new AS_WRITEBACK_INDETERMINATE could potentially make things > > > > > > > worse, the same thing happens on readahead, since the new page can be > > > > > > > locked for an indeterminate amount of time, which can also block > > > > > > > compaction, right? 
> > > > > > > > > > Yes, as memory hotplug + virtio-mem maintainer my bigger concern is these > > > > > pages residing in ZONE_MOVABLE / MIGRATE_CMA areas where there *must not be > > > > > unmovable pages ever*. Not triggered by an untrusted source, not triggered > > > > > by an trusted source. > > > > > > > > > > It's a violation of core-mm principles. > > > > > > > > The "must not be unmovable pages ever" is a very strong statement and we > > > > are violating it today and will keep violating it in future. Any > > > > page/folio under lock or writeback or have reference taken or have been > > > > isolated from their LRU is unmovable (most of the time for small period > > > > of time). > > > > > > ^ this: "small period of time" is what I meant. > > > > > > Most of these things are known to not be problematic: retrying a couple > > > of times makes it work, that's why migration keeps retrying. > > > > > > Again, as an example, we allow short-term O_DIRECT but disallow > > > long-term page pinning. I think there were concerns at some point if > > > O_DIRECT might also be problematic (I/O might take a while), but so far > > > it was not a problem in practice that would make CMA allocations easily > > > fail. > > > > > > vmsplice() is a known problem, because it behaves like O_DIRECT but > > > actually triggers long-term pinning; IIRC David Howells has this on his > > > todo list to fix. [I recall that seccomp disallows vmsplice by default > > > right now] > > > > > > These operations are being done all over the place in kernel. > > > > Miklos gave an example of readahead. > > > > > > I assume you mean "unmovable for a short time", correct, or can you > > > point me at that specific example; I think I missed that. > > > > > > > The per-CPU LRU caches are another > > > > case where folios can get stuck for long period of time. > > > > > > Which is why memory offlining disables the lru cache. See > > > lru_cache_disable(). 
Other users that care about that drain the LRU on
> > > all cpus.
> > >
> > > > Reclaim and
> > > > compaction can isolate a lot of folios that they need to have
> > > > too_many_isolated() checks. So, "must not be unmovable pages ever" is
> > > > impractical.
> > >
> > > "must only be short-term unmovable", better?
> > >
> >
> > Still a little ambiguous.
> >
> > How short is "short-term"? Are we talking milliseconds or minutes?
>
> Usually a couple of seconds, max. For memory offlining, slightly longer
> times are acceptable; other things (in particular compaction or CMA
> allocations) will give up much faster.
>
> >
> > Imposing a hard timeout on writeback requests to unprivileged FUSE
> > servers might give us a better guarantee of forward-progress, but it
> > would probably have to be on the order of at least a minute or so to be
> > workable.
>
> Yes, and that might already be a bit too much, especially if stuck on
> waiting for folio writeback ... so ideally we could find a way to
> migrate these folios that are under writeback and it's not your ordinary
> disk driver that responds rather quickly.
>

That would be ideal I think. One thought:

In practice, a lot of these writeback handlers use the folio up front
and then don't need to touch it again afterward until the reply comes
in and they clear the writeback bit.

Maybe we could add a mechanism where the writeback handlers could mark
the folio as being movable after the first phase was done? When the
reply comes in, they would clear that mark and check whether it's been
moved in the interim, and fix up the appropriate pointers if so?

Implementing that sounds a bit complex though since it's effectively a
new locking scheme.

> Right now we do it via these temp pages, and I can see how that's
> undesirable.
>
> For NFS etc.
we probably never ran into this, because it's all used in > fairly well managed environments and, well, I assume NFS easily outdates > CMA and ZONE_MOVABLE :) > > > >>> > > > > The point is that, yes we should aim to improve things but in iterations > > > > and "must not be unmovable pages ever" is not something we can achieve > > > > in one step. > > > > > > I agree with the "improve things in iterations", but as > > > AS_WRITEBACK_INDETERMINATE has the FOLL_LONGTERM smell to it, I think we > > > are making things worse. > > > > > > And as this discussion has been going on for too long, to summarize my > > > point: there exist conditions where pages are short-term unmovable, and > > > possibly some to be fixed that turn pages long-term unmovable (e.g., > > > vmsplice); that does not mean that we can freely add new conditions that > > > turn movable pages unmovable long-term or even forever. > > > > > > Again, this might be a good LSF/MM topic. If I would have the capacity I > > > would suggest a topic around which things are know to cause pages to be > > > short-term or long-term unmovable/unsplittable, and which can be > > > handled, which not. Maybe I'll find the time to propose that as a topic. > > > > > > > > > This does sound like great LSF/MM fodder! I predict that this session > > will run long! ;) > > Heh, fully agreed! :) > -- Jeff Layton <jlayton@kernel.org> ^ permalink raw reply [flat|nested] 124+ messages in thread
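The "mark the folio movable after the first phase" idea from this message can be sketched as a userspace model. All names are hypothetical (`wb_movable`, `moved_to`, `wb_first_phase_done()`, `wb_reply()`), and the sketch ignores the locking that makes the real problem hard; it only shows the pointer fix-up on reply.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct folio_model {
	bool writeback;
	bool wb_movable;		/* handler is done touching the data */
	struct folio_model *moved_to;	/* set if migrated during writeback */
};

struct wb_req_model {
	struct folio_model *folio;
};

/* The writeback handler has consumed the data: allow migration. */
static void wb_first_phase_done(struct wb_req_model *req)
{
	assert(req->folio->writeback);
	req->folio->wb_movable = true;
}

/* Migration: refuse folios under writeback unless marked movable. */
static int migrate_model(struct folio_model *src, struct folio_model *dst)
{
	if (src->writeback && !src->wb_movable)
		return -EBUSY;
	*dst = *src;		/* new folio inherits the writeback state */
	dst->moved_to = NULL;
	src->moved_to = dst;
	return 0;
}

/* Reply arrives: follow any moves, fix up the pointer, end writeback. */
static void wb_reply(struct wb_req_model *req)
{
	struct folio_model *f = req->folio;

	while (f->moved_to)
		f = f->moved_to;
	req->folio = f;
	f->wb_movable = false;
	f->writeback = false;
}
```

Even in this toy form the cost is visible: the reply path has to cope with the folio changing identity underneath it, which is the new locking scheme the message warns about.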
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2025-01-07 8:34 ` David Hildenbrand 2025-01-07 18:07 ` Shakeel Butt @ 2025-01-10 20:16 ` Jeff Layton 2025-01-10 20:20 ` David Hildenbrand 1 sibling, 1 reply; 124+ messages in thread From: Jeff Layton @ 2025-01-10 20:16 UTC (permalink / raw) To: David Hildenbrand, Shakeel Butt, Miklos Szeredi Cc: Joanne Koong, Bernd Schubert, Zi Yan, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On Tue, 2025-01-07 at 09:34 +0100, David Hildenbrand wrote: > On 06.01.25 19:17, Shakeel Butt wrote: > > On Mon, Jan 06, 2025 at 11:19:42AM +0100, Miklos Szeredi wrote: > > > On Fri, 3 Jan 2025 at 21:31, David Hildenbrand <david@redhat.com> wrote: > > > > In any case, having movable pages be turned unmovable due to persistent > > > > writaback is something that must be fixed, not worked around. Likely a > > > > good topic for LSF/MM. > > > > > > Yes, this seems a good cross fs-mm topic. > > > > > > So the issue discussed here is that movable pages used for fuse > > > page-cache cause a problems when memory needs to be compacted. The > > > problem is either that > > > > > > - the page is skipped, leaving the physical memory block unmovable > > > > > > - the compaction is blocked for an unbounded time > > > > > > While the new AS_WRITEBACK_INDETERMINATE could potentially make things > > > worse, the same thing happens on readahead, since the new page can be > > > locked for an indeterminate amount of time, which can also block > > > compaction, right? > > Yes, as memory hotplug + virtio-mem maintainer my bigger concern is > these pages residing in ZONE_MOVABLE / MIGRATE_CMA areas where there > *must not be unmovable pages ever*. Not triggered by an untrusted > source, not triggered by an trusted source. > > It's a violation of core-mm principles. 
> > Even if we have a timeout of 60s, making things like alloc_contig_page() > wait for that long on writeback is broken and needs to be fixed. > > And the fix is not to skip these pages, that's a workaround. > > I'm hoping I can find an easy way to trigger this also with NFS. > I imagine that you can just open a file and start writing to it, pull the plug on the NFS server, and then issue a fsync or something to ensure some writeback occurs. Any dirty pagecache folios should be stuck in writeback at that point. The NFS client is also very patient about waiting for the server to come back, so it should stay that way indefinitely. > > > > Yes locked pages are unmovable. How much of these locked pages/folios > > can be caused by untrusted fuse server? > > >> > > > What about explicitly opting fuse cache pages out of compaction by > > > allocating them form ZONE_UNMOVABLE? > > > > This can be done but it will change the memory condition of the > > users/workloads/systems where page cache is the majority of the memory > > (i.e. majority of memory will be unmovable) and when such systems are > > overcommitted, weird corner cases will arise (failing high order > > allocations, long term fragmentation etc). In addition the memory > > behind CXL will become unusable for fuse folios. > > Yes. > > > > > IMHO the transient unmovable state of fuse folios due to writeback is > > not an issue if we can show that untrusted fuse server can not cause > > unlimited folios under writeback for arbitrary long time. > > See above, I disagree. > -- Jeff Layton <jlayton@kernel.org> ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2025-01-10 20:16 ` Jeff Layton @ 2025-01-10 20:20 ` David Hildenbrand 2025-01-10 20:43 ` Jeff Layton 0 siblings, 1 reply; 124+ messages in thread From: David Hildenbrand @ 2025-01-10 20:20 UTC (permalink / raw) To: Jeff Layton, Shakeel Butt, Miklos Szeredi Cc: Joanne Koong, Bernd Schubert, Zi Yan, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On 10.01.25 21:16, Jeff Layton wrote: > On Tue, 2025-01-07 at 09:34 +0100, David Hildenbrand wrote: >> On 06.01.25 19:17, Shakeel Butt wrote: >>> On Mon, Jan 06, 2025 at 11:19:42AM +0100, Miklos Szeredi wrote: >>>> On Fri, 3 Jan 2025 at 21:31, David Hildenbrand <david@redhat.com> wrote: >>>>> In any case, having movable pages be turned unmovable due to persistent >>>>> writaback is something that must be fixed, not worked around. Likely a >>>>> good topic for LSF/MM. >>>> >>>> Yes, this seems a good cross fs-mm topic. >>>> >>>> So the issue discussed here is that movable pages used for fuse >>>> page-cache cause a problems when memory needs to be compacted. The >>>> problem is either that >>>> >>>> - the page is skipped, leaving the physical memory block unmovable >>>> >>>> - the compaction is blocked for an unbounded time >>>> >>>> While the new AS_WRITEBACK_INDETERMINATE could potentially make things >>>> worse, the same thing happens on readahead, since the new page can be >>>> locked for an indeterminate amount of time, which can also block >>>> compaction, right? >> >> Yes, as memory hotplug + virtio-mem maintainer my bigger concern is >> these pages residing in ZONE_MOVABLE / MIGRATE_CMA areas where there >> *must not be unmovable pages ever*. Not triggered by an untrusted >> source, not triggered by an trusted source. >> >> It's a violation of core-mm principles. 
>>
>> Even if we have a timeout of 60s, making things like alloc_contig_page()
>> wait for that long on writeback is broken and needs to be fixed.
>>
>> And the fix is not to skip these pages, that's a workaround.
>>
>> I'm hoping I can find an easy way to trigger this also with NFS.
>>
>
> I imagine that you can just open a file and start writing to it, pull
> the plug on the NFS server, and then issue a fsync or something to
> ensure some writeback occurs.

Yes, that's the plan, thanks!

>
> Any dirty pagecache folios should be stuck in writeback at that point.
> The NFS client is also very patient about waiting for the server to
> come back, so it should stay that way indefinitely.

Yes, however the default timeout for UDP is fairly small (for TCP
certainly much longer). So one thing I'd like to understand is what that
"cancel writeback -> redirty folio" on timeout does, and when it
actually triggers with TCP vs UDP timeouts.

--
Cheers,

David / dhildenb
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2025-01-10 20:20 ` David Hildenbrand @ 2025-01-10 20:43 ` Jeff Layton 2025-01-10 21:00 ` David Hildenbrand 0 siblings, 1 reply; 124+ messages in thread From: Jeff Layton @ 2025-01-10 20:43 UTC (permalink / raw) To: David Hildenbrand, Shakeel Butt, Miklos Szeredi Cc: Joanne Koong, Bernd Schubert, Zi Yan, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On Fri, 2025-01-10 at 21:20 +0100, David Hildenbrand wrote: > On 10.01.25 21:16, Jeff Layton wrote: > > On Tue, 2025-01-07 at 09:34 +0100, David Hildenbrand wrote: > > > On 06.01.25 19:17, Shakeel Butt wrote: > > > > On Mon, Jan 06, 2025 at 11:19:42AM +0100, Miklos Szeredi wrote: > > > > > On Fri, 3 Jan 2025 at 21:31, David Hildenbrand <david@redhat.com> wrote: > > > > > > In any case, having movable pages be turned unmovable due to persistent > > > > > > writaback is something that must be fixed, not worked around. Likely a > > > > > > good topic for LSF/MM. > > > > > > > > > > Yes, this seems a good cross fs-mm topic. > > > > > > > > > > So the issue discussed here is that movable pages used for fuse > > > > > page-cache cause a problems when memory needs to be compacted. The > > > > > problem is either that > > > > > > > > > > - the page is skipped, leaving the physical memory block unmovable > > > > > > > > > > - the compaction is blocked for an unbounded time > > > > > > > > > > While the new AS_WRITEBACK_INDETERMINATE could potentially make things > > > > > worse, the same thing happens on readahead, since the new page can be > > > > > locked for an indeterminate amount of time, which can also block > > > > > compaction, right? > > > > > > Yes, as memory hotplug + virtio-mem maintainer my bigger concern is > > > these pages residing in ZONE_MOVABLE / MIGRATE_CMA areas where there > > > *must not be unmovable pages ever*. 
Not triggered by an untrusted > > > source, not triggered by an trusted source. > > > > > > It's a violation of core-mm principles. > > > > > > Even if we have a timeout of 60s, making things like alloc_contig_page() > > > wait for that long on writeback is broken and needs to be fixed. > > > > > > And the fix is not to skip these pages, that's a workaround. > > > > > > I'm hoping I can find an easy way to trigger this also with NFS. > > > > > > > I imagine that you can just open a file and start writing to it, pull > > the plug on the NFS server, and then issue a fsync or something to > > ensure some writeback occurs. > > Yes, that's the plan, thanks! > > > > > Any dirty pagecache folios should be stuck in writeback at that point. > > The NFS client is also very patient about waiting for the server to > > come back, so it should stay that way indefinitely. > > Yes, however the default timeout for UDP is fairly small (for TCP > certainly much longer). So one thing I'd like to understand what that > "cancel writeback -> redirty folio" on timeout does, and when it > actually triggers with TCP vs UDP timeouts. > The lifetime of the pagecache pages is not at all related to the socket lifetimes. IOW, the client can completely lose the connection to the server and the page will just stay dirty until the connection can be reestablished and the server responds. The exception here is if you mount with "-o soft" in which case, an RPC request will time out with an error after a major RPC timeout (usually after a minute or so). See nfs(5) for the gory details of timeouts and retransmission. The default is "-o hard" since that's necessary for data-integrity in the face of spotty network connections. Once a soft mount has a writeback RPC time out, the folio is marked clean and a writeback error is set on the mapping, so that fsync() will return an error. -- Jeff Layton <jlayton@kernel.org> ^ permalink raw reply [flat|nested] 124+ messages in thread
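The soft-mount behaviour described above (a writeback error is recorded on the mapping and later reported by fsync()) can be observed from userspace roughly as follows. This is only a minimal sketch, not code from the thread: write_and_fsync() is a hypothetical helper name, and the path passed in is assumed to live on the mount being tested. On a "-o hard" mount with an unreachable server, the fsync() call would be expected to block instead of returning.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper (name chosen for illustration only): write len bytes
 * from buf to path, then ask the kernel to write the dirty pages back.
 * Returns 0 on success or a negative errno value. A writeback error that
 * was recorded on the mapping is reported by fsync().
 */
static int write_and_fsync(const char *path, const char *buf, size_t len)
{
	int ret = 0;
	int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);

	if (fd < 0)
		return -errno;

	while (len > 0) {
		ssize_t n = write(fd, buf, len);

		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, retry */
			ret = -errno;
			goto out;
		}
		buf += n;
		len -= (size_t)n;
	}

	/* Dirty folios go under writeback here; errors surface as -errno. */
	if (fsync(fd) < 0)
		ret = -errno;
out:
	if (close(fd) < 0 && ret == 0)
		ret = -errno;
	return ret;
}
```

A caller would treat any non-zero return as "data may not have reached the server", which is the signal a soft mount gives after a major RPC timeout.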
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2025-01-10 20:43 ` Jeff Layton @ 2025-01-10 21:00 ` David Hildenbrand 2025-01-10 21:07 ` Jeff Layton 0 siblings, 1 reply; 124+ messages in thread From: David Hildenbrand @ 2025-01-10 21:00 UTC (permalink / raw) To: Jeff Layton, Shakeel Butt, Miklos Szeredi Cc: Joanne Koong, Bernd Schubert, Zi Yan, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On 10.01.25 21:43, Jeff Layton wrote: > On Fri, 2025-01-10 at 21:20 +0100, David Hildenbrand wrote: >> On 10.01.25 21:16, Jeff Layton wrote: >>> On Tue, 2025-01-07 at 09:34 +0100, David Hildenbrand wrote: >>>> On 06.01.25 19:17, Shakeel Butt wrote: >>>>> On Mon, Jan 06, 2025 at 11:19:42AM +0100, Miklos Szeredi wrote: >>>>>> On Fri, 3 Jan 2025 at 21:31, David Hildenbrand <david@redhat.com> wrote: >>>>>>> In any case, having movable pages be turned unmovable due to persistent >>>>>>> writaback is something that must be fixed, not worked around. Likely a >>>>>>> good topic for LSF/MM. >>>>>> >>>>>> Yes, this seems a good cross fs-mm topic. >>>>>> >>>>>> So the issue discussed here is that movable pages used for fuse >>>>>> page-cache cause a problems when memory needs to be compacted. The >>>>>> problem is either that >>>>>> >>>>>> - the page is skipped, leaving the physical memory block unmovable >>>>>> >>>>>> - the compaction is blocked for an unbounded time >>>>>> >>>>>> While the new AS_WRITEBACK_INDETERMINATE could potentially make things >>>>>> worse, the same thing happens on readahead, since the new page can be >>>>>> locked for an indeterminate amount of time, which can also block >>>>>> compaction, right? >>>> >>>> Yes, as memory hotplug + virtio-mem maintainer my bigger concern is >>>> these pages residing in ZONE_MOVABLE / MIGRATE_CMA areas where there >>>> *must not be unmovable pages ever*. 
Not triggered by an untrusted >>>> source, not triggered by an trusted source. >>>> >>>> It's a violation of core-mm principles. >>>> >>>> Even if we have a timeout of 60s, making things like alloc_contig_page() >>>> wait for that long on writeback is broken and needs to be fixed. >>>> >>>> And the fix is not to skip these pages, that's a workaround. >>>> >>>> I'm hoping I can find an easy way to trigger this also with NFS. >>>> >>> >>> I imagine that you can just open a file and start writing to it, pull >>> the plug on the NFS server, and then issue a fsync or something to >>> ensure some writeback occurs. >> >> Yes, that's the plan, thanks! >> >>> >>> Any dirty pagecache folios should be stuck in writeback at that point. >>> The NFS client is also very patient about waiting for the server to >>> come back, so it should stay that way indefinitely. >> >> Yes, however the default timeout for UDP is fairly small (for TCP >> certainly much longer). So one thing I'd like to understand what that >> "cancel writeback -> redirty folio" on timeout does, and when it >> actually triggers with TCP vs UDP timeouts. >> > > > The lifetime of the pagecache pages is not at all related to the socket > lifetimes. IOW, the client can completely lose the connection to the > server and the page will just stay dirty until the connection can be > reestablished and the server responds. Right. It cannot get reclaimed while that is the case. > > The exception here is if you mount with "-o soft" in which case, an RPC > request will time out with an error after a major RPC timeout (usually > after a minute or so). See nfs(5) for the gory details of timeouts and > retransmission. The default is "-o hard" since that's necessary for > data-integrity in the face of spotty network connections. > > Once a soft mount has a writeback RPC time out, the folio is marked > clean and a writeback error is set on the mapping, so that fsync() will > return an error. 

I assume that's the code I stumbled over in nfs_page_async_flush(),
where we end up calling folio_redirty_for_writepage() +
nfs_redirty_request(), unless we run into a fatal error; in that case,
we end up in nfs_write_error() where we set the mapping error and stop
writeback using nfs_page_end_writeback().

--
Cheers,

David / dhildenb
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2025-01-10 21:00 ` David Hildenbrand @ 2025-01-10 21:07 ` Jeff Layton 2025-01-10 21:21 ` David Hildenbrand 0 siblings, 1 reply; 124+ messages in thread From: Jeff Layton @ 2025-01-10 21:07 UTC (permalink / raw) To: David Hildenbrand, Shakeel Butt, Miklos Szeredi Cc: Joanne Koong, Bernd Schubert, Zi Yan, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On Fri, 2025-01-10 at 22:00 +0100, David Hildenbrand wrote: > On 10.01.25 21:43, Jeff Layton wrote: > > On Fri, 2025-01-10 at 21:20 +0100, David Hildenbrand wrote: > > > On 10.01.25 21:16, Jeff Layton wrote: > > > > On Tue, 2025-01-07 at 09:34 +0100, David Hildenbrand wrote: > > > > > On 06.01.25 19:17, Shakeel Butt wrote: > > > > > > On Mon, Jan 06, 2025 at 11:19:42AM +0100, Miklos Szeredi wrote: > > > > > > > On Fri, 3 Jan 2025 at 21:31, David Hildenbrand <david@redhat.com> wrote: > > > > > > > > In any case, having movable pages be turned unmovable due to persistent > > > > > > > > writaback is something that must be fixed, not worked around. Likely a > > > > > > > > good topic for LSF/MM. > > > > > > > > > > > > > > Yes, this seems a good cross fs-mm topic. > > > > > > > > > > > > > > So the issue discussed here is that movable pages used for fuse > > > > > > > page-cache cause a problems when memory needs to be compacted. The > > > > > > > problem is either that > > > > > > > > > > > > > > - the page is skipped, leaving the physical memory block unmovable > > > > > > > > > > > > > > - the compaction is blocked for an unbounded time > > > > > > > > > > > > > > While the new AS_WRITEBACK_INDETERMINATE could potentially make things > > > > > > > worse, the same thing happens on readahead, since the new page can be > > > > > > > locked for an indeterminate amount of time, which can also block > > > > > > > compaction, right? 
> > > > > > > > > > Yes, as memory hotplug + virtio-mem maintainer my bigger concern is > > > > > these pages residing in ZONE_MOVABLE / MIGRATE_CMA areas where there > > > > > *must not be unmovable pages ever*. Not triggered by an untrusted > > > > > source, not triggered by an trusted source. > > > > > > > > > > It's a violation of core-mm principles. > > > > > > > > > > Even if we have a timeout of 60s, making things like alloc_contig_page() > > > > > wait for that long on writeback is broken and needs to be fixed. > > > > > > > > > > And the fix is not to skip these pages, that's a workaround. > > > > > > > > > > I'm hoping I can find an easy way to trigger this also with NFS. > > > > > > > > > > > > > I imagine that you can just open a file and start writing to it, pull > > > > the plug on the NFS server, and then issue a fsync or something to > > > > ensure some writeback occurs. > > > > > > Yes, that's the plan, thanks! > > > > > > > > > > > Any dirty pagecache folios should be stuck in writeback at that point. > > > > The NFS client is also very patient about waiting for the server to > > > > come back, so it should stay that way indefinitely. > > > > > > Yes, however the default timeout for UDP is fairly small (for TCP > > > certainly much longer). So one thing I'd like to understand what that > > > "cancel writeback -> redirty folio" on timeout does, and when it > > > actually triggers with TCP vs UDP timeouts. > > > > > > > > > The lifetime of the pagecache pages is not at all related to the socket > > lifetimes. IOW, the client can completely lose the connection to the > > server and the page will just stay dirty until the connection can be > > reestablished and the server responds. > > Right. It cannot get reclaimed while that is the case. > > > > > The exception here is if you mount with "-o soft" in which case, an RPC > > request will time out with an error after a major RPC timeout (usually > > after a minute or so). 
See nfs(5) for the gory details of timeouts and
> > retransmission. The default is "-o hard" since that's necessary for
> > data-integrity in the face of spotty network connections.
> >
> > Once a soft mount has a writeback RPC time out, the folio is marked
> > clean and a writeback error is set on the mapping, so that fsync() will
> > return an error.
>
> I assume that's the code I stumbled over in nfs_page_async_flush(),
> where we end up calling folio_redirty_for_writepage() +
> nfs_redirty_request(), unless we run into a fatal error; in that case,
> we end up in nfs_write_error() where we set the mapping error and stop
> writeback using nfs_page_end_writeback().
>

Exactly.

The upshot is that you can dirty NFS pages that will sit in the
pagecache indefinitely, if you can disrupt the connection to the server
indefinitely. This is substantially the same in other netfs's too --
CIFS, Ceph, etc.

The big difference vs FUSE is that they don't allow unprivileged users
to mount arbitrary filesystems, so it's harder for an attacker to do
this with only a local unprivileged account to work with.

--
Jeff Layton <jlayton@kernel.org>
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2025-01-10 21:07 ` Jeff Layton @ 2025-01-10 21:21 ` David Hildenbrand 0 siblings, 0 replies; 124+ messages in thread From: David Hildenbrand @ 2025-01-10 21:21 UTC (permalink / raw) To: Jeff Layton, Shakeel Butt, Miklos Szeredi Cc: Joanne Koong, Bernd Schubert, Zi Yan, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On 10.01.25 22:07, Jeff Layton wrote: > On Fri, 2025-01-10 at 22:00 +0100, David Hildenbrand wrote: >> On 10.01.25 21:43, Jeff Layton wrote: >>> On Fri, 2025-01-10 at 21:20 +0100, David Hildenbrand wrote: >>>> On 10.01.25 21:16, Jeff Layton wrote: >>>>> On Tue, 2025-01-07 at 09:34 +0100, David Hildenbrand wrote: >>>>>> On 06.01.25 19:17, Shakeel Butt wrote: >>>>>>> On Mon, Jan 06, 2025 at 11:19:42AM +0100, Miklos Szeredi wrote: >>>>>>>> On Fri, 3 Jan 2025 at 21:31, David Hildenbrand <david@redhat.com> wrote: >>>>>>>>> In any case, having movable pages be turned unmovable due to persistent >>>>>>>>> writaback is something that must be fixed, not worked around. Likely a >>>>>>>>> good topic for LSF/MM. >>>>>>>> >>>>>>>> Yes, this seems a good cross fs-mm topic. >>>>>>>> >>>>>>>> So the issue discussed here is that movable pages used for fuse >>>>>>>> page-cache cause a problems when memory needs to be compacted. The >>>>>>>> problem is either that >>>>>>>> >>>>>>>> - the page is skipped, leaving the physical memory block unmovable >>>>>>>> >>>>>>>> - the compaction is blocked for an unbounded time >>>>>>>> >>>>>>>> While the new AS_WRITEBACK_INDETERMINATE could potentially make things >>>>>>>> worse, the same thing happens on readahead, since the new page can be >>>>>>>> locked for an indeterminate amount of time, which can also block >>>>>>>> compaction, right? 
>>>>>> >>>>>> Yes, as memory hotplug + virtio-mem maintainer my bigger concern is >>>>>> these pages residing in ZONE_MOVABLE / MIGRATE_CMA areas where there >>>>>> *must not be unmovable pages ever*. Not triggered by an untrusted >>>>>> source, not triggered by an trusted source. >>>>>> >>>>>> It's a violation of core-mm principles. >>>>>> >>>>>> Even if we have a timeout of 60s, making things like alloc_contig_page() >>>>>> wait for that long on writeback is broken and needs to be fixed. >>>>>> >>>>>> And the fix is not to skip these pages, that's a workaround. >>>>>> >>>>>> I'm hoping I can find an easy way to trigger this also with NFS. >>>>>> >>>>> >>>>> I imagine that you can just open a file and start writing to it, pull >>>>> the plug on the NFS server, and then issue a fsync or something to >>>>> ensure some writeback occurs. >>>> >>>> Yes, that's the plan, thanks! >>>> >>>>> >>>>> Any dirty pagecache folios should be stuck in writeback at that point. >>>>> The NFS client is also very patient about waiting for the server to >>>>> come back, so it should stay that way indefinitely. >>>> >>>> Yes, however the default timeout for UDP is fairly small (for TCP >>>> certainly much longer). So one thing I'd like to understand what that >>>> "cancel writeback -> redirty folio" on timeout does, and when it >>>> actually triggers with TCP vs UDP timeouts. >>>> >>> >>> >>> The lifetime of the pagecache pages is not at all related to the socket >>> lifetimes. IOW, the client can completely lose the connection to the >>> server and the page will just stay dirty until the connection can be >>> reestablished and the server responds. >> >> Right. It cannot get reclaimed while that is the case. >> >>> >>> The exception here is if you mount with "-o soft" in which case, an RPC >>> request will time out with an error after a major RPC timeout (usually >>> after a minute or so). See nfs(5) for the gory details of timeouts and >>> retransmission. 
The default is "-o hard" since that's necessary for
>>> data-integrity in the face of spotty network connections.
>>>
>>> Once a soft mount has a writeback RPC time out, the folio is marked
>>> clean and a writeback error is set on the mapping, so that fsync() will
>>> return an error.
>>
>> I assume that's the code I stumbled over in nfs_page_async_flush(),
>> where we end up calling folio_redirty_for_writepage() +
>> nfs_redirty_request(), unless we run into a fatal error; in that case,
>> we end up in nfs_write_error() where we set the mapping error and stop
>> writeback using nfs_page_end_writeback().
>>
>
> Exactly.
>
> The upshot is that you can dirty NFS pages that will sit in the
> pagecache indefinitely, if you can disrupt the connection to the server
> indefinitely. This is substantially the same in other netfs's too --
> CIFS, Ceph, etc.
>
> The big difference vs FUSE is that they don't allow unprivileged users
> to mount arbitrary filesystems, so it's harder for an attacker to do
> this with only a local unprivileged account to work with.

Exactly my point/concern.

With most netfs's I would assume that reliable connections are
mandatory, otherwise you might be in bigger trouble. That may be one of
the reasons why being stuck forever waiting for writeback on folios was
not identified as a problem so far. Maybe :)

--
Cheers,

David / dhildenb
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2025-01-06 18:17 ` Shakeel Butt 2025-01-07 8:34 ` David Hildenbrand @ 2025-01-07 16:15 ` Miklos Szeredi 2025-01-08 1:40 ` Jingbo Xu 1 sibling, 1 reply; 124+ messages in thread From: Miklos Szeredi @ 2025-01-07 16:15 UTC (permalink / raw) To: Shakeel Butt Cc: David Hildenbrand, Joanne Koong, Bernd Schubert, Zi Yan, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On Mon, 6 Jan 2025 at 19:17, Shakeel Butt <shakeel.butt@linux.dev> wrote: > > On Mon, Jan 06, 2025 at 11:19:42AM +0100, Miklos Szeredi wrote: > > On Fri, 3 Jan 2025 at 21:31, David Hildenbrand <david@redhat.com> wrote: > > > In any case, having movable pages be turned unmovable due to persistent > > > writaback is something that must be fixed, not worked around. Likely a > > > good topic for LSF/MM. > > > > Yes, this seems a good cross fs-mm topic. > > > > So the issue discussed here is that movable pages used for fuse > > page-cache cause a problems when memory needs to be compacted. The > > problem is either that > > > > - the page is skipped, leaving the physical memory block unmovable > > > > - the compaction is blocked for an unbounded time > > > > While the new AS_WRITEBACK_INDETERMINATE could potentially make things > > worse, the same thing happens on readahead, since the new page can be > > locked for an indeterminate amount of time, which can also block > > compaction, right? > > Yes locked pages are unmovable. How much of these locked pages/folios > can be caused by untrusted fuse server? A stuck server would quickly reach the background threshold at which point everything stops. So my guess is that accidentally this won't do much harm. Doing it deliberately (tuning max_background, starting multiple servers) the number of pages that are permanently locked could be basically unlimited. 

Thanks,
Miklos
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2025-01-07 16:15 ` Miklos Szeredi @ 2025-01-08 1:40 ` Jingbo Xu 0 siblings, 0 replies; 124+ messages in thread From: Jingbo Xu @ 2025-01-08 1:40 UTC (permalink / raw) To: Miklos Szeredi, Shakeel Butt Cc: David Hildenbrand, Joanne Koong, Bernd Schubert, Zi Yan, linux-fsdevel, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On 1/8/25 12:15 AM, Miklos Szeredi wrote: > On Mon, 6 Jan 2025 at 19:17, Shakeel Butt <shakeel.butt@linux.dev> wrote: >> >> On Mon, Jan 06, 2025 at 11:19:42AM +0100, Miklos Szeredi wrote: >>> On Fri, 3 Jan 2025 at 21:31, David Hildenbrand <david@redhat.com> wrote: >>>> In any case, having movable pages be turned unmovable due to persistent >>>> writaback is something that must be fixed, not worked around. Likely a >>>> good topic for LSF/MM. >>> >>> Yes, this seems a good cross fs-mm topic. >>> >>> So the issue discussed here is that movable pages used for fuse >>> page-cache cause a problems when memory needs to be compacted. The >>> problem is either that >>> >>> - the page is skipped, leaving the physical memory block unmovable >>> >>> - the compaction is blocked for an unbounded time >>> >>> While the new AS_WRITEBACK_INDETERMINATE could potentially make things >>> worse, the same thing happens on readahead, since the new page can be >>> locked for an indeterminate amount of time, which can also block >>> compaction, right? >> >> Yes locked pages are unmovable. How much of these locked pages/folios >> can be caused by untrusted fuse server? > > A stuck server would quickly reach the background threshold at which > point everything stops. So my guess is that accidentally this won't > do much harm. > > Doing it deliberately (tuning max_background, starting multiple > servers) the number of pages that are permanently locked could be > basically unlimited. 

If "limiting the number of actually unmovable pages to a reasonable
bound" is acceptable, maybe we could limit the maximum number of
background requests that all unprivileged FUSE servers combined can
have outstanding.

BTW currently the writeback requests are not limited by max_background,
as the writeback routine allocates requests with "force == true". We
once noticed that a heavy writeback workload could starve other
background requests (e.g. readahead): the readahead routine was stuck
waiting in fuse_get_req() until the writeback workload finished.

--
Thanks,
Jingbo
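To illustrate the starvation effect described above, here is a toy model of a background-request limit with a "force" bypass. It is an assumption-laden sketch and not the FUSE implementation: struct bg_limiter, bg_admit() and bg_complete() are hypothetical names, and the real code blocks the caller instead of returning false.

```c
#include <stdbool.h>

/* Toy model only: a counter of in-flight background requests. */
struct bg_limiter {
	unsigned int max_background;	/* configured limit */
	unsigned int num_background;	/* requests currently in flight */
};

/*
 * Try to admit one background request.
 * Forced requests (modelling writeback) ignore the limit but still count
 * against it; non-forced requests (modelling readahead) are refused once
 * the limit is reached, where real code would sleep instead.
 */
static bool bg_admit(struct bg_limiter *l, bool force)
{
	if (!force && l->num_background >= l->max_background)
		return false;
	l->num_background++;
	return true;
}

/* Mark one in-flight request as completed. */
static void bg_complete(struct bg_limiter *l)
{
	if (l->num_background > 0)
		l->num_background--;
}
```

With max_background set to 2, three forced requests are all admitted, after which a non-forced request is refused until enough forced ones complete. In this model a sustained stream of forced requests can keep non-forced ones waiting indefinitely.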
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2024-12-30 18:38 ` Joanne Koong 2024-12-30 19:52 ` David Hildenbrand @ 2024-12-30 20:04 ` Shakeel Butt 2025-01-02 19:59 ` Joanne Koong 1 sibling, 1 reply; 124+ messages in thread From: Shakeel Butt @ 2024-12-30 20:04 UTC (permalink / raw) To: Joanne Koong Cc: David Hildenbrand, Bernd Schubert, Zi Yan, miklos, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On Mon, Dec 30, 2024 at 10:38:16AM -0800, Joanne Koong wrote: > On Mon, Dec 30, 2024 at 2:16 AM David Hildenbrand <david@redhat.com> wrote: Thanks David for the response. > > > > >> BTW, I just looked at NFS out of interest, in particular > > >> nfs_page_async_flush(), and I spot some logic about re-dirtying pages + > > >> canceling writeback. IIUC, there are default timeouts for UDP and TCP, > > >> whereby the TCP default one seems to be around 60s (* retrans?), and the > > >> privileged user that mounts it can set higher ones. I guess one could run > > >> into similar writeback issues? > > > > > > > Hi, > > > > sorry for the late reply. > > > > > Yes, I think so. > > > > > >> > > >> So I wonder why we never required AS_WRITEBACK_INDETERMINATE for nfs? > > > > > > I feel like INDETERMINATE in the name is the main cause of confusion. > > > > We are adding logic that says "unconditionally, never wait on writeback > > for these folios, not even any sync migration". That's the main problem > > I have. > > > > Your explanation below is helpful. Because ... > > > > > So, let me explain why it is required (but later I will tell you how it > > > can be avoided). The FUSE thread which is actively handling writeback of > > > a given folio can cause memory allocation either through syscall or page > > > fault. 
That memory allocation can trigger global reclaim synchronously > > > and in cgroup-v1, that FUSE thread can wait on the writeback on the same > > > folio whose writeback it is supposed to end and cauing a deadlock. So, > > > AS_WRITEBACK_INDETERMINATE is used to just avoid this deadlock. > > > > The in-kernel fs avoid this situation through the use of GFP_NOFS > > > allocations. The userspace fs can also use a similar approach which is > > > prctl(PR_SET_IO_FLUSHER, 1) to avoid this situation. However I have been > > > told that it is hard to use as it is per-thread flag and has to be set > > > for all the threads handling writeback which can be error prone if the > > > threadpool is dynamic. Second it is very coarse such that all the > > > allocations from those threads (e.g. page faults) become NOFS which > > > makes userspace very unreliable on highly utilized machine as NOFS can > > > not reclaim potentially a lot of memory and can not trigger oom-kill. > > > > > > > ... now I understand that we want to prevent a deadlock in one specific > > scenario only? > > > > What sounds plausible for me is: > > > > a) Make this only affect the actual deadlock path: sync migration > > during compaction. Communicate it either using some "context" > > information or with a new MIGRATE_SYNC_COMPACTION. > > b) Call it sth. like AS_WRITEBACK_MIGHT_DEADLOCK_ON_RECLAIM to express > > that very deadlock problem. > > c) Leave all others sync migration users alone for now > > The deadlock path is separate from sync migration. The deadlock arises > from a corner case where cgroupv1 reclaim waits on a folio under > writeback where that writeback itself is blocked on reclaim. > Joanne, let's drop the patch to migrate.c completely and let's rename the flag to something like what David is suggesting and only handle in the reclaim path. > > > > Would that prevent the deadlock? Even *better* would be to to be able to > > ask the fs if starting writeback on a specific folio could deadlock. 
> > Because in most cases, as I understand, we'll not actually run into the
> > deadlock and would just want to wait for writeback to just complete
> > (esp. compaction).
> >
> > (I still think having folios under writeback for a long time might be a
> > problem, but that's indeed something to sort out separately in the
> > future, because I suspect NFS has similar issues. We'd want to "wait
> > with timeout" and e.g., cancel writeback during memory
> > offlining/alloc_cma ...)

Thanks David and yes let's handle the folios under writeback issue
separately.

>
> I'm looking back at some of the discussions in v2 [1] and I'm still
> not clear on how memory fragmentation for non-movable pages differs
> from memory fragmentation from movable pages and whether one is worse
> than the other.

I think the fragmentation due to movable pages becoming unmovable is
worse as that situation is unexpected and the kernel can waste a lot of
CPU to defrag the block containing those folios. For non-movable blocks,
the kernel will not even try to defrag. Now we can have a situation
where almost all memory is backed by non-movable blocks and higher order
allocations start failing even when there is enough free memory. For
such situations either the system needs to be restarted (or workloads
restarted if they are the cause of high non-movable memory) or the admin
needs to set up ZONE_MOVABLE where non-movable allocations don't go.

> Currently fuse uses movable temp pages (allocated with
> gfp flags GFP_NOFS | __GFP_HIGHMEM), and these can run into the same
> issue where a buggy/malicious server may never complete writeback.

So, these temp pages are not an issue for fragmenting the movable blocks
but if there is no limit on temp pages, the whole system can become
non-movable (there is a case where movable blocks on non-ZONE_MOVABLE
can be converted into non-movable blocks under low memory). ZONE_MOVABLE
will avoid such a scenario but tuning the right size of ZONE_MOVABLE is
not easy.
> This has the same effect of fragmenting memory and has a worse memory > cost to the system in terms of memory used. With not having temp pages > though, now in this scenario, pages allocated in a movable page block > can't be compacted and that memory is fragmented. My (basic and maybe > incorrect) understanding is that memory gets allocated through a buddy > allocator and moveable vs nonmovable pages get allocated to > corresponding blocks that match their type, but there's no other > difference otherwise. Is this understanding correct? Or is there some > substantial difference between fragmentation for movable vs nonmovable > blocks? The main difference is the fallback of high order allocation which can trigger compaction or background compaction through kcompactd. The kernel will only try to defrag the movable blocks. ^ permalink raw reply [flat|nested] 124+ messages in thread
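The per-thread nature of prctl(PR_SET_IO_FLUSHER, 1) mentioned in the quoted text above can be sketched roughly as follows. This is only an illustration, not code from the patchset or from libfuse: the helper names (mark_self_io_flusher, self_is_io_flusher, writeback_worker_entry) are hypothetical, and the fallback #define values are assumed to match linux/prctl.h on kernels that support the flag (5.6 and later). The call needs CAP_SYS_RESOURCE, so it is expected to fail with EPERM for an unprivileged server.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/prctl.h>

/* Fallbacks for older userspace headers; values assumed from linux/prctl.h. */
#ifndef PR_SET_IO_FLUSHER
#define PR_SET_IO_FLUSHER 57
#endif
#ifndef PR_GET_IO_FLUSHER
#define PR_GET_IO_FLUSHER 58
#endif

/*
 * Mark the *calling thread* as an IO flusher.
 * Returns 0 on success, or the (positive) errno value on failure.
 */
int mark_self_io_flusher(void)
{
    if (prctl(PR_SET_IO_FLUSHER, 1, 0, 0, 0) == -1)
        return errno;
    return 0;
}

/* Returns 1 if the calling thread is marked as an IO flusher, else 0. */
int self_is_io_flusher(void)
{
    return prctl(PR_GET_IO_FLUSHER, 0, 0, 0, 0) == 1;
}

/*
 * Hypothetical thread start routine for a FUSE worker. Because the flag
 * is per-thread, every thread that may handle writeback has to run this
 * before serving requests, including threads added to a pool later on.
 */
void *writeback_worker_entry(void *arg)
{
    int *status = arg;

    *status = mark_self_io_flusher();
    /* ... a real server would now loop, reading requests from /dev/fuse ... */
    return NULL;
}
```

A dynamic thread pool would pass writeback_worker_entry (or an equivalent wrapper) to pthread_create() for each new worker. Forgetting to do so for a single thread is the error-prone part described above.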
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2024-12-30 20:04 ` Shakeel Butt @ 2025-01-02 19:59 ` Joanne Koong 2025-01-02 20:26 ` Zi Yan 0 siblings, 1 reply; 124+ messages in thread From: Joanne Koong @ 2025-01-02 19:59 UTC (permalink / raw) To: Shakeel Butt Cc: David Hildenbrand, Bernd Schubert, Zi Yan, miklos, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On Mon, Dec 30, 2024 at 12:04 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > On Mon, Dec 30, 2024 at 10:38:16AM -0800, Joanne Koong wrote: > > On Mon, Dec 30, 2024 at 2:16 AM David Hildenbrand <david@redhat.com> wrote: > > Thanks David for the response. > > > > > > > >> BTW, I just looked at NFS out of interest, in particular > > > >> nfs_page_async_flush(), and I spot some logic about re-dirtying pages + > > > >> canceling writeback. IIUC, there are default timeouts for UDP and TCP, > > > >> whereby the TCP default one seems to be around 60s (* retrans?), and the > > > >> privileged user that mounts it can set higher ones. I guess one could run > > > >> into similar writeback issues? > > > > > > > > > > Hi, > > > > > > sorry for the late reply. > > > > > > > Yes, I think so. > > > > > > > >> > > > >> So I wonder why we never required AS_WRITEBACK_INDETERMINATE for nfs? > > > > > > > > I feel like INDETERMINATE in the name is the main cause of confusion. > > > > > > We are adding logic that says "unconditionally, never wait on writeback > > > for these folios, not even any sync migration". That's the main problem > > > I have. > > > > > > Your explanation below is helpful. Because ... > > > > > > > So, let me explain why it is required (but later I will tell you how it > > > > can be avoided). The FUSE thread which is actively handling writeback of > > > > a given folio can cause memory allocation either through syscall or page > > > > fault. 
That memory allocation can trigger global reclaim synchronously > > > > and in cgroup-v1, that FUSE thread can wait on the writeback on the same > > > > folio whose writeback it is supposed to end and cauing a deadlock. So, > > > > AS_WRITEBACK_INDETERMINATE is used to just avoid this deadlock. > > > > > The in-kernel fs avoid this situation through the use of GFP_NOFS > > > > allocations. The userspace fs can also use a similar approach which is > > > > prctl(PR_SET_IO_FLUSHER, 1) to avoid this situation. However I have been > > > > told that it is hard to use as it is per-thread flag and has to be set > > > > for all the threads handling writeback which can be error prone if the > > > > threadpool is dynamic. Second it is very coarse such that all the > > > > allocations from those threads (e.g. page faults) become NOFS which > > > > makes userspace very unreliable on highly utilized machine as NOFS can > > > > not reclaim potentially a lot of memory and can not trigger oom-kill. > > > > > > > > > > ... now I understand that we want to prevent a deadlock in one specific > > > scenario only? > > > > > > What sounds plausible for me is: > > > > > > a) Make this only affect the actual deadlock path: sync migration > > > during compaction. Communicate it either using some "context" > > > information or with a new MIGRATE_SYNC_COMPACTION. > > > b) Call it sth. like AS_WRITEBACK_MIGHT_DEADLOCK_ON_RECLAIM to express > > > that very deadlock problem. > > > c) Leave all others sync migration users alone for now > > > > The deadlock path is separate from sync migration. The deadlock arises > > from a corner case where cgroupv1 reclaim waits on a folio under > > writeback where that writeback itself is blocked on reclaim. > > > > Joanne, let's drop the patch to migrate.c completely and let's rename > the flag to something like what David is suggesting and only handle in > the reclaim path. > > > > > > > Would that prevent the deadlock? 
Even *better* would be to to be able to > > > ask the fs if starting writeback on a specific folio could deadlock. > > > Because in most cases, as I understand, we'll not actually run into the > > > deadlock and would just want to wait for writeback to just complete > > > (esp. compaction). > > > > > > (I still think having folios under writeback for a long time might be a > > > problem, but that's indeed something to sort out separately in the > > > future, because I suspect NFS has similar issues. We'd want to "wait > > > with timeout" and e.g., cancel writeback during memory > > > offlining/alloc_cma ...) > > Thanks David and yes let's handle the folios under writeback issue > separately. > > > > > I'm looking back at some of the discussions in v2 [1] and I'm still > > not clear on how memory fragmentation for non-movable pages differs > > from memory fragmentation from movable pages and whether one is worse > > than the other. > > I think the fragmentation due to movable pages becoming unmovable is > worse as that situation is unexpected and the kernel can waste a lot of > CPU to defrag the block containing those folios. For non-movable blocks, > the kernel will not even try to defrag. Now we can have a situation > where almost all memory is backed by non-movable blocks and higher order > allocations start failing even when there is enough free memory. For > such situations either system needs to be restarted (or workloads > restarted if they are cause of high non-movable memory) or the admin > needs to setup ZONE_MOVABLE where non-movable allocations don't go. Thanks for the explanations. The reason I ask is because I'm trying to figure out if having a time interval wait or retry mechanism instead of skipping migration would be a viable solution. 
That is, when attempting migration of folios under writeback whose
mapping has AS_WRITEBACK_INDETERMINATE set, it would wait on folio
writeback for a certain amount of time and then skip the migration if
no progress has been made and the folio is still under writeback.

There are two cases for fuse folios under writeback (for folios not
under writeback, migration will work as is):

a) normal case: the server is not malicious or buggy, and writeback is
   completed in a timely manner.
   For this case, migration would be successful and there'd be no
   difference between having no temp pages vs temp pages.

b) the server is malicious or buggy, e.g. the server never completes
   writeback.

   With no temp pages:
   The folio under writeback prevents a memory block (not sure how big
   this usually is?) from being compacted, leading to memory
   fragmentation.

   With temp pages:
   fuse allocates a non-movable page for every page it needs to write
   back, which worsens memory usage; these pages will never get freed
   since the server never finishes writeback on them. The non-movable
   pages could also fragment memory blocks like in the scenario with no
   temp pages.

Is the b) case with no temp pages worse for memory health than the
scenario with temp pages? For the CPU usage issue (e.g. the kernel keeps
trying to defrag blocks containing these problematic folios), it seems
like this could potentially be mitigated by marking these blocks as
uncompactable?

Thanks,
Joanne

>
> > Currently fuse uses movable temp pages (allocated with
> > gfp flags GFP_NOFS | __GFP_HIGHMEM), and these can run into the same
> > issue where a buggy/malicious server may never complete writeback.
>
> So, these temp pages are not an issue for fragmenting the movable blocks
> but if there is no limit on temp pages, the whole system can become
> non-movable (there is a case where movable blocks on non-ZONE_MOVABLE
> can be converted into non-movable blocks under low memory).
ZONE_MOVABLE > will avoid such scenario but tuning the right size of ZONE_MOVABLE is > not easy. > > > This has the same effect of fragmenting memory and has a worse memory > > cost to the system in terms of memory used. With not having temp pages > > though, now in this scenario, pages allocated in a movable page block > > can't be compacted and that memory is fragmented. My (basic and maybe > > incorrect) understanding is that memory gets allocated through a buddy > > allocator and moveable vs nonmovable pages get allocated to > > corresponding blocks that match their type, but there's no other > > difference otherwise. Is this understanding correct? Or is there some > > substantial difference between fragmentation for movable vs nonmovable > > blocks? > > The main difference is the fallback of high order allocation which can > trigger compaction or background compaction through kcompactd. The > kernel will only try to defrag the movable blocks. > ^ permalink raw reply [flat|nested] 124+ messages in thread
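The "wait for a bounded time, then skip migration" idea from the message above can be modelled in plain userspace C. This is a toy simulation under stated assumptions, not kernel code: struct toy_folio, toy_tick() and toy_migrate_with_timeout() are hypothetical names, time is counted in abstract ticks, and nothing here touches the real folio or migration APIs.

```c
#include <stdbool.h>

/* Toy stand-in for a folio; NOT the kernel's struct folio. */
struct toy_folio {
    bool under_writeback;
    int  ticks_until_done;  /* negative: the server never completes writeback */
};

enum toy_outcome { TOY_MIGRATED, TOY_SKIPPED };

/* Advance simulated time by one tick and end writeback when it is due. */
void toy_tick(struct toy_folio *f)
{
    if (!f->under_writeback || f->ticks_until_done < 0)
        return;                      /* idle, or a server that never finishes */
    if (f->ticks_until_done > 0)
        f->ticks_until_done--;
    if (f->ticks_until_done == 0)
        f->under_writeback = false;  /* server completed the write */
}

/*
 * Wait at most max_wait_ticks for writeback to end. Migrate if it ended
 * in time, otherwise skip the folio instead of blocking indefinitely.
 */
enum toy_outcome toy_migrate_with_timeout(struct toy_folio *f, int max_wait_ticks)
{
    for (int waited = 0; f->under_writeback && waited < max_wait_ticks; waited++)
        toy_tick(f);

    return f->under_writeback ? TOY_SKIPPED : TOY_MIGRATED;
}
```

In this model case a) (a well-behaved server) ends in TOY_MIGRATED as long as the budget covers the writeback latency, while case b) (a server that never completes) always ends in TOY_SKIPPED after the budget is spent. The open question in the thread, how much a skipped folio costs in fragmentation, is outside what the model shows.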
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2025-01-02 19:59 ` Joanne Koong @ 2025-01-02 20:26 ` Zi Yan 0 siblings, 0 replies; 124+ messages in thread From: Zi Yan @ 2025-01-02 20:26 UTC (permalink / raw) To: Joanne Koong, Shakeel Butt Cc: David Hildenbrand, Bernd Schubert, miklos, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On Thu Jan 2, 2025 at 2:59 PM EST, Joanne Koong wrote: > On Mon, Dec 30, 2024 at 12:04 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > > On Mon, Dec 30, 2024 at 10:38:16AM -0800, Joanne Koong wrote: > > > On Mon, Dec 30, 2024 at 2:16 AM David Hildenbrand <david@redhat.com> wrote: > > > > Thanks David for the response. > > > > > > > > > > >> BTW, I just looked at NFS out of interest, in particular > > > > >> nfs_page_async_flush(), and I spot some logic about re-dirtying pages + > > > > >> canceling writeback. IIUC, there are default timeouts for UDP and TCP, > > > > >> whereby the TCP default one seems to be around 60s (* retrans?), and the > > > > >> privileged user that mounts it can set higher ones. I guess one could run > > > > >> into similar writeback issues? > > > > > > > > > > > > > Hi, > > > > > > > > sorry for the late reply. > > > > > > > > > Yes, I think so. > > > > > > > > > >> > > > > >> So I wonder why we never required AS_WRITEBACK_INDETERMINATE for nfs? > > > > > > > > > > I feel like INDETERMINATE in the name is the main cause of confusion. > > > > > > > > We are adding logic that says "unconditionally, never wait on writeback > > > > for these folios, not even any sync migration". That's the main problem > > > > I have. > > > > > > > > Your explanation below is helpful. Because ... > > > > > > > > > So, let me explain why it is required (but later I will tell you how it > > > > > can be avoided). 
The FUSE thread which is actively handling writeback of > > > > > a given folio can cause memory allocation either through syscall or page > > > > > fault. That memory allocation can trigger global reclaim synchronously > > > > > and in cgroup-v1, that FUSE thread can wait on the writeback on the same > > > > > folio whose writeback it is supposed to end and cauing a deadlock. So, > > > > > AS_WRITEBACK_INDETERMINATE is used to just avoid this deadlock. > > > > > > The in-kernel fs avoid this situation through the use of GFP_NOFS > > > > > allocations. The userspace fs can also use a similar approach which is > > > > > prctl(PR_SET_IO_FLUSHER, 1) to avoid this situation. However I have been > > > > > told that it is hard to use as it is per-thread flag and has to be set > > > > > for all the threads handling writeback which can be error prone if the > > > > > threadpool is dynamic. Second it is very coarse such that all the > > > > > allocations from those threads (e.g. page faults) become NOFS which > > > > > makes userspace very unreliable on highly utilized machine as NOFS can > > > > > not reclaim potentially a lot of memory and can not trigger oom-kill. > > > > > > > > > > > > > ... now I understand that we want to prevent a deadlock in one specific > > > > scenario only? > > > > > > > > What sounds plausible for me is: > > > > > > > > a) Make this only affect the actual deadlock path: sync migration > > > > during compaction. Communicate it either using some "context" > > > > information or with a new MIGRATE_SYNC_COMPACTION. > > > > b) Call it sth. like AS_WRITEBACK_MIGHT_DEADLOCK_ON_RECLAIM to express > > > > that very deadlock problem. > > > > c) Leave all others sync migration users alone for now > > > > > > The deadlock path is separate from sync migration. The deadlock arises > > > from a corner case where cgroupv1 reclaim waits on a folio under > > > writeback where that writeback itself is blocked on reclaim. 
> > > > > > > Joanne, let's drop the patch to migrate.c completely and let's rename > > the flag to something like what David is suggesting and only handle in > > the reclaim path. > > > > > > > > > > Would that prevent the deadlock? Even *better* would be to to be able to > > > > ask the fs if starting writeback on a specific folio could deadlock. > > > > Because in most cases, as I understand, we'll not actually run into the > > > > deadlock and would just want to wait for writeback to just complete > > > > (esp. compaction). > > > > > > > > (I still think having folios under writeback for a long time might be a > > > > problem, but that's indeed something to sort out separately in the > > > > future, because I suspect NFS has similar issues. We'd want to "wait > > > > with timeout" and e.g., cancel writeback during memory > > > > offlining/alloc_cma ...) > > > > Thanks David and yes let's handle the folios under writeback issue > > separately. > > > > > > > > I'm looking back at some of the discussions in v2 [1] and I'm still > > > not clear on how memory fragmentation for non-movable pages differs > > > from memory fragmentation from movable pages and whether one is worse > > > than the other. > > > > I think the fragmentation due to movable pages becoming unmovable is > > worse as that situation is unexpected and the kernel can waste a lot of > > CPU to defrag the block containing those folios. For non-movable blocks, > > the kernel will not even try to defrag. Now we can have a situation > > where almost all memory is backed by non-movable blocks and higher order > > allocations start failing even when there is enough free memory. For > > such situations either system needs to be restarted (or workloads > > restarted if they are cause of high non-movable memory) or the admin > > needs to setup ZONE_MOVABLE where non-movable allocations don't go. > > Thanks for the explanations. 
>
> The reason I ask is because I'm trying to figure out if having a time
> interval wait or retry mechanism instead of skipping migration would
> be a viable solution. Where when attempting the migration for folios
> with the as_writeback_indeterminate flag that are under writeback,
> it'll wait on folio writeback for a certain amount of time and then
> skip the migration if no progress has been made and the folio is still
> under writeback.
>
> there are two cases for fuse folios under writeback (for folios not
> under writeback, migration will work as is):
> a) normal case: server is not malicious or buggy, writeback is
> completed in a timely manner.
> For this case, migration would be successful and there'd be no
> difference for this between having no temp pages vs temp pages
>
>
> b) server is malicious or buggy:
> eg the server never completes writeback
>
> With no temp pages:
> The folio under writeback prevents a memory block (not sure how big
> this usually is?) from being compacted, leading to memory
> fragmentation

It is called a pageblock. Its size is usually the same as a PMD THP
(e.g., 2MB on x86_64). With no temp pages, folios can spread across
multiple pageblocks, fragmenting all of them.

>
> With temp pages:
> fuse allocates a non-movable page for every page it needs to write
> back, which worsens memory usage, these pages will never get freed
> since the server never finishes writeback on them. The non-movable
> pages could also fragment memory blocks like in the scenario with no
> temp pages.

Since the temp pages all come from MIGRATE_UNMOVABLE pageblocks, of
which there are far fewer, the fragmentation is much more limited.

>
>
> Is the b) case with no temp pages worse for memory health than the
> scenario with temp pages? For the cpu usage issue (eg kernel keeps
> trying to defrag blocks containing these problematic folios), it seems
> like this could be potentially mitigated by marking these blocks as
> uncompactable?

With no temp pages, folios under writeback can potentially fragment
more pageblocks, if not all of them, than with temp pages. That is
because MIGRATE_UNMOVABLE pageblocks are used for unmovable page
allocations, like kernel data allocations, and are supposed to be far
fewer than the MIGRATE_MOVABLE pageblocks in the system.

>
>
> Thanks,
> Joanne
>
> >
> > > Currently fuse uses movable temp pages (allocated with
> > > gfp flags GFP_NOFS | __GFP_HIGHMEM), and these can run into the same
> > > issue where a buggy/malicious server may never complete writeback.
> >
> > So, these temp pages are not an issue for fragmenting the movable blocks
> > but if there is no limit on temp pages, the whole system can become
> > non-movable (there is a case where movable blocks on non-ZONE_MOVABLE
> > can be converted into non-movable blocks under low memory). ZONE_MOVABLE
> > will avoid such scenario but tuning the right size of ZONE_MOVABLE is
> > not easy.
> >
> > > This has the same effect of fragmenting memory and has a worse memory
> > > cost to the system in terms of memory used. With not having temp pages
> > > though, now in this scenario, pages allocated in a movable page block
> > > can't be compacted and that memory is fragmented. My (basic and maybe
> > > incorrect) understanding is that memory gets allocated through a buddy
> > > allocator and moveable vs nonmovable pages get allocated to
> > > corresponding blocks that match their type, but there's no other
> > > difference otherwise. Is this understanding correct? Or is there some
> > > substantial difference between fragmentation for movable vs nonmovable
> > > blocks?
> >
> > The main difference is the fallback of high order allocation which can
> > trigger compaction or background compaction through kcompactd. The
> > kernel will only try to defrag the movable blocks.
> >

--
Best Regards,
Yan, Zi
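To put rough numbers on the pageblock point above, here is a small arithmetic sketch. It assumes 4 KiB base pages (typical for x86_64, but not stated in the thread) together with the 2 MB pageblock size mentioned in the reply; the function names are hypothetical. The result is an upper bound on how much memory sits in pageblocks containing at least one stuck folio, not a prediction of what a real system would see.

```c
/* Assumed sizes: 4 KiB base pages, 2 MiB pageblocks (x86_64 example). */
#define BASE_PAGE_SIZE  (4ULL * 1024)
#define PAGEBLOCK_SIZE  (2ULL * 1024 * 1024)

/* Number of base pages that make up one pageblock. */
unsigned long long pages_per_pageblock(void)
{
    return PAGEBLOCK_SIZE / BASE_PAGE_SIZE;
}

/*
 * Worst case: every stuck single-page folio lands in a different
 * pageblock, so each one keeps a whole pageblock from being fully
 * compacted. The count is capped by the number of pageblocks present.
 */
unsigned long long worst_case_affected_bytes(unsigned long long stuck_folios,
                                             unsigned long long total_pageblocks)
{
    unsigned long long blocks =
        stuck_folios < total_pageblocks ? stuck_folios : total_pageblocks;

    return blocks * PAGEBLOCK_SIZE;
}
```

Under these assumptions a pageblock holds 512 base pages, so in the worst case 1000 stuck 4 KiB folios (about 4 MB of data) could touch 1000 pageblocks, roughly 2 GB of memory. With temp pages the same stuck data would instead be confined to the comparatively few MIGRATE_UNMOVABLE pageblocks.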
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2024-12-20 14:49 ` David Hildenbrand 2024-12-20 15:26 ` Bernd Schubert 2024-12-20 18:01 ` Shakeel Butt @ 2024-12-20 21:01 ` Joanne Koong 2024-12-21 16:25 ` David Hildenbrand 2 siblings, 1 reply; 124+ messages in thread From: Joanne Koong @ 2024-12-20 21:01 UTC (permalink / raw) To: David Hildenbrand Cc: Bernd Schubert, Shakeel Butt, Zi Yan, miklos, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On Fri, Dec 20, 2024 at 6:49 AM David Hildenbrand <david@redhat.com> wrote: > > >> I'm wondering if there would be a way to just "cancel" the writeback and > >> mark the folio dirty again. That way it could be migrated, but not > >> reclaimed. At least we could avoid the whole AS_WRITEBACK_INDETERMINATE > >> thing. > >> > > > > That is what I basically meant with short timeouts. Obviously it is not > > that simple to cancel the request and to retry - it would add in quite > > some complexity, if all the issues that arise can be solved at all. > > At least it would keep that out of core-mm. > > AS_WRITEBACK_INDETERMINATE really has weird smell to it ... we should > try to improve such scenarios, not acknowledge and integrate them, then > work around using timeouts that must be manually configured, and ca > likely no be default enabled because it could hurt reasonable use cases :( > > Right now we clear the writeback flag immediately, indicating that data > was written back, when in fact it was not written back at all. I suspect > fsync() currently handles that manually already, to wait for any of the > allocated pages to actually get written back by user space, so we have > control over when something was *actually* written back. > > > Similar to your proposal, I wonder if there could be a way to request > fuse to "abort" a writeback request (instead of using fixed timeouts per > request). 
Meaning, when we stumble over a folio that is under writeback
> on some paths, we would tell fuse to "end writeback now", or "end
> writeback now if it takes longer than X". Essentially hidden inside
> folio_wait_writeback().
>
> When aborting a request, as I said, we would essentially "end writeback"
> and mark the folio as dirty again. The interesting thing is likely how
> to handle user space that wants to process this request right now (stuck
> in fuse_send_writepage() I assume?), correct?

This would be fine if the writeback request hasn't been sent to
userspace yet, but if it has and the pages are spliced, then ending
writeback could lead to crashes if the pipe buffer's buf->page is
accessed while the page is being migrated. When a page/folio is being
migrated, is there some state set on the page to indicate that it's
currently under migration? The only workaround I can see for the splice
case that doesn't resort to bringing back extra copies is to have
splice somehow ensure that the page isn't being migrated while it's
accessing it.


Thanks,
Joanne

>
> Just throwing it out there ... no expert at all on fuse ...
>
> --
> Cheers,
>
> David / dhildenb
>
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2024-12-20 21:01 ` Joanne Koong @ 2024-12-21 16:25 ` David Hildenbrand 2024-12-21 21:59 ` Bernd Schubert 0 siblings, 1 reply; 124+ messages in thread From: David Hildenbrand @ 2024-12-21 16:25 UTC (permalink / raw) To: Joanne Koong Cc: Bernd Schubert, Shakeel Butt, Zi Yan, miklos, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On 20.12.24 22:01, Joanne Koong wrote: > On Fri, Dec 20, 2024 at 6:49 AM David Hildenbrand <david@redhat.com> wrote: >> >>>> I'm wondering if there would be a way to just "cancel" the writeback and >>>> mark the folio dirty again. That way it could be migrated, but not >>>> reclaimed. At least we could avoid the whole AS_WRITEBACK_INDETERMINATE >>>> thing. >>>> >>> >>> That is what I basically meant with short timeouts. Obviously it is not >>> that simple to cancel the request and to retry - it would add in quite >>> some complexity, if all the issues that arise can be solved at all. >> >> At least it would keep that out of core-mm. >> >> AS_WRITEBACK_INDETERMINATE really has weird smell to it ... we should >> try to improve such scenarios, not acknowledge and integrate them, then >> work around using timeouts that must be manually configured, and ca >> likely no be default enabled because it could hurt reasonable use cases :( >> >> Right now we clear the writeback flag immediately, indicating that data >> was written back, when in fact it was not written back at all. I suspect >> fsync() currently handles that manually already, to wait for any of the >> allocated pages to actually get written back by user space, so we have >> control over when something was *actually* written back. >> >> >> Similar to your proposal, I wonder if there could be a way to request >> fuse to "abort" a writeback request (instead of using fixed timeouts per >> request). 
Meaning, when we stumble over a folio that is under writeback
>> on some paths, we would tell fuse to "end writeback now", or "end
>> writeback now if it takes longer than X". Essentially hidden inside
>> folio_wait_writeback().
>>
>> When aborting a request, as I said, we would essentially "end writeback"
>> and mark the folio as dirty again. The interesting thing is likely how
>> to handle user space that wants to process this request right now (stuck
>> in fuse_send_writepage() I assume?), correct?
>
> This would be fine if the writeback request hasn't been sent yet to
> userspace but if it has and the pages are spliced

Can you point me at the code where that splicing happens?

> , then ending
> writeback could lead to memory crashes if the pipebuf buf->page is
> accessed as it's being migrated. When a page/folio is being migrated,
> is there some state set on the page to indicate that it's currently
> under migration?

Unfortunately not really. It should be isolated and locked. So it would
be a !LRU but locked folio.

--
Cheers,

David / dhildenb
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2024-12-21 16:25 ` David Hildenbrand @ 2024-12-21 21:59 ` Bernd Schubert 2024-12-23 19:00 ` Joanne Koong 0 siblings, 1 reply; 124+ messages in thread From: Bernd Schubert @ 2024-12-21 21:59 UTC (permalink / raw) To: David Hildenbrand, Joanne Koong Cc: Shakeel Butt, Zi Yan, miklos, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On 12/21/24 17:25, David Hildenbrand wrote: > On 20.12.24 22:01, Joanne Koong wrote: >> On Fri, Dec 20, 2024 at 6:49 AM David Hildenbrand <david@redhat.com> >> wrote: >>> >>>>> I'm wondering if there would be a way to just "cancel" the >>>>> writeback and >>>>> mark the folio dirty again. That way it could be migrated, but not >>>>> reclaimed. At least we could avoid the whole >>>>> AS_WRITEBACK_INDETERMINATE >>>>> thing. >>>>> >>>> >>>> That is what I basically meant with short timeouts. Obviously it is not >>>> that simple to cancel the request and to retry - it would add in quite >>>> some complexity, if all the issues that arise can be solved at all. >>> >>> At least it would keep that out of core-mm. >>> >>> AS_WRITEBACK_INDETERMINATE really has weird smell to it ... we should >>> try to improve such scenarios, not acknowledge and integrate them, then >>> work around using timeouts that must be manually configured, and ca >>> likely no be default enabled because it could hurt reasonable use >>> cases :( >>> >>> Right now we clear the writeback flag immediately, indicating that data >>> was written back, when in fact it was not written back at all. I suspect >>> fsync() currently handles that manually already, to wait for any of the >>> allocated pages to actually get written back by user space, so we have >>> control over when something was *actually* written back. 
>>>
>>>
>>> Similar to your proposal, I wonder if there could be a way to request
>>> fuse to "abort" a writeback request (instead of using fixed timeouts per
>>> request). Meaning, when we stumble over a folio that is under writeback
>>> on some paths, we would tell fuse to "end writeback now", or "end
>>> writeback now if it takes longer than X". Essentially hidden inside
>>> folio_wait_writeback().
>>>
>>> When aborting a request, as I said, we would essentially "end writeback"
>>> and mark the folio as dirty again. The interesting thing is likely how
>>> to handle user space that wants to process this request right now (stuck
>>> in fuse_send_writepage() I assume?), correct?
>>
>> This would be fine if the writeback request hasn't been sent yet to
>> userspace but if it has and the pages are spliced
>
> Can you point me at the code where that splicing happens?

fuse_dev_splice_read()
  fuse_dev_do_read()
    fuse_copy_args()
      fuse_copy_page


Btw, for the non splice case, disabling migration should be
only needed while it is copying to the userspace buffer?


Thanks,
Bernd
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2024-12-21 21:59 ` Bernd Schubert @ 2024-12-23 19:00 ` Joanne Koong 2024-12-26 22:44 ` Bernd Schubert 0 siblings, 1 reply; 124+ messages in thread From: Joanne Koong @ 2024-12-23 19:00 UTC (permalink / raw) To: Bernd Schubert Cc: David Hildenbrand, Shakeel Butt, Zi Yan, miklos, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On Sat, Dec 21, 2024 at 1:59 PM Bernd Schubert <bernd.schubert@fastmail.fm> wrote: > > > > On 12/21/24 17:25, David Hildenbrand wrote: > > On 20.12.24 22:01, Joanne Koong wrote: > >> On Fri, Dec 20, 2024 at 6:49 AM David Hildenbrand <david@redhat.com> > >> wrote: > >>> > >>>>> I'm wondering if there would be a way to just "cancel" the > >>>>> writeback and > >>>>> mark the folio dirty again. That way it could be migrated, but not > >>>>> reclaimed. At least we could avoid the whole > >>>>> AS_WRITEBACK_INDETERMINATE > >>>>> thing. > >>>>> > >>>> > >>>> That is what I basically meant with short timeouts. Obviously it is not > >>>> that simple to cancel the request and to retry - it would add in quite > >>>> some complexity, if all the issues that arise can be solved at all. > >>> > >>> At least it would keep that out of core-mm. > >>> > >>> AS_WRITEBACK_INDETERMINATE really has weird smell to it ... we should > >>> try to improve such scenarios, not acknowledge and integrate them, then > >>> work around using timeouts that must be manually configured, and ca > >>> likely no be default enabled because it could hurt reasonable use > >>> cases :( > >>> > >>> Right now we clear the writeback flag immediately, indicating that data > >>> was written back, when in fact it was not written back at all. 
I suspect
> >>> fsync() currently handles that manually already, to wait for any of the
> >>> allocated pages to actually get written back by user space, so we have
> >>> control over when something was *actually* written back.
> >>>
> >>>
> >>> Similar to your proposal, I wonder if there could be a way to request
> >>> fuse to "abort" a writeback request (instead of using fixed timeouts per
> >>> request). Meaning, when we stumble over a folio that is under writeback
> >>> on some paths, we would tell fuse to "end writeback now", or "end
> >>> writeback now if it takes longer than X". Essentially hidden inside
> >>> folio_wait_writeback().
> >>>
> >>> When aborting a request, as I said, we would essentially "end writeback"
> >>> and mark the folio as dirty again. The interesting thing is likely how
> >>> to handle user space that wants to process this request right now (stuck
> >>> in fuse_send_writepage() I assume?), correct?
> >>
> >> This would be fine if the writeback request hasn't been sent yet to
> >> userspace but if it has and the pages are spliced
> >
> > Can you point me at the code where that splicing happens?
>
> fuse_dev_splice_read()
> fuse_dev_do_read()
> fuse_copy_args()
> fuse_copy_page
>
>
> Btw, for the non splice case, disabling migration should be
> only needed while it is copying to the userspace buffer?

I don't think so. We don't currently disable migration when copying
to/from the userspace buffer for reads.


Thanks,
Joanne

>
>
>
> Thanks,
> Bernd
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2024-12-23 19:00 ` Joanne Koong @ 2024-12-26 22:44 ` Bernd Schubert 2024-12-27 18:25 ` Joanne Koong 0 siblings, 1 reply; 124+ messages in thread From: Bernd Schubert @ 2024-12-26 22:44 UTC (permalink / raw) To: Joanne Koong Cc: David Hildenbrand, Shakeel Butt, Zi Yan, miklos, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On 12/23/24 20:00, Joanne Koong wrote: > On Sat, Dec 21, 2024 at 1:59 PM Bernd Schubert > <bernd.schubert@fastmail.fm> wrote: >> >> >> >> On 12/21/24 17:25, David Hildenbrand wrote: >>> On 20.12.24 22:01, Joanne Koong wrote: >>>> On Fri, Dec 20, 2024 at 6:49 AM David Hildenbrand <david@redhat.com> >>>> wrote: >>>>> >>>>>>> I'm wondering if there would be a way to just "cancel" the >>>>>>> writeback and >>>>>>> mark the folio dirty again. That way it could be migrated, but not >>>>>>> reclaimed. At least we could avoid the whole >>>>>>> AS_WRITEBACK_INDETERMINATE >>>>>>> thing. >>>>>>> >>>>>> >>>>>> That is what I basically meant with short timeouts. Obviously it is not >>>>>> that simple to cancel the request and to retry - it would add in quite >>>>>> some complexity, if all the issues that arise can be solved at all. >>>>> >>>>> At least it would keep that out of core-mm. >>>>> >>>>> AS_WRITEBACK_INDETERMINATE really has weird smell to it ... we should >>>>> try to improve such scenarios, not acknowledge and integrate them, then >>>>> work around using timeouts that must be manually configured, and ca >>>>> likely no be default enabled because it could hurt reasonable use >>>>> cases :( >>>>> >>>>> Right now we clear the writeback flag immediately, indicating that data >>>>> was written back, when in fact it was not written back at all. 
I suspect >>>>> fsync() currently handles that manually already, to wait for any of the >>>>> allocated pages to actually get written back by user space, so we have >>>>> control over when something was *actually* written back. >>>>> >>>>> >>>>> Similar to your proposal, I wonder if there could be a way to request >>>>> fuse to "abort" a writeback request (instead of using fixed timeouts per >>>>> request). Meaning, when we stumble over a folio that is under writeback >>>>> on some paths, we would tell fuse to "end writeback now", or "end >>>>> writeback now if it takes longer than X". Essentially hidden inside >>>>> folio_wait_writeback(). >>>>> >>>>> When aborting a request, as I said, we would essentially "end writeback" >>>>> and mark the folio as dirty again. The interesting thing is likely how >>>>> to handle user space that wants to process this request right now (stuck >>>>> in fuse_send_writepage() I assume?), correct? >>>> >>>> This would be fine if the writeback request hasn't been sent yet to >>>> userspace but if it has and the pages are spliced >>> >>> Can you point me at the code where that splicing happens? >> >> fuse_dev_splice_read() >> fuse_dev_do_read() >> fuse_copy_args() >> fuse_copy_page >> >> >> Btw, for the non splice case, disabling migration should be >> only needed while it is copying to the userspace buffer? > > I don't think so. We don't currently disable migration when copying > to/from the userspace buffer for reads. Sorry for my late reply. I'm confused about "reads". This discussions is about writeback? Without your patches we have tmp-pages - migration disabled on these. With your patches we have AS_WRITEBACK_INDETERMINATE - migration also disabled? I think we have two code paths a) fuse_dev_read - does a full buffer copy. Why do we need tmp-pages for these at all? The only time migration must not run on these pages while it is copying to the userspace buffer? 
b) fuse_dev_splice_read - isn't this our real problem, as we don't know when pages in the pipe are getting consumed? Thanks, Bernd ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2024-12-26 22:44 ` Bernd Schubert @ 2024-12-27 18:25 ` Joanne Koong 0 siblings, 0 replies; 124+ messages in thread From: Joanne Koong @ 2024-12-27 18:25 UTC (permalink / raw) To: Bernd Schubert Cc: David Hildenbrand, Shakeel Butt, Zi Yan, miklos, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On Thu, Dec 26, 2024 at 2:44 PM Bernd Schubert <bernd.schubert@fastmail.fm> wrote: > > On 12/23/24 20:00, Joanne Koong wrote: > > On Sat, Dec 21, 2024 at 1:59 PM Bernd Schubert > > <bernd.schubert@fastmail.fm> wrote: > >> > >> > >> > >> On 12/21/24 17:25, David Hildenbrand wrote: > >>> On 20.12.24 22:01, Joanne Koong wrote: > >>>> On Fri, Dec 20, 2024 at 6:49 AM David Hildenbrand <david@redhat.com> > >>>> wrote: > >>>>> > >>>>>>> I'm wondering if there would be a way to just "cancel" the > >>>>>>> writeback and > >>>>>>> mark the folio dirty again. That way it could be migrated, but not > >>>>>>> reclaimed. At least we could avoid the whole > >>>>>>> AS_WRITEBACK_INDETERMINATE > >>>>>>> thing. > >>>>>>> > >>>>>> > >>>>>> That is what I basically meant with short timeouts. Obviously it is not > >>>>>> that simple to cancel the request and to retry - it would add in quite > >>>>>> some complexity, if all the issues that arise can be solved at all. > >>>>> > >>>>> At least it would keep that out of core-mm. > >>>>> > >>>>> AS_WRITEBACK_INDETERMINATE really has weird smell to it ... we should > >>>>> try to improve such scenarios, not acknowledge and integrate them, then > >>>>> work around using timeouts that must be manually configured, and ca > >>>>> likely no be default enabled because it could hurt reasonable use > >>>>> cases :( > >>>>> > >>>>> Right now we clear the writeback flag immediately, indicating that data > >>>>> was written back, when in fact it was not written back at all. 
I suspect > >>>>> fsync() currently handles that manually already, to wait for any of the > >>>>> allocated pages to actually get written back by user space, so we have > >>>>> control over when something was *actually* written back. > >>>>> > >>>>> > >>>>> Similar to your proposal, I wonder if there could be a way to request > >>>>> fuse to "abort" a writeback request (instead of using fixed timeouts per > >>>>> request). Meaning, when we stumble over a folio that is under writeback > >>>>> on some paths, we would tell fuse to "end writeback now", or "end > >>>>> writeback now if it takes longer than X". Essentially hidden inside > >>>>> folio_wait_writeback(). > >>>>> > >>>>> When aborting a request, as I said, we would essentially "end writeback" > >>>>> and mark the folio as dirty again. The interesting thing is likely how > >>>>> to handle user space that wants to process this request right now (stuck > >>>>> in fuse_send_writepage() I assume?), correct? > >>>> > >>>> This would be fine if the writeback request hasn't been sent yet to > >>>> userspace but if it has and the pages are spliced > >>> > >>> Can you point me at the code where that splicing happens? > >> > >> fuse_dev_splice_read() > >> fuse_dev_do_read() > >> fuse_copy_args() > >> fuse_copy_page > >> > >> > >> Btw, for the non splice case, disabling migration should be > >> only needed while it is copying to the userspace buffer? > > > > I don't think so. We don't currently disable migration when copying > > to/from the userspace buffer for reads. > > > Sorry for my late reply. I'm confused about "reads". This discussions > is about writeback? Whether we need to disable migration for copying to/from the userspace buffers for non-tmp pages should be the same between handling reads or writes, no? 
That's why I brought up reads, but looking more at how fuse handles readahead and read_folio(), it looks like the folio's lock is held while it's being copied out, and IIUC that's enough to disable migration since migration will wait on the lock. So if we end writeback on the non-tmp page, it seems like we'd probably need to do something similar first.

> Without your patches we have tmp-pages - migration disabled on these.
> With your patches we have AS_WRITEBACK_INDETERMINATE - migration
> also disabled?
>
> I think we have two code paths
>
> a) fuse_dev_read - does a full buffer copy. Why do we need tmp-pages
> for these at all? The only time migration must not run on these pages
> while it is copying to the userspace buffer?

The tmp pages were originally introduced for avoiding deadlock on reclaim and avoiding hanging sync()s as well.

[1] https://lore.kernel.org/linux-kernel/bd49fcba-3eb6-4e84-a0f0-e73bce31ddb2@linux.alibaba.com/

>
> b) fuse_dev_splice_read - isn't this our real problem, as we don't
> know when pages in the pipe are getting consumed?

Yes, the splice case nixes the idea unfortunately. Everything else we could find a workaround for, but there's no way I can see to avoid this for splice.

Thanks,
Joanne

>
>
> Thanks,
> Bernd
>

^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2024-12-19 17:26 ` David Hildenbrand 2024-12-19 17:30 ` Bernd Schubert @ 2024-12-19 17:55 ` Joanne Koong 2024-12-19 18:04 ` Bernd Schubert 1 sibling, 1 reply; 124+ messages in thread From: Joanne Koong @ 2024-12-19 17:55 UTC (permalink / raw) To: David Hildenbrand Cc: Shakeel Butt, Zi Yan, miklos, linux-fsdevel, jefflexu, josef, bernd.schubert, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On Thu, Dec 19, 2024 at 9:26 AM David Hildenbrand <david@redhat.com> wrote: > > On 19.12.24 18:14, Shakeel Butt wrote: > > On Thu, Dec 19, 2024 at 05:41:36PM +0100, David Hildenbrand wrote: > >> On 19.12.24 17:40, Shakeel Butt wrote: > >>> On Thu, Dec 19, 2024 at 05:29:08PM +0100, David Hildenbrand wrote: > >>> [...] > >>>>> > >>>>> If you check the code just above this patch, this > >>>>> mapping_writeback_indeterminate() check only happen for pages under > >>>>> writeback which is a temp state. Anyways, fuse folios should not be > >>>>> unmovable for their lifetime but only while under writeback which is > >>>>> same for all fs. > >>>> > >>>> But there, writeback is expected to be a temporary thing, not possibly: > >>>> "AS_WRITEBACK_INDETERMINATE", that is a BIG difference. > >>>> > >>>> I'll have to NACK anything that violates ZONE_MOVABLE / ALLOC_CMA > >>>> guarantees, and unfortunately, it sounds like this is the case here, unless > >>>> I am missing something important. > >>>> > >>> > >>> It might just be the name "AS_WRITEBACK_INDETERMINATE" is causing > >>> the confusion. The writeback state is not indefinite. A proper fuse fs, > >>> like anyother fs, should handle writeback pages appropriately. These > >>> additional checks and skips are for (I think) untrusted fuse servers. > >> > >> Can unprivileged user space provoke this case? > > > > Let's ask Joanne and other fuse folks about the above question. 
> > > > Let's say unprivileged user space can start a untrusted fuse server, > > mount fuse, allocate and dirty a lot of fuse folios (within its dirty > > and memcg limits) and trigger the writeback. To cause pain (through > > fragmentation), it is not clearing the writeback state. Is this the > > scenario you are envisioning? > This scenario can already happen with temp pages. An untrusted malicious fuse server may allocate and dirty a lot of fuse folios within its dirty/memcg limits and never clear writeback on any of them and tie up system resources. This certainly isn't the common case, but it is a possibility. However, request timeouts can be set by the system admin [1] to protect against malicious/buggy fuse servers that try to do this. If the request isn't replied to by a certain amount of time, then the connection will be aborted and writeback state and other resources will be cleared/freed. Thanks, Joanne [1] https://lore.kernel.org/linux-fsdevel/20241218222630.99920-1-joannelkoong@gmail.com/T/#t > Yes, for example causing harm on a shared host (containers, ...). > > If it cannot happen, we should make it very clear in documentation and > patch descriptions that it can only cause harm with privileged user > space, and that this harm can make things like CMA allocations, memory > onplug, ... fail, which is rather bad and against concepts like > ZONE_MOVABLE/MIGRATE_CMA. > > Although I wonder what would happen if the privileged user space daemon > crashes (e.g., OOM killer?) and simply no longer replies to any messages. > > -- > Cheers, > > David / dhildenb > ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2024-12-19 17:55 ` Joanne Koong @ 2024-12-19 18:04 ` Bernd Schubert 2024-12-19 18:11 ` Shakeel Butt 0 siblings, 1 reply; 124+ messages in thread From: Bernd Schubert @ 2024-12-19 18:04 UTC (permalink / raw) To: Joanne Koong, David Hildenbrand Cc: Shakeel Butt, Zi Yan, miklos, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On 12/19/24 18:55, Joanne Koong wrote: > On Thu, Dec 19, 2024 at 9:26 AM David Hildenbrand <david@redhat.com> wrote: >> >> On 19.12.24 18:14, Shakeel Butt wrote: >>> On Thu, Dec 19, 2024 at 05:41:36PM +0100, David Hildenbrand wrote: >>>> On 19.12.24 17:40, Shakeel Butt wrote: >>>>> On Thu, Dec 19, 2024 at 05:29:08PM +0100, David Hildenbrand wrote: >>>>> [...] >>>>>>> >>>>>>> If you check the code just above this patch, this >>>>>>> mapping_writeback_indeterminate() check only happen for pages under >>>>>>> writeback which is a temp state. Anyways, fuse folios should not be >>>>>>> unmovable for their lifetime but only while under writeback which is >>>>>>> same for all fs. >>>>>> >>>>>> But there, writeback is expected to be a temporary thing, not possibly: >>>>>> "AS_WRITEBACK_INDETERMINATE", that is a BIG difference. >>>>>> >>>>>> I'll have to NACK anything that violates ZONE_MOVABLE / ALLOC_CMA >>>>>> guarantees, and unfortunately, it sounds like this is the case here, unless >>>>>> I am missing something important. >>>>>> >>>>> >>>>> It might just be the name "AS_WRITEBACK_INDETERMINATE" is causing >>>>> the confusion. The writeback state is not indefinite. A proper fuse fs, >>>>> like anyother fs, should handle writeback pages appropriately. These >>>>> additional checks and skips are for (I think) untrusted fuse servers. >>>> >>>> Can unprivileged user space provoke this case? >>> >>> Let's ask Joanne and other fuse folks about the above question. 
>>>
>>> Let's say unprivileged user space can start a untrusted fuse server,
>>> mount fuse, allocate and dirty a lot of fuse folios (within its dirty
>>> and memcg limits) and trigger the writeback. To cause pain (through
>>> fragmentation), it is not clearing the writeback state. Is this the
>>> scenario you are envisioning?
>>
>
> This scenario can already happen with temp pages. An untrusted
> malicious fuse server may allocate and dirty a lot of fuse folios
> within its dirty/memcg limits and never clear writeback on any of them
> and tie up system resources. This certainly isn't the common case, but
> it is a possibility. However, request timeouts can be set by the
> system admin [1] to protect against malicious/buggy fuse servers that
> try to do this. If the request isn't replied to by a certain amount of
> time, then the connection will be aborted and writeback state and
> other resources will be cleared/freed.
>

I think what Zi points out is that this is a current implementation issue and that these temp pages should be in a contiguous range. Obviously it is better to avoid a tmp copy at all.

Thanks,
Bernd

^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2024-12-19 18:04 ` Bernd Schubert @ 2024-12-19 18:11 ` Shakeel Butt 0 siblings, 0 replies; 124+ messages in thread From: Shakeel Butt @ 2024-12-19 18:11 UTC (permalink / raw) To: Bernd Schubert Cc: Joanne Koong, David Hildenbrand, Zi Yan, miklos, linux-fsdevel, jefflexu, josef, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko On Thu, Dec 19, 2024 at 07:04:40PM +0100, Bernd Schubert wrote: > > > On 12/19/24 18:55, Joanne Koong wrote: > > On Thu, Dec 19, 2024 at 9:26 AM David Hildenbrand <david@redhat.com> wrote: > >> > >> On 19.12.24 18:14, Shakeel Butt wrote: > >>> On Thu, Dec 19, 2024 at 05:41:36PM +0100, David Hildenbrand wrote: > >>>> On 19.12.24 17:40, Shakeel Butt wrote: > >>>>> On Thu, Dec 19, 2024 at 05:29:08PM +0100, David Hildenbrand wrote: > >>>>> [...] > >>>>>>> > >>>>>>> If you check the code just above this patch, this > >>>>>>> mapping_writeback_indeterminate() check only happen for pages under > >>>>>>> writeback which is a temp state. Anyways, fuse folios should not be > >>>>>>> unmovable for their lifetime but only while under writeback which is > >>>>>>> same for all fs. > >>>>>> > >>>>>> But there, writeback is expected to be a temporary thing, not possibly: > >>>>>> "AS_WRITEBACK_INDETERMINATE", that is a BIG difference. > >>>>>> > >>>>>> I'll have to NACK anything that violates ZONE_MOVABLE / ALLOC_CMA > >>>>>> guarantees, and unfortunately, it sounds like this is the case here, unless > >>>>>> I am missing something important. > >>>>>> > >>>>> > >>>>> It might just be the name "AS_WRITEBACK_INDETERMINATE" is causing > >>>>> the confusion. The writeback state is not indefinite. A proper fuse fs, > >>>>> like anyother fs, should handle writeback pages appropriately. These > >>>>> additional checks and skips are for (I think) untrusted fuse servers. > >>>> > >>>> Can unprivileged user space provoke this case? 
> >>>
> >>> Let's ask Joanne and other fuse folks about the above question.
> >>>
> >>> Let's say unprivileged user space can start a untrusted fuse server,
> >>> mount fuse, allocate and dirty a lot of fuse folios (within its dirty
> >>> and memcg limits) and trigger the writeback. To cause pain (through
> >>> fragmentation), it is not clearing the writeback state. Is this the
> >>> scenario you are envisioning?
> >>
> >
> > This scenario can already happen with temp pages. An untrusted
> > malicious fuse server may allocate and dirty a lot of fuse folios
> > within its dirty/memcg limits and never clear writeback on any of them
> > and tie up system resources. This certainly isn't the common case, but
> > it is a possibility. However, request timeouts can be set by the
> > system admin [1] to protect against malicious/buggy fuse servers that
> > try to do this. If the request isn't replied to by a certain amount of
> > time, then the connection will be aborted and writeback state and
> > other resources will be cleared/freed.
> >
>
> I think what Zi points out is that this is a current implementation issue
> and that these temp pages should be in a contiguous range.
> Obviously it is better to avoid a tmp copy at all.

The current tmp pages are allocated from MIGRATE_UNMOVABLE. I don't see any additional benefit in reserving contiguous unmovable memory regions for tmp pages. It will just add complexity without any clear benefit.

^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2024-12-19 16:41 ` David Hildenbrand 2024-12-19 17:14 ` Shakeel Butt @ 2024-12-20 7:55 ` Jingbo Xu 1 sibling, 0 replies; 124+ messages in thread From: Jingbo Xu @ 2024-12-20 7:55 UTC (permalink / raw) To: David Hildenbrand, Shakeel Butt Cc: Zi Yan, Joanne Koong, miklos, linux-fsdevel, josef, bernd.schubert, linux-mm, kernel-team, Matthew Wilcox, Oscar Salvador, Michal Hocko Hi, On 12/20/24 12:41 AM, David Hildenbrand wrote: > On 19.12.24 17:40, Shakeel Butt wrote: >> On Thu, Dec 19, 2024 at 05:29:08PM +0100, David Hildenbrand wrote: >> [...] >>>> >>>> If you check the code just above this patch, this >>>> mapping_writeback_indeterminate() check only happen for pages under >>>> writeback which is a temp state. Anyways, fuse folios should not be >>>> unmovable for their lifetime but only while under writeback which is >>>> same for all fs. >>> >>> But there, writeback is expected to be a temporary thing, not possibly: >>> "AS_WRITEBACK_INDETERMINATE", that is a BIG difference. >>> >>> I'll have to NACK anything that violates ZONE_MOVABLE / ALLOC_CMA >>> guarantees, and unfortunately, it sounds like this is the case here, >>> unless >>> I am missing something important. >>> >> >> It might just be the name "AS_WRITEBACK_INDETERMINATE" is causing >> the confusion. The writeback state is not indefinite. A proper fuse fs, >> like anyother fs, should handle writeback pages appropriately. These >> additional checks and skips are for (I think) untrusted fuse servers. > > Can unprivileged user space provoke this case? > There are some details on the initial problem that FUSE community wants to fix [1]. 
In summary, a non-malicious fuse daemon may need to allocate some memory when processing a FUSE_WRITE request (initiated from the writeback routine), in which case memory reclaim and compaction are triggered by the allocation, which in turn leads to waiting on the writeback of **FUSE** dirty pages (which itself waits for the fuse daemon to handle it) - a deadlock.

The current FUSE implementation fixes this by introducing a "temp page" in the writeback routine for FUSE. In short, a temporary page (allocated from ZONE_UNMOVABLE) is allocated for each dirty page cache page that needs to be written back. The content is copied from the original page cache page to the temporary page. The original page cache page (the one to be written back, allocated from ZONE_MOVABLE) then clears its PG_writeback bit immediately, so that the fuse daemon cannot get stuck in a deadlock waiting for the writeback of FUSE page cache. Instead, the actual writeback work is done on the cloned temporary page.

Thus there are actually two pages for each FUSE page cache page: one is the original FUSE page cache page (in ZONE_MOVABLE) and the other is the temporary page (in ZONE_UNMOVABLE).

- The original page cache page clears its PG_writeback bit very quickly in the writeback routine and won't block direct memory reclaim and compaction at all.
- As for the temporary page, in the normal case the fuse server will complete the FUSE_WRITE request as expected, and thus the temporary page will get freed soon.

However, FUSE supports unprivileged mounts, in which case the fuse daemon is run and mounted by an unprivileged user. Thus the backend fuse daemon may be malicious (started by an unprivileged user) and refuse to process any FUSE requests. In the worst case, these temporary pages will never complete writeback and will stay pinned in ZONE_UNMOVABLE forever.
(One thing worth noting is that once the fuse daemon gets killed, the whole FUSE filesystem is aborted, all inflight FUSE requests are flushed, and all the temporary pages are freed.)

What this patchset does is drop the temporary page design in the FUSE writeback routine, while this patch is introduced to avoid the above-mentioned deadlock for a *sane* FUSE daemon in memory compaction after dropping the temp page design.

Currently the FUSE writeback pages (i.e. FUSE page cache) are allocated with GFP_HIGHUSER_MOVABLE, which is consistent with other filesystems. In the normal case (the FUSE filesystem is backed by a well-behaved FUSE daemon), writeback of the page cache completes in a reasonable time and won't affect the usability of ZONE_MOVABLE. In the worst case (a malicious FUSE daemon run by an unprivileged user), these page cache pages in ZONE_MOVABLE can be pinned there indefinitely.

We can argue that in the current implementation (without this patch series), ZONE_UNMOVABLE can also grow larger and larger and pin quite a lot of memory (correct me if I'm wrong) in the worst case. To this degree, this patch doesn't make things worse. Besides, FUSE enables the strictlimit feature by default, in which each FUSE filesystem can consume at most 1% of the global vm.dirty_background_thresh before write throttling is triggered.

[1] https://lore.kernel.org/all/8eec0912-7a6c-4387-b9be-6718f438a111@linux.alibaba.com/

--
Thanks,
Jingbo

^ permalink raw reply [flat|nested] 124+ messages in thread
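For readers who want the two writeback designs described in the message above side by side, here is a rough userspace C model. It is only a sketch under stated assumptions, not kernel code: `model_page`, `model_wb_request`, `model_start_writeback()` and `model_server_reply()` are made-up names, `malloc()` stands in for the kernel page allocator, and locking, dirty accounting and the FUSE request queue are all left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096

/* Hypothetical stand-in for a page cache page; not a kernel structure. */
struct model_page {
    unsigned char data[MODEL_PAGE_SIZE];
    bool dirty;
    bool writeback;              /* models PG_writeback */
};

/* Hypothetical stand-in for one in-flight FUSE_WRITE request. */
struct model_wb_request {
    struct model_page *orig;     /* page cache page being written back */
    struct model_page *tmp;      /* temp copy, or NULL in the no-temp-page design */
};

/* Start writeback of 'page'. Returns 0 on success, -1 if the temp page
 * could not be allocated (the page is then left dirty). */
int model_start_writeback(struct model_wb_request *req,
                          struct model_page *page, bool use_tmp_page)
{
    assert(req != NULL && page != NULL);

    req->orig = page;
    req->tmp = NULL;
    page->dirty = false;
    page->writeback = true;

    if (use_tmp_page) {
        struct model_page *tmp = malloc(sizeof(*tmp));

        if (tmp == NULL) {
            page->dirty = true;
            page->writeback = false;
            return -1;
        }
        /* Temp page design: copy the data, then end writeback on the
         * original page right away so nobody waits on the server. */
        memcpy(tmp->data, page->data, MODEL_PAGE_SIZE);
        tmp->dirty = false;
        tmp->writeback = true;
        req->tmp = tmp;
        page->writeback = false;
    }
    return 0;
}

/* The server has replied to the write request. */
void model_server_reply(struct model_wb_request *req)
{
    assert(req != NULL && req->orig != NULL);

    if (req->tmp != NULL) {
        /* Temp page design: only the copy was pinned until now. */
        free(req->tmp);
        req->tmp = NULL;
    } else {
        /* No temp page: writeback on the original ends only here. */
        req->orig->writeback = false;
    }
}
```

In this model a server that never replies keeps the temp copy allocated forever in the first design, and keeps the writeback flag set on the original page forever in the second, which is the trade-off the thread is arguing about.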
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2024-12-19 13:05 ` David Hildenbrand 2024-12-19 14:19 ` Zi Yan 2024-12-19 15:43 ` Shakeel Butt @ 2025-04-02 21:34 ` Joanne Koong 2025-04-03 3:31 ` Jingbo Xu 2 siblings, 1 reply; 124+ messages in thread From: Joanne Koong @ 2025-04-02 21:34 UTC (permalink / raw) To: David Hildenbrand Cc: miklos, linux-fsdevel, shakeel.butt, jefflexu, josef, bernd.schubert, linux-mm, kernel-team, Matthew Wilcox, Zi Yan, Oscar Salvador, Michal Hocko On Thu, Dec 19, 2024 at 5:05 AM David Hildenbrand <david@redhat.com> wrote: > > On 23.11.24 00:23, Joanne Koong wrote: > > For migrations called in MIGRATE_SYNC mode, skip migrating the folio if > > it is under writeback and has the AS_WRITEBACK_INDETERMINATE flag set on its > > mapping. If the AS_WRITEBACK_INDETERMINATE flag is set on the mapping, the > > writeback may take an indeterminate amount of time to complete, and > > waits may get stuck. > > > > Signed-off-by: Joanne Koong <joannelkoong@gmail.com> > > Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> > > --- > > mm/migrate.c | 5 ++++- > > 1 file changed, 4 insertions(+), 1 deletion(-) > > > > diff --git a/mm/migrate.c b/mm/migrate.c > > index df91248755e4..fe73284e5246 100644 > > --- a/mm/migrate.c > > +++ b/mm/migrate.c > > @@ -1260,7 +1260,10 @@ static int migrate_folio_unmap(new_folio_t get_new_folio, > > */ > > switch (mode) { > > case MIGRATE_SYNC: > > - break; > > + if (!src->mapping || > > + !mapping_writeback_indeterminate(src->mapping)) > > + break; > > + fallthrough; > > default: > > rc = -EBUSY; > > goto out; > > Ehm, doesn't this mean that any fuse user can essentially completely > block CMA allocations, memory compaction, memory hotunplug, memory > poisoning... ?! > > That sounds very bad. I took a closer look at the migration code and the FUSE code. 
In the migration code in migrate_folio_unmap(), I see that in MIGRATE_SYNC mode, any folio lock held by someone else will block migration until that folio is unlocked. This is the snippet in migrate_folio_unmap() I'm looking at:

        if (!folio_trylock(src)) {
                if (mode == MIGRATE_ASYNC)
                        goto out;

                if (current->flags & PF_MEMALLOC)
                        goto out;

                if (mode == MIGRATE_SYNC_LIGHT && !folio_test_uptodate(src))
                        goto out;

                folio_lock(src);
        }

If this is all that is needed for a malicious FUSE server to block migration, then it makes no difference if AS_WRITEBACK_INDETERMINATE mappings are skipped in migration. A malicious server has easier and more powerful ways of blocking migration in FUSE than trying to do it through writeback. For a malicious fuse server, we in fact wouldn't even get far enough to hit writeback - a write triggers aops->write_begin() and a malicious server would deliberately hang forever while the folio is locked in write_begin().

I looked into whether we could eradicate all the places in FUSE where we may hold the folio lock for an indeterminate amount of time, because if that is possible, then we should not add this writeback way for a malicious fuse server to affect migration. But I don't think we can. Taking one case as an example, the folio lock needs to be held while we read in the folio from the server when servicing page faults; otherwise the page cache would contain stale data if there was a concurrent write that happened just before, which would lead to data corruption in the filesystem. Imo, we need a more encompassing solution for all these cases if we're serious about preventing FUSE from blocking migration, which probably looks like a globally enforced default timeout of some sort or an mm solution for mitigating the blast radius of how much memory can be blocked from migration, but that is outside the scope of this patchset and is its own standalone topic.
I don't see how this patch has any additional negative impact on memory migration for the case of malicious servers that the server can't already (and more easily) do. In fact, this patchset if anything helps memory given that malicious servers now can't also trigger page allocations for temp pages that would never get freed. Thanks, Joanne > > -- > Cheers, > > David / dhildenb > ^ permalink raw reply [flat|nested] 124+ messages in thread
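As a reading aid for the argument in the message above, the two checks it discusses (the folio lock snippet and the writeback switch from the patch) can be folded into one small decision function. This is a userspace sketch, not the kernel's migrate_folio_unmap(): the `model_*` names are invented for illustration, and the assumption that a plain MIGRATE_SYNC break is followed by a wait on writeback comes from my reading of mm/migrate.c rather than from the quoted diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's migrate modes. */
enum model_mode { MODEL_ASYNC, MODEL_SYNC_LIGHT, MODEL_SYNC };

enum model_outcome {
    MODEL_PROCEED,   /* migration continues past these checks */
    MODEL_SKIP,      /* bail out ("goto out" / -EBUSY in the kernel) */
    MODEL_BLOCK      /* caller sleeps until someone else lets go */
};

/* Hypothetical, flattened view of the folio state the checks look at. */
struct model_folio {
    bool locked_by_other;   /* folio_trylock() would fail */
    bool uptodate;
    bool writeback;
    bool has_mapping;
    bool wb_indeterminate;  /* AS_WRITEBACK_INDETERMINATE on the mapping */
};

enum model_outcome model_migrate_decision(const struct model_folio *src,
                                          enum model_mode mode,
                                          bool pf_memalloc)
{
    assert(src != NULL);

    /* Mirrors the folio lock snippet quoted in the message. */
    if (src->locked_by_other) {
        if (mode == MODEL_ASYNC)
            return MODEL_SKIP;
        if (pf_memalloc)
            return MODEL_SKIP;
        if (mode == MODEL_SYNC_LIGHT && !src->uptodate)
            return MODEL_SKIP;
        return MODEL_BLOCK;          /* folio_lock(src) sleeps */
    }

    /* Mirrors the writeback switch with the patch applied. */
    if (src->writeback) {
        if (mode == MODEL_SYNC &&
            (!src->has_mapping || !src->wb_indeterminate))
            return MODEL_BLOCK;      /* assumed: wait for writeback to end */
        return MODEL_SKIP;           /* -EBUSY */
    }

    return MODEL_PROCEED;
}
```

The model makes the point of the message visible: with the patch, a FUSE folio under writeback yields MODEL_SKIP even for MODEL_SYNC, but a folio whose lock is held (for example across a read from the server) still yields MODEL_BLOCK, independent of the new flag.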
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2025-04-02 21:34 ` Joanne Koong @ 2025-04-03 3:31 ` Jingbo Xu 2025-04-03 9:18 ` David Hildenbrand 0 siblings, 1 reply; 124+ messages in thread From: Jingbo Xu @ 2025-04-03 3:31 UTC (permalink / raw) To: Joanne Koong, David Hildenbrand Cc: miklos, linux-fsdevel, shakeel.butt, josef, bernd.schubert, linux-mm, kernel-team, Matthew Wilcox, Zi Yan, Oscar Salvador, Michal Hocko On 4/3/25 5:34 AM, Joanne Koong wrote: > On Thu, Dec 19, 2024 at 5:05 AM David Hildenbrand <david@redhat.com> wrote: >> >> On 23.11.24 00:23, Joanne Koong wrote: >>> For migrations called in MIGRATE_SYNC mode, skip migrating the folio if >>> it is under writeback and has the AS_WRITEBACK_INDETERMINATE flag set on its >>> mapping. If the AS_WRITEBACK_INDETERMINATE flag is set on the mapping, the >>> writeback may take an indeterminate amount of time to complete, and >>> waits may get stuck. >>> >>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com> >>> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> >>> --- >>> mm/migrate.c | 5 ++++- >>> 1 file changed, 4 insertions(+), 1 deletion(-) >>> >>> diff --git a/mm/migrate.c b/mm/migrate.c >>> index df91248755e4..fe73284e5246 100644 >>> --- a/mm/migrate.c >>> +++ b/mm/migrate.c >>> @@ -1260,7 +1260,10 @@ static int migrate_folio_unmap(new_folio_t get_new_folio, >>> */ >>> switch (mode) { >>> case MIGRATE_SYNC: >>> - break; >>> + if (!src->mapping || >>> + !mapping_writeback_indeterminate(src->mapping)) >>> + break; >>> + fallthrough; >>> default: >>> rc = -EBUSY; >>> goto out; >> >> Ehm, doesn't this mean that any fuse user can essentially completely >> block CMA allocations, memory compaction, memory hotunplug, memory >> poisoning... ?! >> >> That sounds very bad. > > I took a closer look at the migration code and the FUSE code. 
In the > migration code in migrate_folio_unmap(), I see that any MIGATE_SYNC > mode folio lock holds will block migration until that folio is > unlocked. This is the snippet in migrate_folio_unmap() I'm looking at: > > if (!folio_trylock(src)) { > if (mode == MIGRATE_ASYNC) > goto out; > > if (current->flags & PF_MEMALLOC) > goto out; > > if (mode == MIGRATE_SYNC_LIGHT && !folio_test_uptodate(src)) > goto out; > > folio_lock(src); > } > > If this is all that is needed for a malicious FUSE server to block > migration, then it makes no difference if AS_WRITEBACK_INDETERMINATE > mappings are skipped in migration. A malicious server has easier and > more powerful ways of blocking migration in FUSE than trying to do it > through writeback. For a malicious fuse server, we in fact wouldn't > even get far enough to hit writeback - a write triggers > aops->write_begin() and a malicious server would deliberately hang > forever while the folio is locked in write_begin(). Indeed it seems possible. A malicious FUSE server may already be capable of blocking the synchronous migration in this way. > > I looked into whether we could eradicate all the places in FUSE where > we may hold the folio lock for an indeterminate amount of time, > because if that is possible, then we should not add this writeback way > for a malicious fuse server to affect migration. But I don't think we > can, for example taking one case, the folio lock needs to be held as > we read in the folio from the server when servicing page faults, else > the page cache would contain stale data if there was a concurrent > write that happened just before, which would lead to data corruption > in the filesystem. 
Imo, we need a more encompassing solution for all
> these cases if we're serious about preventing FUSE from blocking
> migration, which probably looks like a globally enforced default
> timeout of some sort or an mm solution for mitigating the blast radius
> of how much memory can be blocked from migration, but that is outside
> the scope of this patchset and is its own standalone topic.
>
> I don't see how this patch has any additional negative impact on
> memory migration for the case of malicious servers that the server
> can't already (and more easily) do. In fact, this patchset if anything
> helps memory given that malicious servers now can't also trigger page
> allocations for temp pages that would never get freed.
>

If that's true, maybe we could drop this patch from this patchset? That way, both before and after this patchset, synchronous migration could be blocked by a malicious FUSE server, while the usability of contiguous memory (CMA) won't be affected.

--
Thanks,
Jingbo

^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2025-04-03 3:31 ` Jingbo Xu @ 2025-04-03 9:18 ` David Hildenbrand 2025-04-03 9:25 ` Bernd Schubert 2025-04-03 19:09 ` Joanne Koong 0 siblings, 2 replies; 124+ messages in thread From: David Hildenbrand @ 2025-04-03 9:18 UTC (permalink / raw) To: Jingbo Xu, Joanne Koong Cc: miklos, linux-fsdevel, shakeel.butt, josef, bernd.schubert, linux-mm, kernel-team, Matthew Wilcox, Zi Yan, Oscar Salvador, Michal Hocko On 03.04.25 05:31, Jingbo Xu wrote: > > > On 4/3/25 5:34 AM, Joanne Koong wrote: >> On Thu, Dec 19, 2024 at 5:05 AM David Hildenbrand <david@redhat.com> wrote: >>> >>> On 23.11.24 00:23, Joanne Koong wrote: >>>> For migrations called in MIGRATE_SYNC mode, skip migrating the folio if >>>> it is under writeback and has the AS_WRITEBACK_INDETERMINATE flag set on its >>>> mapping. If the AS_WRITEBACK_INDETERMINATE flag is set on the mapping, the >>>> writeback may take an indeterminate amount of time to complete, and >>>> waits may get stuck. >>>> >>>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com> >>>> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> >>>> --- >>>> mm/migrate.c | 5 ++++- >>>> 1 file changed, 4 insertions(+), 1 deletion(-) >>>> >>>> diff --git a/mm/migrate.c b/mm/migrate.c >>>> index df91248755e4..fe73284e5246 100644 >>>> --- a/mm/migrate.c >>>> +++ b/mm/migrate.c >>>> @@ -1260,7 +1260,10 @@ static int migrate_folio_unmap(new_folio_t get_new_folio, >>>> */ >>>> switch (mode) { >>>> case MIGRATE_SYNC: >>>> - break; >>>> + if (!src->mapping || >>>> + !mapping_writeback_indeterminate(src->mapping)) >>>> + break; >>>> + fallthrough; >>>> default: >>>> rc = -EBUSY; >>>> goto out; >>> >>> Ehm, doesn't this mean that any fuse user can essentially completely >>> block CMA allocations, memory compaction, memory hotunplug, memory >>> poisoning... ?! >>> >>> That sounds very bad. 
>> >> I took a closer look at the migration code and the FUSE code. In the >> migration code in migrate_folio_unmap(), I see that any MIGRATE_SYNC >> mode folio lock holds will block migration until that folio is >> unlocked. This is the snippet in migrate_folio_unmap() I'm looking at: >> >> if (!folio_trylock(src)) { >> if (mode == MIGRATE_ASYNC) >> goto out; >> >> if (current->flags & PF_MEMALLOC) >> goto out; >> >> if (mode == MIGRATE_SYNC_LIGHT && !folio_test_uptodate(src)) >> goto out; >> >> folio_lock(src); >> } >> Right, I raised that also in my LSF/MM talk: waiting for readahead currently implies waiting for the folio lock (there is no separate readahead flag like there would be for writeback). The more I look into this and fuse, the more I realize that what fuse does is just completely broken right now. >> If this is all that is needed for a malicious FUSE server to block >> migration, then it makes no difference if AS_WRITEBACK_INDETERMINATE >> mappings are skipped in migration. A malicious server has easier and >> more powerful ways of blocking migration in FUSE than trying to do it >> through writeback. For a malicious fuse server, we in fact wouldn't >> even get far enough to hit writeback - a write triggers >> aops->write_begin() and a malicious server would deliberately hang >> forever while the folio is locked in write_begin(). > > Indeed it seems possible. A malicious FUSE server may already be > capable of blocking the synchronous migration in this way. Yes, I think the conclusion is that we should advise people against using unprivileged FUSE if they care about any features that rely on page migration or page reclaim. > > >> >> I looked into whether we could eradicate all the places in FUSE where >> we may hold the folio lock for an indeterminate amount of time, >> because if that is possible, then we should not add this writeback way >> for a malicious fuse server to affect migration.
But I don't think we >> can, for example taking one case, the folio lock needs to be held as >> we read in the folio from the server when servicing page faults, else >> the page cache would contain stale data if there was a concurrent >> write that happened just before, which would lead to data corruption >> in the filesystem. Imo, we need a more encompassing solution for all >> these cases if we're serious about preventing FUSE from blocking >> migration, which probably looks like a globally enforced default >> timeout of some sort or an mm solution for mitigating the blast radius >> of how much memory can be blocked from migration, but that is outside >> the scope of this patchset and is its own standalone topic. I'm still skeptical about timeouts: we can only get it wrong. I think a proper solution is making these pages movable, which does seem feasible if (a) splice is not involved and (b) we can find a way to not hold the folio lock forever, e.g., in the readahead case. Maybe readahead would have to be handled more similarly to writeback (e.g., by having a separate flag, or by using a combination of flags such as writeback+uptodate; not sure). In both cases (readahead+writeback), we'd want to call into the FS to migrate a folio that is under readahead/writeback. In case of fuse without splice, a migration might be doable, and as discussed, splice might just be avoided. >> >> I don't see how this patch has any additional negative impact on >> memory migration for the case of malicious servers that the server >> can't already (and more easily) do. In fact, this patchset if anything >> helps memory given that malicious servers now can't also trigger page >> allocations for temp pages that would never get freed. >> > > If that's true, maybe we could drop this patch out of this patchset? So > that both before and after this patchset, synchronous migration could be > blocked by a malicious FUSE server, while the usability of contiguous > memory (CMA) won't be affected.
I had exactly the same thought: if we can block forever on the folio lock, there is no need for AS_WRITEBACK_INDETERMINATE. It's already all completely broken. -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 124+ messages in thread
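To make the folio-lock argument in the message above concrete, here is a minimal userspace sketch in C of the folio_trylock() fallback quoted from migrate_folio_unmap(). It is not kernel code and is only an illustration under stated assumptions: enum lock_outcome, model_folio_lock(), and its boolean parameters are hypothetical stand-ins for the real folio state, and the enum migrate_mode values merely reuse the names from the quoted snippet.

```c
#include <stdbool.h>

/* Userspace model only; the names mirror the quoted kernel snippet. */
enum migrate_mode { MIGRATE_ASYNC, MIGRATE_SYNC_LIGHT, MIGRATE_SYNC };

/* Hypothetical result type for this sketch (not a kernel type). */
enum lock_outcome {
    LOCK_TAKEN,   /* folio_trylock() succeeded                 */
    LOCK_SKIPPED, /* the "goto out" paths: folio is not waited */
    LOCK_BLOCKS   /* folio_lock(): sleeps until the holder unlocks */
};

/*
 * Model of the quoted fallback for a single folio.
 * lock_held:   another context (e.g. an in-flight FUSE request) holds the lock
 * pf_memalloc: models (current->flags & PF_MEMALLOC)
 * uptodate:    models folio_test_uptodate(src)
 */
static enum lock_outcome model_folio_lock(enum migrate_mode mode, bool lock_held,
                                          bool pf_memalloc, bool uptodate)
{
    if (!lock_held)
        return LOCK_TAKEN;
    if (mode == MIGRATE_ASYNC)
        return LOCK_SKIPPED;
    if (pf_memalloc)
        return LOCK_SKIPPED;
    if (mode == MIGRATE_SYNC_LIGHT && !uptodate)
        return LOCK_SKIPPED;
    /* MIGRATE_SYNC, or MIGRATE_SYNC_LIGHT on an uptodate folio. */
    return LOCK_BLOCKS;
}
```

In this model, a MIGRATE_SYNC caller that finds the lock held always ends up in LOCK_BLOCKS unless PF_MEMALLOC is set, which is the case the thread is concerned about: if the lock holder never unlocks, the wait has no bound.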
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2025-04-03 9:18 ` David Hildenbrand @ 2025-04-03 9:25 ` Bernd Schubert 2025-04-03 9:35 ` Christian Brauner 2025-04-03 19:09 ` Joanne Koong 1 sibling, 1 reply; 124+ messages in thread From: Bernd Schubert @ 2025-04-03 9:25 UTC (permalink / raw) To: David Hildenbrand, Jingbo Xu, Joanne Koong Cc: miklos, linux-fsdevel, shakeel.butt, josef, linux-mm, kernel-team, Matthew Wilcox, Zi Yan, Oscar Salvador, Michal Hocko, Keith Busch On 4/3/25 11:18, David Hildenbrand wrote: > On 03.04.25 05:31, Jingbo Xu wrote: >> >> >> On 4/3/25 5:34 AM, Joanne Koong wrote: >>> On Thu, Dec 19, 2024 at 5:05 AM David Hildenbrand <david@redhat.com> >>> wrote: >>>> >>>> On 23.11.24 00:23, Joanne Koong wrote: >>>>> For migrations called in MIGRATE_SYNC mode, skip migrating the >>>>> folio if >>>>> it is under writeback and has the AS_WRITEBACK_INDETERMINATE flag >>>>> set on its >>>>> mapping. If the AS_WRITEBACK_INDETERMINATE flag is set on the >>>>> mapping, the >>>>> writeback may take an indeterminate amount of time to complete, and >>>>> waits may get stuck. 
>>>>> >>>>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com> >>>>> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> >>>>> --- >>>>> mm/migrate.c | 5 ++++- >>>>> 1 file changed, 4 insertions(+), 1 deletion(-) >>>>> >>>>> diff --git a/mm/migrate.c b/mm/migrate.c >>>>> index df91248755e4..fe73284e5246 100644 >>>>> --- a/mm/migrate.c >>>>> +++ b/mm/migrate.c >>>>> @@ -1260,7 +1260,10 @@ static int migrate_folio_unmap(new_folio_t >>>>> get_new_folio, >>>>> */ >>>>> switch (mode) { >>>>> case MIGRATE_SYNC: >>>>> - break; >>>>> + if (!src->mapping || >>>>> + !mapping_writeback_indeterminate(src- >>>>> >mapping)) >>>>> + break; >>>>> + fallthrough; >>>>> default: >>>>> rc = -EBUSY; >>>>> goto out; >>>> >>>> Ehm, doesn't this mean that any fuse user can essentially completely >>>> block CMA allocations, memory compaction, memory hotunplug, memory >>>> poisoning... ?! >>>> >>>> That sounds very bad. >>> >>> I took a closer look at the migration code and the FUSE code. In the >>> migration code in migrate_folio_unmap(), I see that any MIGATE_SYNC >>> mode folio lock holds will block migration until that folio is >>> unlocked. This is the snippet in migrate_folio_unmap() I'm looking at: >>> >>> if (!folio_trylock(src)) { >>> if (mode == MIGRATE_ASYNC) >>> goto out; >>> >>> if (current->flags & PF_MEMALLOC) >>> goto out; >>> >>> if (mode == MIGRATE_SYNC_LIGHT && ! >>> folio_test_uptodate(src)) >>> goto out; >>> >>> folio_lock(src); >>> } >>> > > Right, I raised that also in my LSF/MM talk: waiting for readahead > currently implies waiting for the folio lock (there is no separate > readahead flag like there would be for writeback). > > The more I look into this and fuse, the more I realize that what fuse > does is just completely broken right now. > >>> If this is all that is needed for a malicious FUSE server to block >>> migration, then it makes no difference if AS_WRITEBACK_INDETERMINATE >>> mappings are skipped in migration. 
A malicious server has easier and >>> more powerful ways of blocking migration in FUSE than trying to do it >>> through writeback. For a malicious fuse server, we in fact wouldn't >>> even get far enough to hit writeback - a write triggers >>> aops->write_begin() and a malicious server would deliberately hang >>> forever while the folio is locked in write_begin(). >> >> Indeed it seems possible. A malicious FUSE server may already be >> capable of blocking the synchronous migration in this way. > > Yes, I think the conclusion is that we should advise people from not > using unprivileged FUSE if they care about any features that rely on > page migration or page reclaim. > >> >> >>> >>> I looked into whether we could eradicate all the places in FUSE where >>> we may hold the folio lock for an indeterminate amount of time, >>> because if that is possible, then we should not add this writeback way >>> for a malicious fuse server to affect migration. But I don't think we >>> can, for example taking one case, the folio lock needs to be held as >>> we read in the folio from the server when servicing page faults, else >>> the page cache would contain stale data if there was a concurrent >>> write that happened just before, which would lead to data corruption >>> in the filesystem. Imo, we need a more encompassing solution for all >>> these cases if we're serious about preventing FUSE from blocking >>> migration, which probably looks like a globally enforced default >>> timeout of some sort or an mm solution for mitigating the blast radius >>> of how much memory can be blocked from migration, but that is outside >>> the scope of this patchset and is its own standalone topic. > > I'm still skeptical about timeouts: we can only get it wrong. > > I think a proper solution is making these pages movable, which does seem > feasible if (a) splice is not involved and (b) we can find a way to not > hold the folio lock forever e.g., in the readahead case. 
> > Maybe readahead would have to be handled more similar to writeback > (e.g., having a separate flag, or using a combination of e.g., > writeback+uptodate flag, not sure) > > In both cases (readahead+writeback), we'd want to call into the FS to > migrate a folio that is under readahead/writeback. In case of fuse > without splice, a migration might be doable, and as discussed, splice > might just be avoided. My personal take here is that we should move away from splice. Keith (or a colleague) is working on ZC with io-uring anyway, so the timing may be good. We should just ensure that the new approach doesn't have the same issue. Thanks, Bernd ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2025-04-03 9:25 ` Bernd Schubert @ 2025-04-03 9:35 ` Christian Brauner 0 siblings, 0 replies; 124+ messages in thread From: Christian Brauner @ 2025-04-03 9:35 UTC (permalink / raw) To: Bernd Schubert Cc: David Hildenbrand, Jingbo Xu, Joanne Koong, miklos, linux-fsdevel, shakeel.butt, josef, linux-mm, kernel-team, Matthew Wilcox, Zi Yan, Oscar Salvador, Michal Hocko, Keith Busch On Thu, Apr 03, 2025 at 11:25:17AM +0200, Bernd Schubert wrote: > > > On 4/3/25 11:18, David Hildenbrand wrote: > > On 03.04.25 05:31, Jingbo Xu wrote: > >> > >> > >> On 4/3/25 5:34 AM, Joanne Koong wrote: > >>> On Thu, Dec 19, 2024 at 5:05 AM David Hildenbrand <david@redhat.com> > >>> wrote: > >>>> > >>>> On 23.11.24 00:23, Joanne Koong wrote: > >>>>> For migrations called in MIGRATE_SYNC mode, skip migrating the > >>>>> folio if > >>>>> it is under writeback and has the AS_WRITEBACK_INDETERMINATE flag > >>>>> set on its > >>>>> mapping. If the AS_WRITEBACK_INDETERMINATE flag is set on the > >>>>> mapping, the > >>>>> writeback may take an indeterminate amount of time to complete, and > >>>>> waits may get stuck. 
> >>>>> > >>>>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com> > >>>>> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> > >>>>> --- > >>>>> mm/migrate.c | 5 ++++- > >>>>> 1 file changed, 4 insertions(+), 1 deletion(-) > >>>>> > >>>>> diff --git a/mm/migrate.c b/mm/migrate.c > >>>>> index df91248755e4..fe73284e5246 100644 > >>>>> --- a/mm/migrate.c > >>>>> +++ b/mm/migrate.c > >>>>> @@ -1260,7 +1260,10 @@ static int migrate_folio_unmap(new_folio_t > >>>>> get_new_folio, > >>>>> */ > >>>>> switch (mode) { > >>>>> case MIGRATE_SYNC: > >>>>> - break; > >>>>> + if (!src->mapping || > >>>>> + !mapping_writeback_indeterminate(src- > >>>>> >mapping)) > >>>>> + break; > >>>>> + fallthrough; > >>>>> default: > >>>>> rc = -EBUSY; > >>>>> goto out; > >>>> > >>>> Ehm, doesn't this mean that any fuse user can essentially completely > >>>> block CMA allocations, memory compaction, memory hotunplug, memory > >>>> poisoning... ?! > >>>> > >>>> That sounds very bad. > >>> > >>> I took a closer look at the migration code and the FUSE code. In the > >>> migration code in migrate_folio_unmap(), I see that any MIGATE_SYNC > >>> mode folio lock holds will block migration until that folio is > >>> unlocked. This is the snippet in migrate_folio_unmap() I'm looking at: > >>> > >>> if (!folio_trylock(src)) { > >>> if (mode == MIGRATE_ASYNC) > >>> goto out; > >>> > >>> if (current->flags & PF_MEMALLOC) > >>> goto out; > >>> > >>> if (mode == MIGRATE_SYNC_LIGHT && ! > >>> folio_test_uptodate(src)) > >>> goto out; > >>> > >>> folio_lock(src); > >>> } > >>> > > > > Right, I raised that also in my LSF/MM talk: waiting for readahead > > currently implies waiting for the folio lock (there is no separate > > readahead flag like there would be for writeback). > > > > The more I look into this and fuse, the more I realize that what fuse > > does is just completely broken right now. 
> > > >>> If this is all that is needed for a malicious FUSE server to block > >>> migration, then it makes no difference if AS_WRITEBACK_INDETERMINATE > >>> mappings are skipped in migration. A malicious server has easier and > >>> more powerful ways of blocking migration in FUSE than trying to do it > >>> through writeback. For a malicious fuse server, we in fact wouldn't > >>> even get far enough to hit writeback - a write triggers > >>> aops->write_begin() and a malicious server would deliberately hang > >>> forever while the folio is locked in write_begin(). > >> > >> Indeed it seems possible. A malicious FUSE server may already be > >> capable of blocking the synchronous migration in this way. > > > > Yes, I think the conclusion is that we should advise people from not > > using unprivileged FUSE if they care about any features that rely on > > page migration or page reclaim. > > > >> > >> > >>> > >>> I looked into whether we could eradicate all the places in FUSE where > >>> we may hold the folio lock for an indeterminate amount of time, > >>> because if that is possible, then we should not add this writeback way > >>> for a malicious fuse server to affect migration. But I don't think we > >>> can, for example taking one case, the folio lock needs to be held as > >>> we read in the folio from the server when servicing page faults, else > >>> the page cache would contain stale data if there was a concurrent > >>> write that happened just before, which would lead to data corruption > >>> in the filesystem. Imo, we need a more encompassing solution for all > >>> these cases if we're serious about preventing FUSE from blocking > >>> migration, which probably looks like a globally enforced default > >>> timeout of some sort or an mm solution for mitigating the blast radius > >>> of how much memory can be blocked from migration, but that is outside > >>> the scope of this patchset and is its own standalone topic. 
> > > > I'm still skeptical about timeouts: we can only get it wrong. > > > > I think a proper solution is making these pages movable, which does seem > > feasible if (a) splice is not involved and (b) we can find a way to not > > hold the folio lock forever e.g., in the readahead case. > > > > Maybe readahead would have to be handled more similar to writeback > > (e.g., having a separate flag, or using a combination of e.g., > > writeback+uptodate flag, not sure) > > > > In both cases (readahead+writeback), we'd want to call into the FS to > > migrate a folio that is under readahread/writeback. In case of fuse > > without splice, a migration might be doable, and as discussed, splice > > might just be avoided. > > My personal take is here that we should move away from splice. > Keith (or colleague) is working on ZC with io-uring anyway, so > maybe a good timing. We should just ensure that the new approach > doesn't have the same issue. splice is problematic in a lot of other ways too. It's easy to abuse it for weird userspace hangs since it clings onto the pipe_lock() and no one wants to do the invasive surgery to wean it off of that. So +1 on avoiding splice. ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2025-04-03 9:18 ` David Hildenbrand 2025-04-03 9:25 ` Bernd Schubert @ 2025-04-03 19:09 ` Joanne Koong 2025-04-03 20:44 ` David Hildenbrand 1 sibling, 1 reply; 124+ messages in thread From: Joanne Koong @ 2025-04-03 19:09 UTC (permalink / raw) To: David Hildenbrand Cc: Jingbo Xu, miklos, linux-fsdevel, shakeel.butt, josef, bernd.schubert, linux-mm, kernel-team, Matthew Wilcox, Zi Yan, Oscar Salvador, Michal Hocko On Thu, Apr 3, 2025 at 2:18 AM David Hildenbrand <david@redhat.com> wrote: > > On 03.04.25 05:31, Jingbo Xu wrote: > > > > > > On 4/3/25 5:34 AM, Joanne Koong wrote: > >> On Thu, Dec 19, 2024 at 5:05 AM David Hildenbrand <david@redhat.com> wrote: > >>> > >>> On 23.11.24 00:23, Joanne Koong wrote: > >>>> For migrations called in MIGRATE_SYNC mode, skip migrating the folio if > >>>> it is under writeback and has the AS_WRITEBACK_INDETERMINATE flag set on its > >>>> mapping. If the AS_WRITEBACK_INDETERMINATE flag is set on the mapping, the > >>>> writeback may take an indeterminate amount of time to complete, and > >>>> waits may get stuck. 
> >>>> > >>>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com> > >>>> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> > >>>> --- > >>>> mm/migrate.c | 5 ++++- > >>>> 1 file changed, 4 insertions(+), 1 deletion(-) > >>>> > >>>> diff --git a/mm/migrate.c b/mm/migrate.c > >>>> index df91248755e4..fe73284e5246 100644 > >>>> --- a/mm/migrate.c > >>>> +++ b/mm/migrate.c > >>>> @@ -1260,7 +1260,10 @@ static int migrate_folio_unmap(new_folio_t get_new_folio, > >>>> */ > >>>> switch (mode) { > >>>> case MIGRATE_SYNC: > >>>> - break; > >>>> + if (!src->mapping || > >>>> + !mapping_writeback_indeterminate(src->mapping)) > >>>> + break; > >>>> + fallthrough; > >>>> default: > >>>> rc = -EBUSY; > >>>> goto out; > >>> > >>> Ehm, doesn't this mean that any fuse user can essentially completely > >>> block CMA allocations, memory compaction, memory hotunplug, memory > >>> poisoning... ?! > >>> > >>> That sounds very bad. > >> > >> I took a closer look at the migration code and the FUSE code. In the > >> migration code in migrate_folio_unmap(), I see that any MIGATE_SYNC > >> mode folio lock holds will block migration until that folio is > >> unlocked. This is the snippet in migrate_folio_unmap() I'm looking at: > >> > >> if (!folio_trylock(src)) { > >> if (mode == MIGRATE_ASYNC) > >> goto out; > >> > >> if (current->flags & PF_MEMALLOC) > >> goto out; > >> > >> if (mode == MIGRATE_SYNC_LIGHT && !folio_test_uptodate(src)) > >> goto out; > >> > >> folio_lock(src); > >> } > >> > > Right, I raised that also in my LSF/MM talk: waiting for readahead > currently implies waiting for the folio lock (there is no separate > readahead flag like there would be for writeback). > > The more I look into this and fuse, the more I realize that what fuse > does is just completely broken right now. > > >> If this is all that is needed for a malicious FUSE server to block > >> migration, then it makes no difference if AS_WRITEBACK_INDETERMINATE > >> mappings are skipped in migration. 
A malicious server has easier and > >> more powerful ways of blocking migration in FUSE than trying to do it > >> through writeback. For a malicious fuse server, we in fact wouldn't > >> even get far enough to hit writeback - a write triggers > >> aops->write_begin() and a malicious server would deliberately hang > >> forever while the folio is locked in write_begin(). > > > > Indeed it seems possible. A malicious FUSE server may already be > > capable of blocking the synchronous migration in this way. > > Yes, I think the conclusion is that we should advise people from not > using unprivileged FUSE if they care about any features that rely on > page migration or page reclaim. > > > > > > >> > >> I looked into whether we could eradicate all the places in FUSE where > >> we may hold the folio lock for an indeterminate amount of time, > >> because if that is possible, then we should not add this writeback way > >> for a malicious fuse server to affect migration. But I don't think we > >> can, for example taking one case, the folio lock needs to be held as > >> we read in the folio from the server when servicing page faults, else > >> the page cache would contain stale data if there was a concurrent > >> write that happened just before, which would lead to data corruption > >> in the filesystem. Imo, we need a more encompassing solution for all > >> these cases if we're serious about preventing FUSE from blocking > >> migration, which probably looks like a globally enforced default > >> timeout of some sort or an mm solution for mitigating the blast radius > >> of how much memory can be blocked from migration, but that is outside > >> the scope of this patchset and is its own standalone topic. > > I'm still skeptical about timeouts: we can only get it wrong. > > I think a proper solution is making these pages movable, which does seem > feasible if (a) splice is not involved and (b) we can find a way to not > hold the folio lock forever e.g., in the readahead case. 
> > Maybe readahead would have to be handled more similar to writeback > (e.g., having a separate flag, or using a combination of e.g., > writeback+uptodate flag, not sure) > > In both cases (readahead+writeback), we'd want to call into the FS to > migrate a folio that is under readahread/writeback. In case of fuse > without splice, a migration might be doable, and as discussed, splice > might just be avoided. > > >> > >> I don't see how this patch has any additional negative impact on > >> memory migration for the case of malicious servers that the server > >> can't already (and more easily) do. In fact, this patchset if anything > >> helps memory given that malicious servers now can't also trigger page > >> allocations for temp pages that would never get freed. > >> > > > > If that's true, maybe we could drop this patch out of this patchset? So > > that both before and after this patchset, synchronous migration could be > > blocked by a malicious FUSE server, while the usability of continuous > > memory (CMA) won't be affected. > > I had exactly the same thought: if we can block forever on the folio > lock, there is no need for AS_WRITEBACK_INDETERMINATE. It's already all > completely broken. I will resubmit this patchset and drop this patch. I think we still need AS_WRITEBACK_INDETERMINATE for sync and legacy cgroupv1 reclaim scenarios: a) sync: sync waits on writeback so if we don't skip waiting on writeback for AS_WRITEBACK_INDETERMINATE mappings, then malicious fuse servers could make syncs hang. 
(This doesn't actually change sync behavior compared to temp pages, though, because even with temp pages, sync returns even though the data hasn't actually been synced to disk by the server yet.) b) cgroupv1 reclaim: a correctly written fuse server can fall into this deadlock in one very specific scenario (eg if it's using legacy cgroupv1 and reclaim encounters a folio that already has the reclaim flag set and the caller didn't have __GFP_FS (or __GFP_IO if swap) set), where the deadlock is triggered by: * single-threaded FUSE server is in the middle of handling a request that needs a memory allocation * memory allocation triggers direct reclaim * direct reclaim waits on a folio under writeback * the FUSE server can't write back the folio since it's stuck in direct reclaim Thanks for the feedback and discussion, everyone. > > -- > Cheers, > > David / dhildenb > ^ permalink raw reply [flat|nested] 124+ messages in thread
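The sync case in the message above can be illustrated with a small userspace sketch in C. It is a model under stated assumptions, not the patchset's implementation: struct model_mapping, model_should_wait(), and model_count_waits() are hypothetical names, and the writeback_indeterminate field only stands in for the AS_WRITEBACK_INDETERMINATE mapping flag discussed in the thread.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in; this is not the kernel's struct address_space. */
struct model_mapping {
    bool writeback_indeterminate; /* models AS_WRITEBACK_INDETERMINATE */
    bool under_writeback;         /* some folio in the mapping is under writeback */
};

/* Returns true if a sync-style pass should wait on this mapping's writeback. */
static bool model_should_wait(const struct model_mapping *m)
{
    if (!m->under_writeback)
        return false;              /* nothing to wait for */
    /* Skip the wait when completion time is controlled by the server. */
    return !m->writeback_indeterminate;
}

/* Counts the mappings that a sync-style pass would block on. */
static size_t model_count_waits(const struct model_mapping *maps, size_t n)
{
    size_t waits = 0;

    for (size_t i = 0; i < n; i++)
        if (model_should_wait(&maps[i]))
            waits++;
    return waits;
}
```

With this model, a mapping that is under writeback and flagged as indeterminate never contributes a wait, so a server that never completes writeback cannot stall the pass; callers that need the data on disk would still have to rely on fsync(2)/fdatasync(2), as the cover letter notes.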
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings 2025-04-03 19:09 ` Joanne Koong @ 2025-04-03 20:44 ` David Hildenbrand 2025-04-03 22:04 ` Joanne Koong 0 siblings, 1 reply; 124+ messages in thread From: David Hildenbrand @ 2025-04-03 20:44 UTC (permalink / raw) To: Joanne Koong Cc: Jingbo Xu, miklos, linux-fsdevel, shakeel.butt, josef, bernd.schubert, linux-mm, kernel-team, Matthew Wilcox, Zi Yan, Oscar Salvador, Michal Hocko On 03.04.25 21:09, Joanne Koong wrote: > On Thu, Apr 3, 2025 at 2:18 AM David Hildenbrand <david@redhat.com> wrote: >> >> On 03.04.25 05:31, Jingbo Xu wrote: >>> >>> >>> On 4/3/25 5:34 AM, Joanne Koong wrote: >>>> On Thu, Dec 19, 2024 at 5:05 AM David Hildenbrand <david@redhat.com> wrote: >>>>> >>>>> On 23.11.24 00:23, Joanne Koong wrote: >>>>>> For migrations called in MIGRATE_SYNC mode, skip migrating the folio if >>>>>> it is under writeback and has the AS_WRITEBACK_INDETERMINATE flag set on its >>>>>> mapping. If the AS_WRITEBACK_INDETERMINATE flag is set on the mapping, the >>>>>> writeback may take an indeterminate amount of time to complete, and >>>>>> waits may get stuck. 
>>>>>> >>>>>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com> >>>>>> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> >>>>>> --- >>>>>> mm/migrate.c | 5 ++++- >>>>>> 1 file changed, 4 insertions(+), 1 deletion(-) >>>>>> >>>>>> diff --git a/mm/migrate.c b/mm/migrate.c >>>>>> index df91248755e4..fe73284e5246 100644 >>>>>> --- a/mm/migrate.c >>>>>> +++ b/mm/migrate.c >>>>>> @@ -1260,7 +1260,10 @@ static int migrate_folio_unmap(new_folio_t get_new_folio, >>>>>> */ >>>>>> switch (mode) { >>>>>> case MIGRATE_SYNC: >>>>>> - break; >>>>>> + if (!src->mapping || >>>>>> + !mapping_writeback_indeterminate(src->mapping)) >>>>>> + break; >>>>>> + fallthrough; >>>>>> default: >>>>>> rc = -EBUSY; >>>>>> goto out; >>>>> >>>>> Ehm, doesn't this mean that any fuse user can essentially completely >>>>> block CMA allocations, memory compaction, memory hotunplug, memory >>>>> poisoning... ?! >>>>> >>>>> That sounds very bad. >>>> >>>> I took a closer look at the migration code and the FUSE code. In the >>>> migration code in migrate_folio_unmap(), I see that any MIGATE_SYNC >>>> mode folio lock holds will block migration until that folio is >>>> unlocked. This is the snippet in migrate_folio_unmap() I'm looking at: >>>> >>>> if (!folio_trylock(src)) { >>>> if (mode == MIGRATE_ASYNC) >>>> goto out; >>>> >>>> if (current->flags & PF_MEMALLOC) >>>> goto out; >>>> >>>> if (mode == MIGRATE_SYNC_LIGHT && !folio_test_uptodate(src)) >>>> goto out; >>>> >>>> folio_lock(src); >>>> } >>>> >> >> Right, I raised that also in my LSF/MM talk: waiting for readahead >> currently implies waiting for the folio lock (there is no separate >> readahead flag like there would be for writeback). >> >> The more I look into this and fuse, the more I realize that what fuse >> does is just completely broken right now. >> >>>> If this is all that is needed for a malicious FUSE server to block >>>> migration, then it makes no difference if AS_WRITEBACK_INDETERMINATE >>>> mappings are skipped in migration. 
A malicious server has easier and >>>> more powerful ways of blocking migration in FUSE than trying to do it >>>> through writeback. For a malicious fuse server, we in fact wouldn't >>>> even get far enough to hit writeback - a write triggers >>>> aops->write_begin() and a malicious server would deliberately hang >>>> forever while the folio is locked in write_begin(). >>> >>> Indeed it seems possible. A malicious FUSE server may already be >>> capable of blocking the synchronous migration in this way. >> >> Yes, I think the conclusion is that we should advise people from not >> using unprivileged FUSE if they care about any features that rely on >> page migration or page reclaim. >> >>> >>> >>>> >>>> I looked into whether we could eradicate all the places in FUSE where >>>> we may hold the folio lock for an indeterminate amount of time, >>>> because if that is possible, then we should not add this writeback way >>>> for a malicious fuse server to affect migration. But I don't think we >>>> can, for example taking one case, the folio lock needs to be held as >>>> we read in the folio from the server when servicing page faults, else >>>> the page cache would contain stale data if there was a concurrent >>>> write that happened just before, which would lead to data corruption >>>> in the filesystem. Imo, we need a more encompassing solution for all >>>> these cases if we're serious about preventing FUSE from blocking >>>> migration, which probably looks like a globally enforced default >>>> timeout of some sort or an mm solution for mitigating the blast radius >>>> of how much memory can be blocked from migration, but that is outside >>>> the scope of this patchset and is its own standalone topic. >> >> I'm still skeptical about timeouts: we can only get it wrong. 
>> >> I think a proper solution is making these pages movable, which does seem >> feasible if (a) splice is not involved and (b) we can find a way to not >> hold the folio lock forever e.g., in the readahead case. >> >> Maybe readahead would have to be handled more similar to writeback >> (e.g., having a separate flag, or using a combination of e.g., >> writeback+uptodate flag, not sure) >> >> In both cases (readahead+writeback), we'd want to call into the FS to >> migrate a folio that is under readahread/writeback. In case of fuse >> without splice, a migration might be doable, and as discussed, splice >> might just be avoided. >> >>>> >>>> I don't see how this patch has any additional negative impact on >>>> memory migration for the case of malicious servers that the server >>>> can't already (and more easily) do. In fact, this patchset if anything >>>> helps memory given that malicious servers now can't also trigger page >>>> allocations for temp pages that would never get freed. >>>> >>> >>> If that's true, maybe we could drop this patch out of this patchset? So >>> that both before and after this patchset, synchronous migration could be >>> blocked by a malicious FUSE server, while the usability of continuous >>> memory (CMA) won't be affected. >> >> I had exactly the same thought: if we can block forever on the folio >> lock, there is no need for AS_WRITEBACK_INDETERMINATE. It's already all >> completely broken. > > I will resubmit this patchset and drop this patch. > > I think we still need AS_WRITEBACK_INDETERMINATE for sync and legacy > cgroupv1 reclaim scenarios: > a) sync: sync waits on writeback so if we don't skip waiting on > writeback for AS_WRITEBACK_INDETERMINATE mappings, then malicious fuse > servers could make syncs hang. 
(There's no actual effect on sync > behavior though with temp pages because even without temp pages, we > return even though the data hasn't actually been synced to disk by the > server yet) Just curious: Are we sure there are no other cases where a malicious userspace could make some other folio_lock() hang forever either way? IOW, just like for migration, isn't this just solving one part of the whole problem we are facing? > > b) cgroupv1 reclaim: a correctly written fuse server can fall into > this deadlock in one very specific scenario (eg if it's using legacy > cgroupv1 and reclaim encounters a folio that already has the reclaim > flag set and the caller didn't have __GFP_FS (or __GFP_IO if swap) > set), where the deadlock is triggered by: > * single-threaded FUSE server is in the middle of handling a request > that needs a memory allocation > * memory allocation triggers direct reclaim > * direct reclaim waits on a folio under writeback > * the FUSE server can't write back the folio since it's stuck in direct reclaim Yes, that sounds reasonable. -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
  2025-04-03 20:44 ` David Hildenbrand
@ 2025-04-03 22:04 ` Joanne Koong
  0 siblings, 0 replies; 124+ messages in thread
From: Joanne Koong @ 2025-04-03 22:04 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Jingbo Xu, miklos, linux-fsdevel, shakeel.butt, josef, bernd.schubert, linux-mm, kernel-team, Matthew Wilcox, Zi Yan, Oscar Salvador, Michal Hocko

On Thu, Apr 3, 2025 at 1:44 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 03.04.25 21:09, Joanne Koong wrote:
> > On Thu, Apr 3, 2025 at 2:18 AM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 03.04.25 05:31, Jingbo Xu wrote:
> >>>
> >>>
> >>> On 4/3/25 5:34 AM, Joanne Koong wrote:
> >>>> On Thu, Dec 19, 2024 at 5:05 AM David Hildenbrand <david@redhat.com> wrote:
> >>>>>
> >>>>> On 23.11.24 00:23, Joanne Koong wrote:
> >>>>>> For migrations called in MIGRATE_SYNC mode, skip migrating the folio if
> >>>>>> it is under writeback and has the AS_WRITEBACK_INDETERMINATE flag set on its
> >>>>>> mapping. If the AS_WRITEBACK_INDETERMINATE flag is set on the mapping, the
> >>>>>> writeback may take an indeterminate amount of time to complete, and
> >>>>>> waits may get stuck.
> >>>>>>
> >>>>>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> >>>>>> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
> >>>>>> ---
> >>>>>> mm/migrate.c | 5 ++++-
> >>>>>> 1 file changed, 4 insertions(+), 1 deletion(-)
> >>>>>>
> >>>>> Ehm, doesn't this mean that any fuse user can essentially completely
> >>>>> block CMA allocations, memory compaction, memory hotunplug, memory
> >>>>> poisoning... ?!
> >>>>>
> >>>>> That sounds very bad.
> >>>>
> >>>> I took a closer look at the migration code and the FUSE code. In the
> >>>> migration code in migrate_folio_unmap(), I see that any MIGRATE_SYNC
> >>>> mode folio lock holds will block migration until that folio is
> >>>> unlocked. This is the snippet in migrate_folio_unmap() I'm looking at:
> >>>>
> >>>>	if (!folio_trylock(src)) {
> >>>>		if (mode == MIGRATE_ASYNC)
> >>>>			goto out;
> >>>>
> >>>>		if (current->flags & PF_MEMALLOC)
> >>>>			goto out;
> >>>>
> >>>>		if (mode == MIGRATE_SYNC_LIGHT && !folio_test_uptodate(src))
> >>>>			goto out;
> >>>>
> >>>>		folio_lock(src);
> >>>>	}
> >>>>
> >>
> >> Right, I raised that also in my LSF/MM talk: waiting for readahead
> >> currently implies waiting for the folio lock (there is no separate
> >> readahead flag like there would be for writeback).
> >>
> >> The more I look into this and fuse, the more I realize that what fuse
> >> does is just completely broken right now.
> >>
> >>>> If this is all that is needed for a malicious FUSE server to block
> >>>> migration, then it makes no difference if AS_WRITEBACK_INDETERMINATE
> >>>> mappings are skipped in migration. A malicious server has easier and
> >>>> more powerful ways of blocking migration in FUSE than trying to do it
> >>>> through writeback. For a malicious fuse server, we in fact wouldn't
> >>>> even get far enough to hit writeback - a write triggers
> >>>> aops->write_begin() and a malicious server would deliberately hang
> >>>> forever while the folio is locked in write_begin().
> >>>
> >>> Indeed it seems possible. A malicious FUSE server may already be
> >>> capable of blocking the synchronous migration in this way.
> >>
> >> Yes, I think the conclusion is that we should advise people against
> >> using unprivileged FUSE if they care about any features that rely on
> >> page migration or page reclaim.
> >>
> >>>
> >>>
> >>>>
> >>>> I looked into whether we could eradicate all the places in FUSE where
> >>>> we may hold the folio lock for an indeterminate amount of time,
> >>>> because if that is possible, then we should not add this writeback way
> >>>> for a malicious fuse server to affect migration. But I don't think we
> >>>> can, for example taking one case, the folio lock needs to be held as
> >>>> we read in the folio from the server when servicing page faults, else
> >>>> the page cache would contain stale data if there was a concurrent
> >>>> write that happened just before, which would lead to data corruption
> >>>> in the filesystem. Imo, we need a more encompassing solution for all
> >>>> these cases if we're serious about preventing FUSE from blocking
> >>>> migration, which probably looks like a globally enforced default
> >>>> timeout of some sort or an mm solution for mitigating the blast radius
> >>>> of how much memory can be blocked from migration, but that is outside
> >>>> the scope of this patchset and is its own standalone topic.
> >>
> >> I'm still skeptical about timeouts: we can only get it wrong.
> >>
> >> I think a proper solution is making these pages movable, which does seem
> >> feasible if (a) splice is not involved and (b) we can find a way to not
> >> hold the folio lock forever e.g., in the readahead case.
> >>
> >> Maybe readahead would have to be handled more similar to writeback
> >> (e.g., having a separate flag, or using a combination of e.g.,
> >> writeback+uptodate flag, not sure)
> >>
> >> In both cases (readahead+writeback), we'd want to call into the FS to
> >> migrate a folio that is under readahead/writeback. In case of fuse
> >> without splice, a migration might be doable, and as discussed, splice
> >> might just be avoided.
> >>
> >>>>
> >>>> I don't see how this patch has any additional negative impact on
> >>>> memory migration for the case of malicious servers that the server
> >>>> can't already (and more easily) do. In fact, this patchset if anything
> >>>> helps memory given that malicious servers now can't also trigger page
> >>>> allocations for temp pages that would never get freed.
> >>>>
> >>>
> >>> If that's true, maybe we could drop this patch out of this patchset? So
> >>> that both before and after this patchset, synchronous migration could be
> >>> blocked by a malicious FUSE server, while the usability of contiguous
> >>> memory (CMA) won't be affected.
> >>
> >> I had exactly the same thought: if we can block forever on the folio
> >> lock, there is no need for AS_WRITEBACK_INDETERMINATE. It's already all
> >> completely broken.
> >
> > I will resubmit this patchset and drop this patch.
> >
> > I think we still need AS_WRITEBACK_INDETERMINATE for sync and legacy
> > cgroupv1 reclaim scenarios:
> > a) sync: sync waits on writeback so if we don't skip waiting on
> > writeback for AS_WRITEBACK_INDETERMINATE mappings, then malicious fuse
> > servers could make syncs hang. (There's no actual effect on sync
> > behavior though with temp pages because even without temp pages, we
> > return even though the data hasn't actually been synced to disk by the
> > server yet)
>
> Just curious: Are we sure there are no other cases where a malicious
> userspace could make some other folio_lock() hang forever either way?
>

Unfortunately, there's an awful case where kswapd may get blocked
waiting for the folio lock. We encountered this in prod last week from
a well-intentioned but incorrectly written FUSE server that got stuck.
The stack trace was:

366 kswapd0 D
folio_wait_bit_common.llvm.15141953522965195141
truncate_inode_pages_range
fuse_evict_inode
evict
_dentry_kill
shrink_dentry_list
prune_dcache_sb
super_cache_scan
do_shrink_slab
shrink_slab
kswapd
kthread
ret_from_fork
ret_from_fork_asm

which was narrowed down to the __filemap_get_folio(..., FGP_LOCK, ...)
call in truncate_inode_pages_range().

I'm working on a fix for this for kswapd and planning to also do a
broader audit for other places where we might get tripped up from fuse
forever holding a folio lock. I'm going to look more into the
long-term fuse fix too - the first step will be documenting all the
places currently where a lock may be forever held.

> IOW, just like for migration, isn't this just solving one part of the
> whole problem we are facing?

For sync, I didn't see any folio lock acquires anywhere but I just
noticed that fuse's .sync_fs() implementation will block until a
server replies, so yes a malicious server could still hold up sync
regardless of temp pages or not. I'll drop the sync patch too in v7.

Thanks,
Joanne

>
> >
> > b) cgroupv1 reclaim: a correctly written fuse server can fall into
> > this deadlock in one very specific scenario (eg if it's using legacy
> > cgroupv1 and reclaim encounters a folio that already has the reclaim
> > flag set and the caller didn't have __GFP_FS (or __GFP_IO if swap)
> > set), where the deadlock is triggered by:
> > * single-threaded FUSE server is in the middle of handling a request
> > that needs a memory allocation
> > * memory allocation triggers direct reclaim
> > * direct reclaim waits on a folio under writeback
> > * the FUSE server can't write back the folio since it's stuck in direct reclaim
>
> Yes, that sounds reasonable.
>
> --
> Cheers,
>
> David / dhildenb
>

^ permalink raw reply	[flat|nested] 124+ messages in thread
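The folio-lock path from migrate_folio_unmap() quoted in the message above can be modelled in plain userspace C to show which callers bail out and which ones sleep. This is only a sketch under stated assumptions: `model_lock_decision()`, `struct model_ctx` and the `MODEL_*` names are hypothetical and exist only for illustration, they are not kernel APIs, and the logic mirrors nothing beyond the quoted snippet.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the migration modes named in the snippet. */
enum model_mode { MODEL_ASYNC, MODEL_SYNC_LIGHT, MODEL_SYNC };

/* What a caller that wants the folio lock ends up doing. */
enum model_outcome {
	MODEL_LOCKED,	/* the trylock succeeded */
	MODEL_BAIL,	/* "goto out": this folio is skipped */
	MODEL_BLOCK	/* folio_lock(): sleeps until the holder unlocks */
};

/* Hypothetical snapshot of the state the quoted snippet inspects. */
struct model_ctx {
	bool lock_held_by_other;	/* folio_trylock() would fail */
	bool in_memalloc;		/* current->flags & PF_MEMALLOC */
	bool uptodate;			/* folio_test_uptodate() */
};

/* Same order of checks as the quoted snippet, nothing more. */
static enum model_outcome model_lock_decision(enum model_mode mode,
					      const struct model_ctx *ctx)
{
	assert(ctx != NULL);

	if (!ctx->lock_held_by_other)
		return MODEL_LOCKED;
	if (mode == MODEL_ASYNC)
		return MODEL_BAIL;
	if (ctx->in_memalloc)
		return MODEL_BAIL;
	if (mode == MODEL_SYNC_LIGHT && !ctx->uptodate)
		return MODEL_BAIL;
	/* Unbounded wait if the lock holder never makes progress. */
	return MODEL_BLOCK;
}
```

In this model a MIGRATE_SYNC caller that finds the lock held always ends in `MODEL_BLOCK` unless it runs with PF_MEMALLOC, which is the case the thread is concerned about when the lock holder is waiting on a FUSE server.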
* [PATCH v6 5/5] fuse: remove tmp folio for writebacks and internal rb tree
  2024-11-22 23:23 [PATCH v6 0/5] fuse: remove temp page copies in writeback Joanne Koong
                   ` (3 preceding siblings ...)
  2024-11-22 23:23 ` [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with " Joanne Koong
@ 2024-11-22 23:23 ` Joanne Koong
  2024-11-25  9:46   ` Jingbo Xu
  2024-12-12 21:55 ` [PATCH v6 0/5] fuse: remove temp page copies in writeback Joanne Koong
  2024-12-13 11:52 ` Miklos Szeredi
  6 siblings, 1 reply; 124+ messages in thread
From: Joanne Koong @ 2024-11-22 23:23 UTC (permalink / raw)
  To: miklos, linux-fsdevel
  Cc: shakeel.butt, jefflexu, josef, bernd.schubert, linux-mm, kernel-team

In the current FUSE writeback design (see commit 3be5a52b30aa
("fuse: support writable mmap")), a temp page is allocated for every
dirty page to be written back, the contents of the dirty page are copied over
to the temp page, and the temp page gets handed to the server to write back.

This is done so that writeback may be immediately cleared on the dirty page,
and this in turn is done for two reasons:
a) in order to mitigate the following deadlock scenario that may arise
if reclaim waits on writeback on the dirty page to complete:
* single-threaded FUSE server is in the middle of handling a request
  that needs a memory allocation
* memory allocation triggers direct reclaim
* direct reclaim waits on a folio under writeback
* the FUSE server can't write back the folio since it's stuck in
  direct reclaim
b) in order to unblock internal (eg sync, page compaction) waits on
writeback without needing the server to complete writing back to disk,
which may take an indeterminate amount of time.

With a recent change that added AS_WRITEBACK_INDETERMINATE and mitigates
the situations described above, FUSE writeback does not need to use
temp pages if it sets AS_WRITEBACK_INDETERMINATE on its inode mappings.

This commit sets AS_WRITEBACK_INDETERMINATE on the inode mappings
and removes the temporary pages + extra copying and the internal rb
tree.

fio benchmarks --
(using averages observed from 10 runs, throwing away outliers)

Setup:
sudo mount -t tmpfs -o size=30G tmpfs ~/tmp_mount
./libfuse/build/example/passthrough_ll -o writeback -o max_threads=4 -o source=~/tmp_mount ~/fuse_mount

fio --name=writeback --ioengine=sync --rw=write --bs={1k,4k,1M} --size=2G
--numjobs=2 --ramp_time=30 --group_reporting=1 --directory=/root/fuse_mount

bs =      1k            4k            1M
Before    351 MiB/s     1818 MiB/s    1851 MiB/s
After     341 MiB/s     2246 MiB/s    2685 MiB/s
% diff    -3%           23%           45%

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
---
 fs/fuse/file.c   | 360 ++++-------------------------------------------
 fs/fuse/fuse_i.h |   3 -
 2 files changed, 28 insertions(+), 335 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 88d0946b5bc9..1970d1a699a6 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -415,89 +415,11 @@ u64 fuse_lock_owner_id(struct fuse_conn *fc, fl_owner_t id)

 struct fuse_writepage_args {
 	struct fuse_io_args ia;
-	struct rb_node writepages_entry;
 	struct list_head queue_entry;
-	struct fuse_writepage_args *next;
 	struct inode *inode;
 	struct fuse_sync_bucket *bucket;
 };

-static struct fuse_writepage_args *fuse_find_writeback(struct fuse_inode *fi,
-					    pgoff_t idx_from, pgoff_t idx_to)
-{
-	struct rb_node *n;
-
-	n = fi->writepages.rb_node;
-
-	while (n) {
-		struct fuse_writepage_args *wpa;
-		pgoff_t curr_index;
-
-		wpa = rb_entry(n, struct fuse_writepage_args, writepages_entry);
-		WARN_ON(get_fuse_inode(wpa->inode) != fi);
-		curr_index = wpa->ia.write.in.offset >> PAGE_SHIFT;
-		if (idx_from >= curr_index + wpa->ia.ap.num_folios)
-			n = n->rb_right;
-		else if (idx_to < curr_index)
-			n = n->rb_left;
-		else
-			return wpa;
-	}
-	return NULL;
-}
-
-/*
- * Check if any page in a range is under writeback
- */
-static bool fuse_range_is_writeback(struct inode *inode, pgoff_t idx_from,
-				   pgoff_t idx_to)
-{ - struct fuse_inode *fi = get_fuse_inode(inode); - bool found; - - if (RB_EMPTY_ROOT(&fi->writepages)) - return false; - - spin_lock(&fi->lock); - found = fuse_find_writeback(fi, idx_from, idx_to); - spin_unlock(&fi->lock); - - return found; -} - -static inline bool fuse_page_is_writeback(struct inode *inode, pgoff_t index) -{ - return fuse_range_is_writeback(inode, index, index); -} - -/* - * Wait for page writeback to be completed. - * - * Since fuse doesn't rely on the VM writeback tracking, this has to - * use some other means. - */ -static void fuse_wait_on_page_writeback(struct inode *inode, pgoff_t index) -{ - struct fuse_inode *fi = get_fuse_inode(inode); - - wait_event(fi->page_waitq, !fuse_page_is_writeback(inode, index)); -} - -static inline bool fuse_folio_is_writeback(struct inode *inode, - struct folio *folio) -{ - pgoff_t last = folio_next_index(folio) - 1; - return fuse_range_is_writeback(inode, folio_index(folio), last); -} - -static void fuse_wait_on_folio_writeback(struct inode *inode, - struct folio *folio) -{ - struct fuse_inode *fi = get_fuse_inode(inode); - - wait_event(fi->page_waitq, !fuse_folio_is_writeback(inode, folio)); -} - /* * Wait for all pending writepages on the inode to finish. * @@ -886,13 +808,6 @@ static int fuse_do_readfolio(struct file *file, struct folio *folio) ssize_t res; u64 attr_ver; - /* - * With the temporary pages that are used to complete writeback, we can - * have writeback that extends beyond the lifetime of the folio. So - * make sure we read a properly synced folio. 
- */ - fuse_wait_on_folio_writeback(inode, folio); - attr_ver = fuse_get_attr_version(fm->fc); /* Don't overflow end offset */ @@ -1003,17 +918,12 @@ static void fuse_send_readpages(struct fuse_io_args *ia, struct file *file) static void fuse_readahead(struct readahead_control *rac) { struct inode *inode = rac->mapping->host; - struct fuse_inode *fi = get_fuse_inode(inode); struct fuse_conn *fc = get_fuse_conn(inode); unsigned int max_pages, nr_pages; - pgoff_t first = readahead_index(rac); - pgoff_t last = first + readahead_count(rac) - 1; if (fuse_is_bad(inode)) return; - wait_event(fi->page_waitq, !fuse_range_is_writeback(inode, first, last)); - max_pages = min_t(unsigned int, fc->max_pages, fc->max_read / PAGE_SIZE); @@ -1172,7 +1082,7 @@ static ssize_t fuse_send_write_pages(struct fuse_io_args *ia, int err; for (i = 0; i < ap->num_folios; i++) - fuse_wait_on_folio_writeback(inode, ap->folios[i]); + folio_wait_writeback(ap->folios[i]); fuse_write_args_fill(ia, ff, pos, count); ia->write.in.flags = fuse_write_flags(iocb); @@ -1622,7 +1532,7 @@ ssize_t fuse_direct_io(struct fuse_io_priv *io, struct iov_iter *iter, return res; } } - if (!cuse && fuse_range_is_writeback(inode, idx_from, idx_to)) { + if (!cuse && filemap_range_has_writeback(mapping, pos, (pos + count - 1))) { if (!write) inode_lock(inode); fuse_sync_writes(inode); @@ -1819,38 +1729,34 @@ static ssize_t fuse_splice_write(struct pipe_inode_info *pipe, struct file *out, static void fuse_writepage_free(struct fuse_writepage_args *wpa) { struct fuse_args_pages *ap = &wpa->ia.ap; - int i; if (wpa->bucket) fuse_sync_bucket_dec(wpa->bucket); - for (i = 0; i < ap->num_folios; i++) - folio_put(ap->folios[i]); - fuse_file_put(wpa->ia.ff, false); kfree(ap->folios); kfree(wpa); } -static void fuse_writepage_finish_stat(struct inode *inode, struct folio *folio) -{ - struct backing_dev_info *bdi = inode_to_bdi(inode); - - dec_wb_stat(&bdi->wb, WB_WRITEBACK); - node_stat_sub_folio(folio, NR_WRITEBACK_TEMP); - 
wb_writeout_inc(&bdi->wb); -} - static void fuse_writepage_finish(struct fuse_writepage_args *wpa) { struct fuse_args_pages *ap = &wpa->ia.ap; struct inode *inode = wpa->inode; struct fuse_inode *fi = get_fuse_inode(inode); + struct backing_dev_info *bdi = inode_to_bdi(inode); int i; - for (i = 0; i < ap->num_folios; i++) - fuse_writepage_finish_stat(inode, ap->folios[i]); + for (i = 0; i < ap->num_folios; i++) { + /* + * Benchmarks showed that ending writeback within the + * scope of the fi->lock alleviates xarray lock + * contention and noticeably improves performance. + */ + folio_end_writeback(ap->folios[i]); + dec_wb_stat(&bdi->wb, WB_WRITEBACK); + wb_writeout_inc(&bdi->wb); + } wake_up(&fi->page_waitq); } @@ -1861,7 +1767,6 @@ static void fuse_send_writepage(struct fuse_mount *fm, __releases(fi->lock) __acquires(fi->lock) { - struct fuse_writepage_args *aux, *next; struct fuse_inode *fi = get_fuse_inode(wpa->inode); struct fuse_write_in *inarg = &wpa->ia.write.in; struct fuse_args *args = &wpa->ia.ap.args; @@ -1898,19 +1803,8 @@ __acquires(fi->lock) out_free: fi->writectr--; - rb_erase(&wpa->writepages_entry, &fi->writepages); fuse_writepage_finish(wpa); spin_unlock(&fi->lock); - - /* After rb_erase() aux request list is private */ - for (aux = wpa->next; aux; aux = next) { - next = aux->next; - aux->next = NULL; - fuse_writepage_finish_stat(aux->inode, - aux->ia.ap.folios[0]); - fuse_writepage_free(aux); - } - fuse_writepage_free(wpa); spin_lock(&fi->lock); } @@ -1938,43 +1832,6 @@ __acquires(fi->lock) } } -static struct fuse_writepage_args *fuse_insert_writeback(struct rb_root *root, - struct fuse_writepage_args *wpa) -{ - pgoff_t idx_from = wpa->ia.write.in.offset >> PAGE_SHIFT; - pgoff_t idx_to = idx_from + wpa->ia.ap.num_folios - 1; - struct rb_node **p = &root->rb_node; - struct rb_node *parent = NULL; - - WARN_ON(!wpa->ia.ap.num_folios); - while (*p) { - struct fuse_writepage_args *curr; - pgoff_t curr_index; - - parent = *p; - curr = rb_entry(parent, 
struct fuse_writepage_args, - writepages_entry); - WARN_ON(curr->inode != wpa->inode); - curr_index = curr->ia.write.in.offset >> PAGE_SHIFT; - - if (idx_from >= curr_index + curr->ia.ap.num_folios) - p = &(*p)->rb_right; - else if (idx_to < curr_index) - p = &(*p)->rb_left; - else - return curr; - } - - rb_link_node(&wpa->writepages_entry, parent, p); - rb_insert_color(&wpa->writepages_entry, root); - return NULL; -} - -static void tree_insert(struct rb_root *root, struct fuse_writepage_args *wpa) -{ - WARN_ON(fuse_insert_writeback(root, wpa)); -} - static void fuse_writepage_end(struct fuse_mount *fm, struct fuse_args *args, int error) { @@ -1994,41 +1851,6 @@ static void fuse_writepage_end(struct fuse_mount *fm, struct fuse_args *args, if (!fc->writeback_cache) fuse_invalidate_attr_mask(inode, FUSE_STATX_MODIFY); spin_lock(&fi->lock); - rb_erase(&wpa->writepages_entry, &fi->writepages); - while (wpa->next) { - struct fuse_mount *fm = get_fuse_mount(inode); - struct fuse_write_in *inarg = &wpa->ia.write.in; - struct fuse_writepage_args *next = wpa->next; - - wpa->next = next->next; - next->next = NULL; - tree_insert(&fi->writepages, next); - - /* - * Skip fuse_flush_writepages() to make it easy to crop requests - * based on primary request size. - * - * 1st case (trivial): there are no concurrent activities using - * fuse_set/release_nowrite. Then we're on safe side because - * fuse_flush_writepages() would call fuse_send_writepage() - * anyway. - * - * 2nd case: someone called fuse_set_nowrite and it is waiting - * now for completion of all in-flight requests. This happens - * rarely and no more than once per page, so this should be - * okay. - * - * 3rd case: someone (e.g. fuse_do_setattr()) is in the middle - * of fuse_set_nowrite..fuse_release_nowrite section. The fact - * that fuse_set_nowrite returned implies that all in-flight - * requests were completed along with all of their secondary - * requests. 
Further primary requests are blocked by negative - * writectr. Hence there cannot be any in-flight requests and - * no invocations of fuse_writepage_end() while we're in - * fuse_set_nowrite..fuse_release_nowrite section. - */ - fuse_send_writepage(fm, next, inarg->offset + inarg->size); - } fi->writectr--; fuse_writepage_finish(wpa); spin_unlock(&fi->lock); @@ -2115,19 +1937,16 @@ static void fuse_writepage_add_to_bucket(struct fuse_conn *fc, } static void fuse_writepage_args_page_fill(struct fuse_writepage_args *wpa, struct folio *folio, - struct folio *tmp_folio, uint32_t folio_index) + uint32_t folio_index) { struct inode *inode = folio->mapping->host; struct fuse_args_pages *ap = &wpa->ia.ap; - folio_copy(tmp_folio, folio); - - ap->folios[folio_index] = tmp_folio; + ap->folios[folio_index] = folio; ap->descs[folio_index].offset = 0; ap->descs[folio_index].length = PAGE_SIZE; inc_wb_stat(&inode_to_bdi(inode)->wb, WB_WRITEBACK); - node_stat_add_folio(tmp_folio, NR_WRITEBACK_TEMP); } static struct fuse_writepage_args *fuse_writepage_args_setup(struct folio *folio, @@ -2162,18 +1981,12 @@ static int fuse_writepage_locked(struct folio *folio) struct fuse_inode *fi = get_fuse_inode(inode); struct fuse_writepage_args *wpa; struct fuse_args_pages *ap; - struct folio *tmp_folio; struct fuse_file *ff; - int error = -ENOMEM; + int error = -EIO; - tmp_folio = folio_alloc(GFP_NOFS | __GFP_HIGHMEM, 0); - if (!tmp_folio) - goto err; - - error = -EIO; ff = fuse_write_file_get(fi); if (!ff) - goto err_nofile; + goto err; wpa = fuse_writepage_args_setup(folio, ff); error = -ENOMEM; @@ -2184,22 +1997,17 @@ static int fuse_writepage_locked(struct folio *folio) ap->num_folios = 1; folio_start_writeback(folio); - fuse_writepage_args_page_fill(wpa, folio, tmp_folio, 0); + fuse_writepage_args_page_fill(wpa, folio, 0); spin_lock(&fi->lock); - tree_insert(&fi->writepages, wpa); list_add_tail(&wpa->queue_entry, &fi->queued_writes); fuse_flush_writepages(inode); spin_unlock(&fi->lock); - 
folio_end_writeback(folio); - return 0; err_writepage_args: fuse_file_put(ff, false); -err_nofile: - folio_put(tmp_folio); err: mapping_set_error(folio->mapping, error); return error; @@ -2209,7 +2017,6 @@ struct fuse_fill_wb_data { struct fuse_writepage_args *wpa; struct fuse_file *ff; struct inode *inode; - struct folio **orig_folios; unsigned int max_folios; }; @@ -2244,69 +2051,11 @@ static void fuse_writepages_send(struct fuse_fill_wb_data *data) struct fuse_writepage_args *wpa = data->wpa; struct inode *inode = data->inode; struct fuse_inode *fi = get_fuse_inode(inode); - int num_folios = wpa->ia.ap.num_folios; - int i; spin_lock(&fi->lock); list_add_tail(&wpa->queue_entry, &fi->queued_writes); fuse_flush_writepages(inode); spin_unlock(&fi->lock); - - for (i = 0; i < num_folios; i++) - folio_end_writeback(data->orig_folios[i]); -} - -/* - * Check under fi->lock if the page is under writeback, and insert it onto the - * rb_tree if not. Otherwise iterate auxiliary write requests, to see if there's - * one already added for a page at this offset. If there's none, then insert - * this new request onto the auxiliary list, otherwise reuse the existing one by - * swapping the new temp page with the old one. 
- */ -static bool fuse_writepage_add(struct fuse_writepage_args *new_wpa, - struct folio *folio) -{ - struct fuse_inode *fi = get_fuse_inode(new_wpa->inode); - struct fuse_writepage_args *tmp; - struct fuse_writepage_args *old_wpa; - struct fuse_args_pages *new_ap = &new_wpa->ia.ap; - - WARN_ON(new_ap->num_folios != 0); - new_ap->num_folios = 1; - - spin_lock(&fi->lock); - old_wpa = fuse_insert_writeback(&fi->writepages, new_wpa); - if (!old_wpa) { - spin_unlock(&fi->lock); - return true; - } - - for (tmp = old_wpa->next; tmp; tmp = tmp->next) { - pgoff_t curr_index; - - WARN_ON(tmp->inode != new_wpa->inode); - curr_index = tmp->ia.write.in.offset >> PAGE_SHIFT; - if (curr_index == folio->index) { - WARN_ON(tmp->ia.ap.num_folios != 1); - swap(tmp->ia.ap.folios[0], new_ap->folios[0]); - break; - } - } - - if (!tmp) { - new_wpa->next = old_wpa->next; - old_wpa->next = new_wpa; - } - - spin_unlock(&fi->lock); - - if (tmp) { - fuse_writepage_finish_stat(new_wpa->inode, - folio); - fuse_writepage_free(new_wpa); - } - - return false; } static bool fuse_writepage_need_send(struct fuse_conn *fc, struct folio *folio, @@ -2315,15 +2064,6 @@ static bool fuse_writepage_need_send(struct fuse_conn *fc, struct folio *folio, { WARN_ON(!ap->num_folios); - /* - * Being under writeback is unlikely but possible. For example direct - * read to an mmaped fuse file will set the page dirty twice; once when - * the pages are faulted with get_user_pages(), and then after the read - * completed. - */ - if (fuse_folio_is_writeback(data->inode, folio)) - return true; - /* Reached max pages */ if (ap->num_folios == fc->max_pages) return true; @@ -2333,7 +2073,7 @@ static bool fuse_writepage_need_send(struct fuse_conn *fc, struct folio *folio, return true; /* Discontinuity */ - if (data->orig_folios[ap->num_folios - 1]->index + 1 != folio_index(folio)) + if (ap->folios[ap->num_folios - 1]->index + 1 != folio_index(folio)) return true; /* Need to grow the pages array? 
If so, did the expansion fail? */ @@ -2352,7 +2092,6 @@ static int fuse_writepages_fill(struct folio *folio, struct inode *inode = data->inode; struct fuse_inode *fi = get_fuse_inode(inode); struct fuse_conn *fc = get_fuse_conn(inode); - struct folio *tmp_folio; int err; if (!data->ff) { @@ -2367,54 +2106,23 @@ static int fuse_writepages_fill(struct folio *folio, data->wpa = NULL; } - err = -ENOMEM; - tmp_folio = folio_alloc(GFP_NOFS | __GFP_HIGHMEM, 0); - if (!tmp_folio) - goto out_unlock; - - /* - * The page must not be redirtied until the writeout is completed - * (i.e. userspace has sent a reply to the write request). Otherwise - * there could be more than one temporary page instance for each real - * page. - * - * This is ensured by holding the page lock in page_mkwrite() while - * checking fuse_page_is_writeback(). We already hold the page lock - * since clear_page_dirty_for_io() and keep it held until we add the - * request to the fi->writepages list and increment ap->num_folios. - * After this fuse_page_is_writeback() will indicate that the page is - * under writeback, so we can release the page lock. - */ if (data->wpa == NULL) { err = -ENOMEM; wpa = fuse_writepage_args_setup(folio, data->ff); - if (!wpa) { - folio_put(tmp_folio); + if (!wpa) goto out_unlock; - } fuse_file_get(wpa->ia.ff); data->max_folios = 1; ap = &wpa->ia.ap; } folio_start_writeback(folio); - fuse_writepage_args_page_fill(wpa, folio, tmp_folio, ap->num_folios); - data->orig_folios[ap->num_folios] = folio; + fuse_writepage_args_page_fill(wpa, folio, ap->num_folios); err = 0; - if (data->wpa) { - /* - * Protected by fi->lock against concurrent access by - * fuse_page_is_writeback(). 
- */ - spin_lock(&fi->lock); - ap->num_folios++; - spin_unlock(&fi->lock); - } else if (fuse_writepage_add(wpa, folio)) { + ap->num_folios++; + if (!data->wpa) data->wpa = wpa; - } else { - folio_end_writeback(folio); - } out_unlock: folio_unlock(folio); @@ -2441,13 +2149,6 @@ static int fuse_writepages(struct address_space *mapping, data.wpa = NULL; data.ff = NULL; - err = -ENOMEM; - data.orig_folios = kcalloc(fc->max_pages, - sizeof(struct folio *), - GFP_NOFS); - if (!data.orig_folios) - goto out; - err = write_cache_pages(mapping, wbc, fuse_writepages_fill, &data); if (data.wpa) { WARN_ON(!data.wpa->ia.ap.num_folios); @@ -2456,7 +2157,6 @@ static int fuse_writepages(struct address_space *mapping, if (data.ff) fuse_file_put(data.ff, false); - kfree(data.orig_folios); out: return err; } @@ -2481,8 +2181,6 @@ static int fuse_write_begin(struct file *file, struct address_space *mapping, if (IS_ERR(folio)) goto error; - fuse_wait_on_page_writeback(mapping->host, folio->index); - if (folio_test_uptodate(folio) || len >= folio_size(folio)) goto success; /* @@ -2545,13 +2243,9 @@ static int fuse_launder_folio(struct folio *folio) { int err = 0; if (folio_clear_dirty_for_io(folio)) { - struct inode *inode = folio->mapping->host; - - /* Serialize with pending writeback for the same page */ - fuse_wait_on_page_writeback(inode, folio->index); err = fuse_writepage_locked(folio); if (!err) - fuse_wait_on_page_writeback(inode, folio->index); + folio_wait_writeback(folio); } return err; } @@ -2595,7 +2289,7 @@ static vm_fault_t fuse_page_mkwrite(struct vm_fault *vmf) return VM_FAULT_NOPAGE; } - fuse_wait_on_folio_writeback(inode, folio); + folio_wait_writeback(folio); return VM_FAULT_LOCKED; } @@ -3413,9 +3107,12 @@ static const struct address_space_operations fuse_file_aops = { void fuse_init_file_inode(struct inode *inode, unsigned int flags) { struct fuse_inode *fi = get_fuse_inode(inode); + struct fuse_conn *fc = get_fuse_conn(inode); inode->i_fop = &fuse_file_operations; 
inode->i_data.a_ops = &fuse_file_aops; + if (fc->writeback_cache) + mapping_set_writeback_indeterminate(&inode->i_data); INIT_LIST_HEAD(&fi->write_files); INIT_LIST_HEAD(&fi->queued_writes); @@ -3423,7 +3120,6 @@ void fuse_init_file_inode(struct inode *inode, unsigned int flags) fi->iocachectr = 0; init_waitqueue_head(&fi->page_waitq); init_waitqueue_head(&fi->direct_io_waitq); - fi->writepages = RB_ROOT; if (IS_ENABLED(CONFIG_FUSE_DAX)) fuse_dax_inode_init(inode, flags); diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h index 74744c6f2860..23736c5c64c1 100644 --- a/fs/fuse/fuse_i.h +++ b/fs/fuse/fuse_i.h @@ -141,9 +141,6 @@ struct fuse_inode { /* waitq for direct-io completion */ wait_queue_head_t direct_io_waitq; - - /* List of writepage requestst (pending or sent) */ - struct rb_root writepages; }; /* readdir cache (directory only) */ -- 2.43.5 ^ permalink raw reply related [flat|nested] 124+ messages in thread
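The helpers deleted by the patch above (fuse_find_writeback() and its callers) tracked in-flight temp-page writes in a per-inode rb tree keyed by page index range; after the patch the diff uses folio_wait_writeback() and filemap_range_has_writeback() instead. As a rough illustration of the range lookup that goes away, here is a userspace sketch. It is not kernel code: `struct model_req` and `model_find_writeback()` are hypothetical names, and a binary search over a sorted array of non-overlapping requests stands in for the rb-tree descent. Only the three-way comparison is taken from the removed function.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for one in-flight write request. */
struct model_req {
	unsigned long first_index;	/* like wpa->ia.write.in.offset >> PAGE_SHIFT */
	unsigned long num_folios;	/* like wpa->ia.ap.num_folios */
};

/*
 * Return a request overlapping the inclusive range [idx_from, idx_to],
 * or NULL if there is none. reqs[] is assumed to be sorted by
 * first_index and to contain no overlapping entries.
 */
static const struct model_req *model_find_writeback(const struct model_req *reqs,
						    size_t nr,
						    unsigned long idx_from,
						    unsigned long idx_to)
{
	size_t lo = 0, hi = nr;

	assert(idx_from <= idx_to);

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		const struct model_req *cur = &reqs[mid];

		if (idx_from >= cur->first_index + cur->num_folios)
			lo = mid + 1;	/* query lies entirely to the right */
		else if (idx_to < cur->first_index)
			hi = mid;	/* query lies entirely to the left */
		else
			return cur;	/* the two ranges overlap */
	}
	return NULL;
}
```

With requests covering pages 0..3, 8..9 and 16, a query for pages 4..7 finds nothing while a query for page 3 returns the first request. In the removed kernel code the same comparison ran under fi->lock on every check.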
* Re: [PATCH v6 5/5] fuse: remove tmp folio for writebacks and internal rb tree 2024-11-22 23:23 ` [PATCH v6 5/5] fuse: remove tmp folio for writebacks and internal rb tree Joanne Koong @ 2024-11-25 9:46 ` Jingbo Xu 0 siblings, 0 replies; 124+ messages in thread From: Jingbo Xu @ 2024-11-25 9:46 UTC (permalink / raw) To: Joanne Koong, miklos, linux-fsdevel Cc: shakeel.butt, josef, bernd.schubert, linux-mm, kernel-team On 11/23/24 7:23 AM, Joanne Koong wrote: > In the current FUSE writeback design (see commit 3be5a52b30aa > ("fuse: support writable mmap")), a temp page is allocated for every > dirty page to be written back, the contents of the dirty page are copied over > to the temp page, and the temp page gets handed to the server to write back. > > This is done so that writeback may be immediately cleared on the dirty page, > and this in turn is done for two reasons: > a) in order to mitigate the following deadlock scenario that may arise > if reclaim waits on writeback on the dirty page to complete: > * single-threaded FUSE server is in the middle of handling a request > that needs a memory allocation > * memory allocation triggers direct reclaim > * direct reclaim waits on a folio under writeback > * the FUSE server can't write back the folio since it's stuck in > direct reclaim > b) in order to unblock internal (eg sync, page compaction) waits on > writeback without needing the server to complete writing back to disk, > which may take an indeterminate amount of time. > > With a recent change that added AS_WRITEBACK_INDETERMINATE and mitigates > the situations described above, FUSE writeback does not need to use > temp pages if it sets AS_WRITEBACK_INDETERMINATE on its inode mappings. > > This commit sets AS_WRITEBACK_INDETERMINATE on the inode mappings > and removes the temporary pages + extra copying and the internal rb > tree. 
> > fio benchmarks -- > (using averages observed from 10 runs, throwing away outliers) > > Setup: > sudo mount -t tmpfs -o size=30G tmpfs ~/tmp_mount > ./libfuse/build/example/passthrough_ll -o writeback -o max_threads=4 -o source=~/tmp_mount ~/fuse_mount > > fio --name=writeback --ioengine=sync --rw=write --bs={1k,4k,1M} --size=2G > --numjobs=2 --ramp_time=30 --group_reporting=1 --directory=/root/fuse_mount > > bs = 1k 4k 1M > Before 351 MiB/s 1818 MiB/s 1851 MiB/s > After 341 MiB/s 2246 MiB/s 2685 MiB/s > % diff -3% 23% 45% > > Signed-off-by: Joanne Koong <joannelkoong@gmail.com> LGTM. Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com> > --- > fs/fuse/file.c | 360 ++++------------------------------------------- > fs/fuse/fuse_i.h | 3 - > 2 files changed, 28 insertions(+), 335 deletions(-) > > diff --git a/fs/fuse/file.c b/fs/fuse/file.c > index 88d0946b5bc9..1970d1a699a6 100644 > --- a/fs/fuse/file.c > +++ b/fs/fuse/file.c > @@ -415,89 +415,11 @@ u64 fuse_lock_owner_id(struct fuse_conn *fc, fl_owner_t id) > > struct fuse_writepage_args { > struct fuse_io_args ia; > - struct rb_node writepages_entry; > struct list_head queue_entry; > - struct fuse_writepage_args *next; > struct inode *inode; > struct fuse_sync_bucket *bucket; > }; > > -static struct fuse_writepage_args *fuse_find_writeback(struct fuse_inode *fi, > - pgoff_t idx_from, pgoff_t idx_to) > -{ > - struct rb_node *n; > - > - n = fi->writepages.rb_node; > - > - while (n) { > - struct fuse_writepage_args *wpa; > - pgoff_t curr_index; > - > - wpa = rb_entry(n, struct fuse_writepage_args, writepages_entry); > - WARN_ON(get_fuse_inode(wpa->inode) != fi); > - curr_index = wpa->ia.write.in.offset >> PAGE_SHIFT; > - if (idx_from >= curr_index + wpa->ia.ap.num_folios) > - n = n->rb_right; > - else if (idx_to < curr_index) > - n = n->rb_left; > - else > - return wpa; > - } > - return NULL; > -} > - > -/* > - * Check if any page in a range is under writeback > - */ > -static bool fuse_range_is_writeback(struct 
inode *inode, pgoff_t idx_from, > - pgoff_t idx_to) > -{ > - struct fuse_inode *fi = get_fuse_inode(inode); > - bool found; > - > - if (RB_EMPTY_ROOT(&fi->writepages)) > - return false; > - > - spin_lock(&fi->lock); > - found = fuse_find_writeback(fi, idx_from, idx_to); > - spin_unlock(&fi->lock); > - > - return found; > -} > - > -static inline bool fuse_page_is_writeback(struct inode *inode, pgoff_t index) > -{ > - return fuse_range_is_writeback(inode, index, index); > -} > - > -/* > - * Wait for page writeback to be completed. > - * > - * Since fuse doesn't rely on the VM writeback tracking, this has to > - * use some other means. > - */ > -static void fuse_wait_on_page_writeback(struct inode *inode, pgoff_t index) > -{ > - struct fuse_inode *fi = get_fuse_inode(inode); > - > - wait_event(fi->page_waitq, !fuse_page_is_writeback(inode, index)); > -} > - > -static inline bool fuse_folio_is_writeback(struct inode *inode, > - struct folio *folio) > -{ > - pgoff_t last = folio_next_index(folio) - 1; > - return fuse_range_is_writeback(inode, folio_index(folio), last); > -} > - > -static void fuse_wait_on_folio_writeback(struct inode *inode, > - struct folio *folio) > -{ > - struct fuse_inode *fi = get_fuse_inode(inode); > - > - wait_event(fi->page_waitq, !fuse_folio_is_writeback(inode, folio)); > -} > - > /* > * Wait for all pending writepages on the inode to finish. > * > @@ -886,13 +808,6 @@ static int fuse_do_readfolio(struct file *file, struct folio *folio) > ssize_t res; > u64 attr_ver; > > - /* > - * With the temporary pages that are used to complete writeback, we can > - * have writeback that extends beyond the lifetime of the folio. So > - * make sure we read a properly synced folio. 
> - */ > - fuse_wait_on_folio_writeback(inode, folio); > - > attr_ver = fuse_get_attr_version(fm->fc); > > /* Don't overflow end offset */ > @@ -1003,17 +918,12 @@ static void fuse_send_readpages(struct fuse_io_args *ia, struct file *file) > static void fuse_readahead(struct readahead_control *rac) > { > struct inode *inode = rac->mapping->host; > - struct fuse_inode *fi = get_fuse_inode(inode); > struct fuse_conn *fc = get_fuse_conn(inode); > unsigned int max_pages, nr_pages; > - pgoff_t first = readahead_index(rac); > - pgoff_t last = first + readahead_count(rac) - 1; > > if (fuse_is_bad(inode)) > return; > > - wait_event(fi->page_waitq, !fuse_range_is_writeback(inode, first, last)); > - > max_pages = min_t(unsigned int, fc->max_pages, > fc->max_read / PAGE_SIZE); > > @@ -1172,7 +1082,7 @@ static ssize_t fuse_send_write_pages(struct fuse_io_args *ia, > int err; > > for (i = 0; i < ap->num_folios; i++) > - fuse_wait_on_folio_writeback(inode, ap->folios[i]); > + folio_wait_writeback(ap->folios[i]); > > fuse_write_args_fill(ia, ff, pos, count); > ia->write.in.flags = fuse_write_flags(iocb); > @@ -1622,7 +1532,7 @@ ssize_t fuse_direct_io(struct fuse_io_priv *io, struct iov_iter *iter, > return res; > } > } > - if (!cuse && fuse_range_is_writeback(inode, idx_from, idx_to)) { > + if (!cuse && filemap_range_has_writeback(mapping, pos, (pos + count - 1))) { > if (!write) > inode_lock(inode); > fuse_sync_writes(inode); > @@ -1819,38 +1729,34 @@ static ssize_t fuse_splice_write(struct pipe_inode_info *pipe, struct file *out, > static void fuse_writepage_free(struct fuse_writepage_args *wpa) > { > struct fuse_args_pages *ap = &wpa->ia.ap; > - int i; > > if (wpa->bucket) > fuse_sync_bucket_dec(wpa->bucket); > > - for (i = 0; i < ap->num_folios; i++) > - folio_put(ap->folios[i]); > - > fuse_file_put(wpa->ia.ff, false); > > kfree(ap->folios); > kfree(wpa); > } > > -static void fuse_writepage_finish_stat(struct inode *inode, struct folio *folio) > -{ > - struct backing_dev_info 
*bdi = inode_to_bdi(inode); > - > - dec_wb_stat(&bdi->wb, WB_WRITEBACK); > - node_stat_sub_folio(folio, NR_WRITEBACK_TEMP); > - wb_writeout_inc(&bdi->wb); > -} > - > static void fuse_writepage_finish(struct fuse_writepage_args *wpa) > { > struct fuse_args_pages *ap = &wpa->ia.ap; > struct inode *inode = wpa->inode; > struct fuse_inode *fi = get_fuse_inode(inode); > + struct backing_dev_info *bdi = inode_to_bdi(inode); > int i; > > - for (i = 0; i < ap->num_folios; i++) > - fuse_writepage_finish_stat(inode, ap->folios[i]); > + for (i = 0; i < ap->num_folios; i++) { > + /* > + * Benchmarks showed that ending writeback within the > + * scope of the fi->lock alleviates xarray lock > + * contention and noticeably improves performance. > + */ > + folio_end_writeback(ap->folios[i]); > + dec_wb_stat(&bdi->wb, WB_WRITEBACK); > + wb_writeout_inc(&bdi->wb); > + } > > wake_up(&fi->page_waitq); > } > @@ -1861,7 +1767,6 @@ static void fuse_send_writepage(struct fuse_mount *fm, > __releases(fi->lock) > __acquires(fi->lock) > { > - struct fuse_writepage_args *aux, *next; > struct fuse_inode *fi = get_fuse_inode(wpa->inode); > struct fuse_write_in *inarg = &wpa->ia.write.in; > struct fuse_args *args = &wpa->ia.ap.args; > @@ -1898,19 +1803,8 @@ __acquires(fi->lock) > > out_free: > fi->writectr--; > - rb_erase(&wpa->writepages_entry, &fi->writepages); > fuse_writepage_finish(wpa); > spin_unlock(&fi->lock); > - > - /* After rb_erase() aux request list is private */ > - for (aux = wpa->next; aux; aux = next) { > - next = aux->next; > - aux->next = NULL; > - fuse_writepage_finish_stat(aux->inode, > - aux->ia.ap.folios[0]); > - fuse_writepage_free(aux); > - } > - > fuse_writepage_free(wpa); > spin_lock(&fi->lock); > } > @@ -1938,43 +1832,6 @@ __acquires(fi->lock) > } > } > > -static struct fuse_writepage_args *fuse_insert_writeback(struct rb_root *root, > - struct fuse_writepage_args *wpa) > -{ > - pgoff_t idx_from = wpa->ia.write.in.offset >> PAGE_SHIFT; > - pgoff_t idx_to = idx_from + 
wpa->ia.ap.num_folios - 1; > - struct rb_node **p = &root->rb_node; > - struct rb_node *parent = NULL; > - > - WARN_ON(!wpa->ia.ap.num_folios); > - while (*p) { > - struct fuse_writepage_args *curr; > - pgoff_t curr_index; > - > - parent = *p; > - curr = rb_entry(parent, struct fuse_writepage_args, > - writepages_entry); > - WARN_ON(curr->inode != wpa->inode); > - curr_index = curr->ia.write.in.offset >> PAGE_SHIFT; > - > - if (idx_from >= curr_index + curr->ia.ap.num_folios) > - p = &(*p)->rb_right; > - else if (idx_to < curr_index) > - p = &(*p)->rb_left; > - else > - return curr; > - } > - > - rb_link_node(&wpa->writepages_entry, parent, p); > - rb_insert_color(&wpa->writepages_entry, root); > - return NULL; > -} > - > -static void tree_insert(struct rb_root *root, struct fuse_writepage_args *wpa) > -{ > - WARN_ON(fuse_insert_writeback(root, wpa)); > -} > - > static void fuse_writepage_end(struct fuse_mount *fm, struct fuse_args *args, > int error) > { > @@ -1994,41 +1851,6 @@ static void fuse_writepage_end(struct fuse_mount *fm, struct fuse_args *args, > if (!fc->writeback_cache) > fuse_invalidate_attr_mask(inode, FUSE_STATX_MODIFY); > spin_lock(&fi->lock); > - rb_erase(&wpa->writepages_entry, &fi->writepages); > - while (wpa->next) { > - struct fuse_mount *fm = get_fuse_mount(inode); > - struct fuse_write_in *inarg = &wpa->ia.write.in; > - struct fuse_writepage_args *next = wpa->next; > - > - wpa->next = next->next; > - next->next = NULL; > - tree_insert(&fi->writepages, next); > - > - /* > - * Skip fuse_flush_writepages() to make it easy to crop requests > - * based on primary request size. > - * > - * 1st case (trivial): there are no concurrent activities using > - * fuse_set/release_nowrite. Then we're on safe side because > - * fuse_flush_writepages() would call fuse_send_writepage() > - * anyway. > - * > - * 2nd case: someone called fuse_set_nowrite and it is waiting > - * now for completion of all in-flight requests. 
This happens > - * rarely and no more than once per page, so this should be > - * okay. > - * > - * 3rd case: someone (e.g. fuse_do_setattr()) is in the middle > - * of fuse_set_nowrite..fuse_release_nowrite section. The fact > - * that fuse_set_nowrite returned implies that all in-flight > - * requests were completed along with all of their secondary > - * requests. Further primary requests are blocked by negative > - * writectr. Hence there cannot be any in-flight requests and > - * no invocations of fuse_writepage_end() while we're in > - * fuse_set_nowrite..fuse_release_nowrite section. > - */ > - fuse_send_writepage(fm, next, inarg->offset + inarg->size); > - } > fi->writectr--; > fuse_writepage_finish(wpa); > spin_unlock(&fi->lock); > @@ -2115,19 +1937,16 @@ static void fuse_writepage_add_to_bucket(struct fuse_conn *fc, > } > > static void fuse_writepage_args_page_fill(struct fuse_writepage_args *wpa, struct folio *folio, > - struct folio *tmp_folio, uint32_t folio_index) > + uint32_t folio_index) > { > struct inode *inode = folio->mapping->host; > struct fuse_args_pages *ap = &wpa->ia.ap; > > - folio_copy(tmp_folio, folio); > - > - ap->folios[folio_index] = tmp_folio; > + ap->folios[folio_index] = folio; > ap->descs[folio_index].offset = 0; > ap->descs[folio_index].length = PAGE_SIZE; > > inc_wb_stat(&inode_to_bdi(inode)->wb, WB_WRITEBACK); > - node_stat_add_folio(tmp_folio, NR_WRITEBACK_TEMP); > } > > static struct fuse_writepage_args *fuse_writepage_args_setup(struct folio *folio, > @@ -2162,18 +1981,12 @@ static int fuse_writepage_locked(struct folio *folio) > struct fuse_inode *fi = get_fuse_inode(inode); > struct fuse_writepage_args *wpa; > struct fuse_args_pages *ap; > - struct folio *tmp_folio; > struct fuse_file *ff; > - int error = -ENOMEM; > + int error = -EIO; > > - tmp_folio = folio_alloc(GFP_NOFS | __GFP_HIGHMEM, 0); > - if (!tmp_folio) > - goto err; > - > - error = -EIO; > ff = fuse_write_file_get(fi); > if (!ff) > - goto err_nofile; > + goto 
err; > > wpa = fuse_writepage_args_setup(folio, ff); > error = -ENOMEM; > @@ -2184,22 +1997,17 @@ static int fuse_writepage_locked(struct folio *folio) > ap->num_folios = 1; > > folio_start_writeback(folio); > - fuse_writepage_args_page_fill(wpa, folio, tmp_folio, 0); > + fuse_writepage_args_page_fill(wpa, folio, 0); > > spin_lock(&fi->lock); > - tree_insert(&fi->writepages, wpa); > list_add_tail(&wpa->queue_entry, &fi->queued_writes); > fuse_flush_writepages(inode); > spin_unlock(&fi->lock); > > - folio_end_writeback(folio); > - > return 0; > > err_writepage_args: > fuse_file_put(ff, false); > -err_nofile: > - folio_put(tmp_folio); > err: > mapping_set_error(folio->mapping, error); > return error; > @@ -2209,7 +2017,6 @@ struct fuse_fill_wb_data { > struct fuse_writepage_args *wpa; > struct fuse_file *ff; > struct inode *inode; > - struct folio **orig_folios; > unsigned int max_folios; > }; > > @@ -2244,69 +2051,11 @@ static void fuse_writepages_send(struct fuse_fill_wb_data *data) > struct fuse_writepage_args *wpa = data->wpa; > struct inode *inode = data->inode; > struct fuse_inode *fi = get_fuse_inode(inode); > - int num_folios = wpa->ia.ap.num_folios; > - int i; > > spin_lock(&fi->lock); > list_add_tail(&wpa->queue_entry, &fi->queued_writes); > fuse_flush_writepages(inode); > spin_unlock(&fi->lock); > - > - for (i = 0; i < num_folios; i++) > - folio_end_writeback(data->orig_folios[i]); > -} > - > -/* > - * Check under fi->lock if the page is under writeback, and insert it onto the > - * rb_tree if not. Otherwise iterate auxiliary write requests, to see if there's > - * one already added for a page at this offset. If there's none, then insert > - * this new request onto the auxiliary list, otherwise reuse the existing one by > - * swapping the new temp page with the old one. 
> - */ > -static bool fuse_writepage_add(struct fuse_writepage_args *new_wpa, > - struct folio *folio) > -{ > - struct fuse_inode *fi = get_fuse_inode(new_wpa->inode); > - struct fuse_writepage_args *tmp; > - struct fuse_writepage_args *old_wpa; > - struct fuse_args_pages *new_ap = &new_wpa->ia.ap; > - > - WARN_ON(new_ap->num_folios != 0); > - new_ap->num_folios = 1; > - > - spin_lock(&fi->lock); > - old_wpa = fuse_insert_writeback(&fi->writepages, new_wpa); > - if (!old_wpa) { > - spin_unlock(&fi->lock); > - return true; > - } > - > - for (tmp = old_wpa->next; tmp; tmp = tmp->next) { > - pgoff_t curr_index; > - > - WARN_ON(tmp->inode != new_wpa->inode); > - curr_index = tmp->ia.write.in.offset >> PAGE_SHIFT; > - if (curr_index == folio->index) { > - WARN_ON(tmp->ia.ap.num_folios != 1); > - swap(tmp->ia.ap.folios[0], new_ap->folios[0]); > - break; > - } > - } > - > - if (!tmp) { > - new_wpa->next = old_wpa->next; > - old_wpa->next = new_wpa; > - } > - > - spin_unlock(&fi->lock); > - > - if (tmp) { > - fuse_writepage_finish_stat(new_wpa->inode, > - folio); > - fuse_writepage_free(new_wpa); > - } > - > - return false; > } > > static bool fuse_writepage_need_send(struct fuse_conn *fc, struct folio *folio, > @@ -2315,15 +2064,6 @@ static bool fuse_writepage_need_send(struct fuse_conn *fc, struct folio *folio, > { > WARN_ON(!ap->num_folios); > > - /* > - * Being under writeback is unlikely but possible. For example direct > - * read to an mmaped fuse file will set the page dirty twice; once when > - * the pages are faulted with get_user_pages(), and then after the read > - * completed. 
> - */ > - if (fuse_folio_is_writeback(data->inode, folio)) > - return true; > - > /* Reached max pages */ > if (ap->num_folios == fc->max_pages) > return true; > @@ -2333,7 +2073,7 @@ static bool fuse_writepage_need_send(struct fuse_conn *fc, struct folio *folio, > return true; > > /* Discontinuity */ > - if (data->orig_folios[ap->num_folios - 1]->index + 1 != folio_index(folio)) > + if (ap->folios[ap->num_folios - 1]->index + 1 != folio_index(folio)) > return true; > > /* Need to grow the pages array? If so, did the expansion fail? */ > @@ -2352,7 +2092,6 @@ static int fuse_writepages_fill(struct folio *folio, > struct inode *inode = data->inode; > struct fuse_inode *fi = get_fuse_inode(inode); > struct fuse_conn *fc = get_fuse_conn(inode); > - struct folio *tmp_folio; > int err; > > if (!data->ff) { > @@ -2367,54 +2106,23 @@ static int fuse_writepages_fill(struct folio *folio, > data->wpa = NULL; > } > > - err = -ENOMEM; > - tmp_folio = folio_alloc(GFP_NOFS | __GFP_HIGHMEM, 0); > - if (!tmp_folio) > - goto out_unlock; > - > - /* > - * The page must not be redirtied until the writeout is completed > - * (i.e. userspace has sent a reply to the write request). Otherwise > - * there could be more than one temporary page instance for each real > - * page. > - * > - * This is ensured by holding the page lock in page_mkwrite() while > - * checking fuse_page_is_writeback(). We already hold the page lock > - * since clear_page_dirty_for_io() and keep it held until we add the > - * request to the fi->writepages list and increment ap->num_folios. > - * After this fuse_page_is_writeback() will indicate that the page is > - * under writeback, so we can release the page lock. 
> - */ > if (data->wpa == NULL) { > err = -ENOMEM; > wpa = fuse_writepage_args_setup(folio, data->ff); > - if (!wpa) { > - folio_put(tmp_folio); > + if (!wpa) > goto out_unlock; > - } > fuse_file_get(wpa->ia.ff); > data->max_folios = 1; > ap = &wpa->ia.ap; > } > folio_start_writeback(folio); > > - fuse_writepage_args_page_fill(wpa, folio, tmp_folio, ap->num_folios); > - data->orig_folios[ap->num_folios] = folio; > + fuse_writepage_args_page_fill(wpa, folio, ap->num_folios); > > err = 0; > - if (data->wpa) { > - /* > - * Protected by fi->lock against concurrent access by > - * fuse_page_is_writeback(). > - */ > - spin_lock(&fi->lock); > - ap->num_folios++; > - spin_unlock(&fi->lock); > - } else if (fuse_writepage_add(wpa, folio)) { > + ap->num_folios++; > + if (!data->wpa) > data->wpa = wpa; > - } else { > - folio_end_writeback(folio); > - } > out_unlock: > folio_unlock(folio); > > @@ -2441,13 +2149,6 @@ static int fuse_writepages(struct address_space *mapping, > data.wpa = NULL; > data.ff = NULL; > > - err = -ENOMEM; > - data.orig_folios = kcalloc(fc->max_pages, > - sizeof(struct folio *), > - GFP_NOFS); > - if (!data.orig_folios) > - goto out; > - > err = write_cache_pages(mapping, wbc, fuse_writepages_fill, &data); > if (data.wpa) { > WARN_ON(!data.wpa->ia.ap.num_folios); > @@ -2456,7 +2157,6 @@ static int fuse_writepages(struct address_space *mapping, > if (data.ff) > fuse_file_put(data.ff, false); > > - kfree(data.orig_folios); > out: > return err; > } > @@ -2481,8 +2181,6 @@ static int fuse_write_begin(struct file *file, struct address_space *mapping, > if (IS_ERR(folio)) > goto error; > > - fuse_wait_on_page_writeback(mapping->host, folio->index); > - > if (folio_test_uptodate(folio) || len >= folio_size(folio)) > goto success; > /* > @@ -2545,13 +2243,9 @@ static int fuse_launder_folio(struct folio *folio) > { > int err = 0; > if (folio_clear_dirty_for_io(folio)) { > - struct inode *inode = folio->mapping->host; > - > - /* Serialize with pending writeback 
for the same page */ > - fuse_wait_on_page_writeback(inode, folio->index); > err = fuse_writepage_locked(folio); > if (!err) > - fuse_wait_on_page_writeback(inode, folio->index); > + folio_wait_writeback(folio); > } > return err; > } > @@ -2595,7 +2289,7 @@ static vm_fault_t fuse_page_mkwrite(struct vm_fault *vmf) > return VM_FAULT_NOPAGE; > } > > - fuse_wait_on_folio_writeback(inode, folio); > + folio_wait_writeback(folio); > return VM_FAULT_LOCKED; > } > > @@ -3413,9 +3107,12 @@ static const struct address_space_operations fuse_file_aops = { > void fuse_init_file_inode(struct inode *inode, unsigned int flags) > { > struct fuse_inode *fi = get_fuse_inode(inode); > + struct fuse_conn *fc = get_fuse_conn(inode); > > inode->i_fop = &fuse_file_operations; > inode->i_data.a_ops = &fuse_file_aops; > + if (fc->writeback_cache) > + mapping_set_writeback_indeterminate(&inode->i_data); > > INIT_LIST_HEAD(&fi->write_files); > INIT_LIST_HEAD(&fi->queued_writes); > @@ -3423,7 +3120,6 @@ void fuse_init_file_inode(struct inode *inode, unsigned int flags) > fi->iocachectr = 0; > init_waitqueue_head(&fi->page_waitq); > init_waitqueue_head(&fi->direct_io_waitq); > - fi->writepages = RB_ROOT; > > if (IS_ENABLED(CONFIG_FUSE_DAX)) > fuse_dax_inode_init(inode, flags); > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h > index 74744c6f2860..23736c5c64c1 100644 > --- a/fs/fuse/fuse_i.h > +++ b/fs/fuse/fuse_i.h > @@ -141,9 +141,6 @@ struct fuse_inode { > > /* waitq for direct-io completion */ > wait_queue_head_t direct_io_waitq; > - > - /* List of writepage requestst (pending or sent) */ > - struct rb_root writepages; > }; > > /* readdir cache (directory only) */ -- Thanks, Jingbo ^ permalink raw reply [flat|nested] 124+ messages in thread
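The lifecycle change in this patch can be sketched in user space as follows. This is a simplified model, not the kernel code: struct model_folio, struct model_request and the model_* functions are hypothetical names invented for illustration, and the "writeback" field only stands in for the folio writeback flag that folio_start_writeback()/folio_end_writeback() manage in the real patch. It assumes nothing beyond what the commit message states: the old scheme copies into a temp folio and clears writeback immediately, while the new scheme hands the folio itself to the server and clears writeback only when the request completes.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MODEL_PAGE_SIZE 16	/* tiny stand-in for PAGE_SIZE */

/* Hypothetical stand-in for a page-cache folio. */
struct model_folio {
	bool writeback;
	char data[MODEL_PAGE_SIZE];
};

/* Hypothetical stand-in for one in-flight write request. */
struct model_request {
	struct model_folio tmp;		/* only used by the old scheme */
	struct model_folio *payload;	/* what the server reads from */
	struct model_folio *origin;	/* folio to end writeback on, or NULL */
};

/* Old scheme: copy into a temp folio and clear writeback right away. */
void model_submit_with_tmp(struct model_request *req, struct model_folio *folio)
{
	assert(req && folio);
	folio->writeback = true;
	memcpy(req->tmp.data, folio->data, MODEL_PAGE_SIZE);
	req->payload = &req->tmp;
	req->origin = NULL;
	folio->writeback = false;	/* cleared before the server replies */
}

/* New scheme: hand over the folio itself; writeback stays set. */
void model_submit_direct(struct model_request *req, struct model_folio *folio)
{
	assert(req && folio);
	folio->writeback = true;
	req->payload = folio;
	req->origin = folio;
}

/* Server replied: only the new scheme still has writeback to end. */
void model_complete(struct model_request *req)
{
	assert(req);
	if (req->origin)
		req->origin->writeback = false;
}
```

In this model the copy and the extra allocation disappear from the new path, at the cost that anything waiting on the folio's writeback state now waits for the server's reply. That is the trade-off the rest of the series addresses.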
* Re: [PATCH v6 0/5] fuse: remove temp page copies in writeback 2024-11-22 23:23 [PATCH v6 0/5] fuse: remove temp page copies in writeback Joanne Koong ` (4 preceding siblings ...) 2024-11-22 23:23 ` [PATCH v6 5/5] fuse: remove tmp folio for writebacks and internal rb tree Joanne Koong @ 2024-12-12 21:55 ` Joanne Koong 2024-12-13 11:52 ` Miklos Szeredi 6 siblings, 0 replies; 124+ messages in thread From: Joanne Koong @ 2024-12-12 21:55 UTC (permalink / raw) To: miklos, linux-fsdevel Cc: shakeel.butt, jefflexu, josef, bernd.schubert, linux-mm, kernel-team On Fri, Nov 22, 2024 at 3:24 PM Joanne Koong <joannelkoong@gmail.com> wrote: > > The purpose of this patchset is to help make writeback-cache write > performance in FUSE filesystems as fast as possible. > > In the current FUSE writeback design (see commit 3be5a52b30aa > ("fuse: support writable mmap"))), a temp page is allocated for every dirty > page to be written back, the contents of the dirty page are copied over to the > temp page, and the temp page gets handed to the server to write back. This is > done so that writeback may be immediately cleared on the dirty page, and this > in turn is done for two reasons: > a) in order to mitigate the following deadlock scenario that may arise if > reclaim waits on writeback on the dirty page to complete (more details can be > found in this thread [1]): > * single-threaded FUSE server is in the middle of handling a request > that needs a memory allocation > * memory allocation triggers direct reclaim > * direct reclaim waits on a folio under writeback > * the FUSE server can't write back the folio since it's stuck in > direct reclaim > b) in order to unblock internal (eg sync, page compaction) waits on writeback > without needing the server to complete writing back to disk, which may take > an indeterminate amount of time. > > Allocating and copying dirty pages to temp pages is the biggest performance > bottleneck for FUSE writeback. 
This patchset aims to get rid of the temp page > altogether (which will also allow us to get rid of the internal FUSE rb tree > that is needed to keep track of writeback status on the temp pages). > Benchmarks show approximately a 20% improvement in throughput for 4k > block-size writes and a 45% improvement for 1M block-size writes. > > With removing the temp page, writeback state is now only cleared on the dirty > page after the server has written it back to disk. This may take an > indeterminate amount of time. As well, there is also the possibility of > malicious or well-intentioned but buggy servers where writeback may in the > worst case scenario, never complete. This means that any > folio_wait_writeback() on a dirty page belonging to a FUSE filesystem needs to > be carefully audited. > > In particular, these are the cases that need to be accounted for: > * potentially deadlocking in reclaim, as mentioned above > * potentially stalling sync(2) > * potentially stalling page migration / compaction > > This patchset adds a new mapping flag, AS_WRITEBACK_INDETERMINATE, which > filesystems may set on its inode mappings to indicate that writeback > operations may take an indeterminate amount of time to complete. FUSE will set > this flag on its mappings. This patchset adds checks to the critical parts of > reclaim, sync, and page migration logic where writeback may be waited on. > > Please note the following: > * For sync(2), waiting on writeback will be skipped for FUSE, but this has no > effect on existing behavior. Dirty FUSE pages are already not guaranteed to > be written to disk by the time sync(2) returns (eg writeback is cleared on > the dirty page but the server may not have written out the temp page to disk > yet). If the caller wishes to ensure the data has actually been synced to > disk, they should use fsync(2)/fdatasync(2) instead. > * AS_WRITEBACK_INDETERMINATE does not indicate that the folios should never be > waited on when in writeback. 
There are some cases where the wait is
> desirable. For example, for the sync_file_range() syscall, it is fine to
> wait on the writeback since the caller passes in a fd for the operation.
>
> [1]
> https://lore.kernel.org/linux-kernel/495d2400-1d96-4924-99d3-8b2952e05fc3@linux.alibaba.com/
>
> Changelog
> ---------
> v5:
> https://lore.kernel.org/linux-fsdevel/20241115224459.427610-1-joannelkoong@gmail.com/
> Changes from v5 -> v6:
> * Add Shakeel and Jingbo's reviewed-bys
> * Move folio_end_writeback() to fuse_writepage_finish() (Jingbo)
> * Embed fuse_writepage_finish_stat() logic inline (Jingbo)
> * Remove node_stat NR_WRITEBACK inc/sub (Jingbo)
>
> v4:
> https://lore.kernel.org/linux-fsdevel/20241107235614.3637221-1-joannelkoong@gmail.com/
> Changes from v4 -> v5:
> * AS_WRITEBACK_MAY_BLOCK -> AS_WRITEBACK_INDETERMINATE (Shakeel)
> * Drop memory hotplug patch (David and Shakeel)
> * Remove some more unnecessary writeback waits in fuse code (Jingbo)
> * Make commit message for reclaim patch more concise - drop part about
>   deadlock and just focus on how it may stall waits
>
> v3:
> https://lore.kernel.org/linux-fsdevel/20241107191618.2011146-1-joannelkoong@gmail.com/
> Changes from v3 -> v4:
> * Use filemap_fdatawait_range() instead of filemap_range_has_writeback() in
>   readahead
>
> v2:
> https://lore.kernel.org/linux-fsdevel/20241014182228.1941246-1-joannelkoong@gmail.com/
> Changes from v2 -> v3:
> * Account for sync and page migration cases as well (Miklos)
> * Change AS_NO_WRITEBACK_RECLAIM to the more generic AS_WRITEBACK_MAY_BLOCK
> * For fuse inodes, set mapping_writeback_may_block only if fc->writeback_cache
>   is enabled
>
> v1:
> https://lore.kernel.org/linux-fsdevel/20241011223434.1307300-1-joannelkoong@gmail.com/T/#t
> Changes from v1 -> v2:
> * Have flag in "enum mapping_flags" instead of creating asop_flags (Shakeel)
> * Set fuse inodes to use AS_NO_WRITEBACK_RECLAIM (Shakeel)
>
> Joanne Koong (5):
>   mm: add AS_WRITEBACK_INDETERMINATE mapping
flag
>   mm: skip reclaiming folios in legacy memcg writeback indeterminate
>     contexts
>   fs/writeback: in wait_sb_inodes(), skip wait for
>     AS_WRITEBACK_INDETERMINATE mappings
>   mm/migrate: skip migrating folios under writeback with
>     AS_WRITEBACK_INDETERMINATE mappings
>   fuse: remove tmp folio for writebacks and internal rb tree
>
>  fs/fs-writeback.c       |   3 +
>  fs/fuse/file.c          | 360 ++++------------------------------------
>  fs/fuse/fuse_i.h        |   3 -
>  include/linux/pagemap.h |  11 ++
>  mm/migrate.c            |   5 +-
>  mm/vmscan.c             |  10 +-
>  6 files changed, 53 insertions(+), 339 deletions(-)
>

Miklos, may I get your thoughts on this patchset?

Thanks,
Joanne

> --
> 2.43.5
>
* Re: [PATCH v6 0/5] fuse: remove temp page copies in writeback 2024-11-22 23:23 [PATCH v6 0/5] fuse: remove temp page copies in writeback Joanne Koong ` (5 preceding siblings ...) 2024-12-12 21:55 ` [PATCH v6 0/5] fuse: remove temp page copies in writeback Joanne Koong @ 2024-12-13 11:52 ` Miklos Szeredi 2024-12-13 16:47 ` Shakeel Butt 6 siblings, 1 reply; 124+ messages in thread From: Miklos Szeredi @ 2024-12-13 11:52 UTC (permalink / raw) To: Joanne Koong Cc: linux-fsdevel, shakeel.butt, jefflexu, josef, bernd.schubert, linux-mm, kernel-team On Sat, 23 Nov 2024 at 00:24, Joanne Koong <joannelkoong@gmail.com> wrote: > > The purpose of this patchset is to help make writeback-cache write > performance in FUSE filesystems as fast as possible. > > In the current FUSE writeback design (see commit 3be5a52b30aa > ("fuse: support writable mmap"))), a temp page is allocated for every dirty > page to be written back, the contents of the dirty page are copied over to the > temp page, and the temp page gets handed to the server to write back. This is > done so that writeback may be immediately cleared on the dirty page, and this > in turn is done for two reasons: > a) in order to mitigate the following deadlock scenario that may arise if > reclaim waits on writeback on the dirty page to complete (more details can be > found in this thread [1]): > * single-threaded FUSE server is in the middle of handling a request > that needs a memory allocation > * memory allocation triggers direct reclaim > * direct reclaim waits on a folio under writeback > * the FUSE server can't write back the folio since it's stuck in > direct reclaim > b) in order to unblock internal (eg sync, page compaction) waits on writeback > without needing the server to complete writing back to disk, which may take > an indeterminate amount of time. > > Allocating and copying dirty pages to temp pages is the biggest performance > bottleneck for FUSE writeback. 
This patchset aims to get rid of the temp page > altogether (which will also allow us to get rid of the internal FUSE rb tree > that is needed to keep track of writeback status on the temp pages). > Benchmarks show approximately a 20% improvement in throughput for 4k > block-size writes and a 45% improvement for 1M block-size writes. > > With removing the temp page, writeback state is now only cleared on the dirty > page after the server has written it back to disk. This may take an > indeterminate amount of time. As well, there is also the possibility of > malicious or well-intentioned but buggy servers where writeback may in the > worst case scenario, never complete. This means that any > folio_wait_writeback() on a dirty page belonging to a FUSE filesystem needs to > be carefully audited. > > In particular, these are the cases that need to be accounted for: > * potentially deadlocking in reclaim, as mentioned above > * potentially stalling sync(2) > * potentially stalling page migration / compaction > > This patchset adds a new mapping flag, AS_WRITEBACK_INDETERMINATE, which > filesystems may set on its inode mappings to indicate that writeback > operations may take an indeterminate amount of time to complete. FUSE will set > this flag on its mappings. This patchset adds checks to the critical parts of > reclaim, sync, and page migration logic where writeback may be waited on. > > Please note the following: > * For sync(2), waiting on writeback will be skipped for FUSE, but this has no > effect on existing behavior. Dirty FUSE pages are already not guaranteed to > be written to disk by the time sync(2) returns (eg writeback is cleared on > the dirty page but the server may not have written out the temp page to disk > yet). If the caller wishes to ensure the data has actually been synced to > disk, they should use fsync(2)/fdatasync(2) instead. > * AS_WRITEBACK_INDETERMINATE does not indicate that the folios should never be > waited on when in writeback. 
There are some cases where the wait is
> desirable. For example, for the sync_file_range() syscall, it is fine to
> wait on the writeback since the caller passes in a fd for the operation.

Looks good, thanks.

Acked-by: Miklos Szeredi <mszeredi@redhat.com>

I think this should go via the mm tree.

Thanks,
Miklos
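The policy described in the cover letter can be sketched as a small user-space model. This is only an illustration under stated assumptions: struct model_mapping, the MODEL_* constants and the model_* functions are hypothetical and do not reproduce the kernel's definitions; the real checks are spread across mm/vmscan.c, fs/fs-writeback.c and mm/migrate.c per the diffstat. The model also assumes, as a simplification, that a mapping without the flag may always be waited on.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit number; the kernel's value lives in enum mapping_flags. */
#define MODEL_AS_WRITEBACK_INDETERMINATE 0

/* Hypothetical stand-in for struct address_space. */
struct model_mapping {
	unsigned long flags;
};

/* Contexts named in the cover letter where writeback may be waited on. */
enum model_wait_ctx {
	MODEL_CTX_RECLAIM,
	MODEL_CTX_SYNC,
	MODEL_CTX_MIGRATE,
	MODEL_CTX_SYNC_FILE_RANGE,
};

void model_set_writeback_indeterminate(struct model_mapping *m)
{
	assert(m);
	m->flags |= 1UL << MODEL_AS_WRITEBACK_INDETERMINATE;
}

bool model_writeback_indeterminate(const struct model_mapping *m)
{
	assert(m);
	return (m->flags & (1UL << MODEL_AS_WRITEBACK_INDETERMINATE)) != 0;
}

/*
 * Simplified policy: reclaim, sync(2) and migration skip the wait for
 * flagged mappings; sync_file_range() still waits because the caller
 * explicitly passed in a fd for the operation.
 */
bool model_may_wait_on_writeback(const struct model_mapping *m,
				 enum model_wait_ctx ctx)
{
	if (!model_writeback_indeterminate(m))
		return true;
	return ctx == MODEL_CTX_SYNC_FILE_RANGE;
}
```

The point of keeping the flag as a hint rather than a blanket "never wait" is visible in the last function: the decision depends on the calling context, not on the mapping alone.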
* Re: [PATCH v6 0/5] fuse: remove temp page copies in writeback
  2024-12-13 11:52 ` Miklos Szeredi
@ 2024-12-13 16:47 ` Shakeel Butt
  2024-12-18 17:37   ` Joanne Koong
  0 siblings, 1 reply; 124+ messages in thread
From: Shakeel Butt @ 2024-12-13 16:47 UTC (permalink / raw)
  To: Miklos Szeredi, Andrew Morton
  Cc: Joanne Koong, linux-fsdevel, jefflexu, josef, bernd.schubert, linux-mm, kernel-team

+Andrew

On Fri, Dec 13, 2024 at 12:52:44PM +0100, Miklos Szeredi wrote:
> On Sat, 23 Nov 2024 at 00:24, Joanne Koong <joannelkoong@gmail.com> wrote:
> >
> > The purpose of this patchset is to help make writeback-cache write
> > performance in FUSE filesystems as fast as possible.
> >
> > In the current FUSE writeback design (see commit 3be5a52b30aa
> > ("fuse: support writable mmap"))), a temp page is allocated for every dirty
> > page to be written back, the contents of the dirty page are copied over to the
> > temp page, and the temp page gets handed to the server to write back. This is
> > done so that writeback may be immediately cleared on the dirty page, and this
> > in turn is done for two reasons:
> > a) in order to mitigate the following deadlock scenario that may arise if
> > reclaim waits on writeback on the dirty page to complete (more details can be
> > found in this thread [1]):
> > * single-threaded FUSE server is in the middle of handling a request
> >   that needs a memory allocation
> > * memory allocation triggers direct reclaim
> > * direct reclaim waits on a folio under writeback
> > * the FUSE server can't write back the folio since it's stuck in
> >   direct reclaim
> > b) in order to unblock internal (eg sync, page compaction) waits on writeback
> > without needing the server to complete writing back to disk, which may take
> > an indeterminate amount of time.
> >
> > Allocating and copying dirty pages to temp pages is the biggest performance
> > bottleneck for FUSE writeback. This patchset aims to get rid of the temp page
> > altogether (which will also allow us to get rid of the internal FUSE rb tree
> > that is needed to keep track of writeback status on the temp pages).
> > Benchmarks show approximately a 20% improvement in throughput for 4k
> > block-size writes and a 45% improvement for 1M block-size writes.
> >
> > With removing the temp page, writeback state is now only cleared on the dirty
> > page after the server has written it back to disk. This may take an
> > indeterminate amount of time. As well, there is also the possibility of
> > malicious or well-intentioned but buggy servers where writeback may in the
> > worst case scenario, never complete. This means that any
> > folio_wait_writeback() on a dirty page belonging to a FUSE filesystem needs to
> > be carefully audited.
> >
> > In particular, these are the cases that need to be accounted for:
> > * potentially deadlocking in reclaim, as mentioned above
> > * potentially stalling sync(2)
> > * potentially stalling page migration / compaction
> >
> > This patchset adds a new mapping flag, AS_WRITEBACK_INDETERMINATE, which
> > filesystems may set on its inode mappings to indicate that writeback
> > operations may take an indeterminate amount of time to complete. FUSE will set
> > this flag on its mappings. This patchset adds checks to the critical parts of
> > reclaim, sync, and page migration logic where writeback may be waited on.
> >
> > Please note the following:
> > * For sync(2), waiting on writeback will be skipped for FUSE, but this has no
> >   effect on existing behavior. Dirty FUSE pages are already not guaranteed to
> >   be written to disk by the time sync(2) returns (eg writeback is cleared on
> >   the dirty page but the server may not have written out the temp page to disk
> >   yet). If the caller wishes to ensure the data has actually been synced to
> >   disk, they should use fsync(2)/fdatasync(2) instead.
> > * AS_WRITEBACK_INDETERMINATE does not indicate that the folios should never be
> >   waited on when in writeback. There are some cases where the wait is
> >   desirable. For example, for the sync_file_range() syscall, it is fine to
> >   wait on the writeback since the caller passes in a fd for the operation.
>
> Looks good, thanks.
>
> Acked-by: Miklos Szeredi <mszeredi@redhat.com>
>
> I think this should go via the mm tree.

Andrew, can you please pick this series up or Joanne can send an updated
version with all Acks/Review tag collected? Let us know what you prefer.

Thanks,
Shakeel

^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v6 0/5] fuse: remove temp page copies in writeback
  2024-12-13 16:47 ` Shakeel Butt
@ 2024-12-18 17:37 ` Joanne Koong
  2024-12-18 17:44   ` Shakeel Butt
  0 siblings, 1 reply; 124+ messages in thread
From: Joanne Koong @ 2024-12-18 17:37 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Miklos Szeredi, Andrew Morton, linux-fsdevel, jefflexu, josef, bernd.schubert, linux-mm, kernel-team

On Fri, Dec 13, 2024 at 8:47 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> +Andrew
>
> On Fri, Dec 13, 2024 at 12:52:44PM +0100, Miklos Szeredi wrote:
> > On Sat, 23 Nov 2024 at 00:24, Joanne Koong <joannelkoong@gmail.com> wrote:
> > >
> > > The purpose of this patchset is to help make writeback-cache write
> > > performance in FUSE filesystems as fast as possible.
> > >
> > > In the current FUSE writeback design (see commit 3be5a52b30aa
> > > ("fuse: support writable mmap"))), a temp page is allocated for every dirty
> > > page to be written back, the contents of the dirty page are copied over to the
> > > temp page, and the temp page gets handed to the server to write back. This is
> > > done so that writeback may be immediately cleared on the dirty page, and this
> > > in turn is done for two reasons:
> > > a) in order to mitigate the following deadlock scenario that may arise if
> > > reclaim waits on writeback on the dirty page to complete (more details can be
> > > found in this thread [1]):
> > > * single-threaded FUSE server is in the middle of handling a request
> > >   that needs a memory allocation
> > > * memory allocation triggers direct reclaim
> > > * direct reclaim waits on a folio under writeback
> > > * the FUSE server can't write back the folio since it's stuck in
> > >   direct reclaim
> > > b) in order to unblock internal (eg sync, page compaction) waits on writeback
> > > without needing the server to complete writing back to disk, which may take
> > > an indeterminate amount of time.
> > >
> > > Allocating and copying dirty pages to temp pages is the biggest performance
> > > bottleneck for FUSE writeback. This patchset aims to get rid of the temp page
> > > altogether (which will also allow us to get rid of the internal FUSE rb tree
> > > that is needed to keep track of writeback status on the temp pages).
> > > Benchmarks show approximately a 20% improvement in throughput for 4k
> > > block-size writes and a 45% improvement for 1M block-size writes.
> > >
> > > With removing the temp page, writeback state is now only cleared on the dirty
> > > page after the server has written it back to disk. This may take an
> > > indeterminate amount of time. As well, there is also the possibility of
> > > malicious or well-intentioned but buggy servers where writeback may in the
> > > worst case scenario, never complete. This means that any
> > > folio_wait_writeback() on a dirty page belonging to a FUSE filesystem needs to
> > > be carefully audited.
> > >
> > > In particular, these are the cases that need to be accounted for:
> > > * potentially deadlocking in reclaim, as mentioned above
> > > * potentially stalling sync(2)
> > > * potentially stalling page migration / compaction
> > >
> > > This patchset adds a new mapping flag, AS_WRITEBACK_INDETERMINATE, which
> > > filesystems may set on its inode mappings to indicate that writeback
> > > operations may take an indeterminate amount of time to complete. FUSE will set
> > > this flag on its mappings. This patchset adds checks to the critical parts of
> > > reclaim, sync, and page migration logic where writeback may be waited on.
> > >
> > > Please note the following:
> > > * For sync(2), waiting on writeback will be skipped for FUSE, but this has no
> > >   effect on existing behavior. Dirty FUSE pages are already not guaranteed to
> > >   be written to disk by the time sync(2) returns (eg writeback is cleared on
> > >   the dirty page but the server may not have written out the temp page to disk
> > >   yet). If the caller wishes to ensure the data has actually been synced to
> > >   disk, they should use fsync(2)/fdatasync(2) instead.
> > > * AS_WRITEBACK_INDETERMINATE does not indicate that the folios should never be
> > >   waited on when in writeback. There are some cases where the wait is
> > >   desirable. For example, for the sync_file_range() syscall, it is fine to
> > >   wait on the writeback since the caller passes in a fd for the operation.
> >
> > Looks good, thanks.
> >
> > Acked-by: Miklos Szeredi <mszeredi@redhat.com>
> >
> > I think this should go via the mm tree.
>
> Andrew, can you please pick this series up or Joanne can send an updated
> version with all Acks/Review tag collected? Let us know what you prefer.
>

Hi Andrew,

Could you let us know your preference or if there's anything else you
need from us to proceed?

Thanks,
Joanne

> Thanks,
> Shakeel

^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v6 0/5] fuse: remove temp page copies in writeback
  2024-12-18 17:37 ` Joanne Koong
@ 2024-12-18 17:44 ` Shakeel Butt
  2024-12-18 17:53   ` Joanne Koong
  0 siblings, 1 reply; 124+ messages in thread
From: Shakeel Butt @ 2024-12-18 17:44 UTC (permalink / raw)
  To: Joanne Koong
  Cc: Miklos Szeredi, Andrew Morton, linux-fsdevel, jefflexu, josef, bernd.schubert, linux-mm, kernel-team

On Wed, Dec 18, 2024 at 09:37:37AM -0800, Joanne Koong wrote:
[...]
>
> Hi Andrew,
>
> Could you let us know your preference or if there's anything else you
> need from us to proceed?
>

Andrew has already picked the series into mm-tree (mm-unstable).

^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v6 0/5] fuse: remove temp page copies in writeback
  2024-12-18 17:44 ` Shakeel Butt
@ 2024-12-18 17:53 ` Joanne Koong
  0 siblings, 0 replies; 124+ messages in thread
From: Joanne Koong @ 2024-12-18 17:53 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Miklos Szeredi, Andrew Morton, linux-fsdevel, jefflexu, josef, bernd.schubert, linux-mm, kernel-team

On Wed, Dec 18, 2024 at 9:44 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Wed, Dec 18, 2024 at 09:37:37AM -0800, Joanne Koong wrote:
> [...]
> >
> > Hi Andrew,
> >
> > Could you let us know your preference or if there's anything else you
> > need from us to proceed?
> >
>
> Andrew has already picked the series into mm-tree (mm-unstable).
>

Great, thanks.

^ permalink raw reply [flat|nested] 124+ messages in thread