* [PATCHSET v7 0/11] Uncached buffered IO
From: Jens Axboe @ 2024-12-13 15:55 UTC (permalink / raw)
To: linux-mm, linux-fsdevel; +Cc: hannes, clm, linux-kernel, willy, kirill, bfoster
Hi,
5 years ago I posted patches adding support for RWF_UNCACHED, as a way
to do buffered IO that isn't page cache persistent. The approach back
then was to have private pages for IO, and then get rid of them once IO
was done. But that then runs into all the issues that O_DIRECT has, in
terms of synchronizing with the page cache.
So here's a new approach to the same concept, but using the page cache
as synchronization. Due to excessive bike shedding on the naming, this
is now named RWF_DONTCACHE, and is less special in that it's just page
cache IO, except it prunes the ranges once IO is completed.
Why do this, you may ask? The tldr is that device speeds are only
getting faster, while reclaim is not. Doing normal buffered IO can be
very unpredictable, and suck up a lot of resources on the reclaim side.
This leads people to use O_DIRECT as a work-around, which has its own
set of restrictions in terms of size, offset, and length of IO. It's
also inherently synchronous, and now you need async IO as well. While
the latter isn't necessarily a big problem as we have good options
available there, it also should not be a requirement when all you want
to do is read or write some data without caching.
Even on desktop type systems, a normal NVMe device can fill the entire
page cache in seconds. On the big system I used for testing, there's a
lot more RAM, but also a lot more devices. As can be seen in some of the
results in the following patches, you can still fill RAM in seconds even
when there's 1TB of it. Hence this problem isn't solely a "big
hyperscaler system" issue; it's common across the board.
Common for both reads and writes with RWF_DONTCACHE is that they use the
page cache for IO. Reads work just like a normal buffered read would,
with the only exception being that the touched ranges will get pruned
after data has been copied. For writes, the ranges will get writeback
kicked off before the syscall returns, and then writeback completion
will prune the range. Hence writes aren't synchronous, and it's easy to
pipeline writes using RWF_DONTCACHE. Folios that aren't instantiated by
RWF_DONTCACHE IO are left untouched. This means that uncached IO
will take advantage of the page cache for uptodate data, but not leave
anything it instantiated/created in cache.
File systems need to support this. This patchset adds support for the
generic read path, which covers file systems like ext4. Patches exist to
add support for iomap/XFS and btrfs as well, which sit on top of this
series. If RWF_DONTCACHE IO is attempted on a file system that doesn't
support it, -EOPNOTSUPP is returned. Hence the user can rely on it
either working as designed, or flagging an error if that's not the
case. The intent here is to give the application a sensible fallback
path - eg, it may fall back to O_DIRECT if appropriate, or just live
with the fact that uncached IO isn't available and do normal buffered
IO.
Adding "support" to other file systems should be trivial, most of the
time just a one-liner adding FOP_DONTCACHE to the fop_flags in the
file_operations struct.
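For illustration, a hypothetical "foo" file system that already uses the
generic read/write paths would opt in like this (the names here are made
up; the NFS patch later in this thread does the same for a real fs):

static const struct file_operations foo_file_operations = {
	.llseek		= generic_file_llseek,
	.read_iter	= generic_file_read_iter,
	.write_iter	= generic_file_write_iter,
	/* ... */
	.fop_flags	= FOP_DONTCACHE,
};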
Performance results are in patch 8 for reads, and you can find the write
side results in the XFS patch adding support for DONTCACHE writes for
XFS:
https://git.kernel.dk/cgit/linux/commit/?h=buffered-uncached.9&id=edd7b1c910c5251941c6ba179f44b4c81a089019
with the tldr being that I see about a 65% improvement in performance
for both, with fully predictable IO times. CPU reduction is substantial
as well, with no kswapd activity at all for reclaim when using
uncached IO.
Using it from applications is trivial - just set RWF_DONTCACHE for the
read or write, using pwritev2(2) or preadv2(2). For io_uring, same
thing, just set RWF_DONTCACHE in sqe->rw_flags for a buffered read/write
operation. And that's it.
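As a minimal sketch of what that looks like in practice - assuming headers
that define the new flag (it's 0x80 in this series), and falling back to a
plain buffered read if the file system doesn't flag FOP_DONTCACHE:

#define _GNU_SOURCE
#include <sys/uio.h>
#include <fcntl.h>
#include <errno.h>
#include <stdio.h>

#ifndef RWF_DONTCACHE
#define RWF_DONTCACHE	0x00000080
#endif

int main(int argc, char **argv)
{
	char buf[65536];
	struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
	int fd = open(argv[1], O_RDONLY);
	ssize_t ret;

	if (fd < 0)
		return 1;

	/* buffered read; folios instantiated by it are pruned on completion */
	ret = preadv2(fd, &iov, 1, 0, RWF_DONTCACHE);
	if (ret < 0 && errno == EOPNOTSUPP) {
		/* fs doesn't support it, fall back to a normal buffered read */
		ret = preadv2(fd, &iov, 1, 0, 0);
	}
	printf("read %zd bytes\n", ret);
	return ret < 0;
}

The io_uring side is the same deal, e.g. with liburing and an already set
up ring (sketch):

	struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

	io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
	sqe->rw_flags = RWF_DONTCACHE;
	io_uring_submit(&ring);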
Patches 1..7 are just prep patches, and should have no functional
changes at all. Patch 8 adds support for the filemap path for
RWF_DONTCACHE reads, and patches 9..11 are just prep patches for
supporting the write side of uncached writes. In the below mentioned
branch, there are then patches to adopt uncached reads and writes for
xfs, btrfs, and ext4. The latter currently relies on a bit of a hack for
passing whether this is an uncached write or not through
->write_begin(), which can hopefully go away once ext4 adopts iomap for
buffered writes. I call it a hack as it's not the prettiest way to do it;
however, it is fully solid and works just fine.
Passes full xfstests and fsx overnight runs, no issues observed. That
includes the vm running the testing also using RWF_DONTCACHE on the
host. I'll post fsstress and fsx patches for RWF_DONTCACHE separately.
As far as I'm concerned, no further work needs doing here.
And git tree for the patches is here:
https://git.kernel.dk/cgit/linux/log/?h=buffered-uncached.9
include/linux/fs.h | 21 +++++++-
include/linux/page-flags.h | 5 ++
include/linux/pagemap.h | 1 +
include/trace/events/mmflags.h | 3 +-
include/uapi/linux/fs.h | 6 ++-
mm/filemap.c | 97 +++++++++++++++++++++++++++++-----
mm/internal.h | 2 +
mm/readahead.c | 22 ++++++--
mm/swap.c | 2 +
mm/truncate.c | 54 ++++++++++---------
10 files changed, 166 insertions(+), 47 deletions(-)
Since v6
- Rename the PG_uncached flag to PG_dropbehind
- Shuffle patches around a bit, most notably so the foliop_uncached
patch goes with the ext4 support
- Get rid of foliop_uncached hack for btrfs (Christoph)
- Get rid of passing in struct address_space to filemap_create_folio()
- Inline invalidate_complete_folio2() in folio_unmap_invalidate() rather
than keep it as a separate helper
- Rebase on top of current master
--
Jens Axboe
* [PATCH 01/11] mm/filemap: change filemap_create_folio() to take a struct kiocb
From: Jens Axboe @ 2024-12-13 15:55 UTC (permalink / raw)
To: linux-mm, linux-fsdevel
Cc: hannes, clm, linux-kernel, willy, kirill, bfoster, Jens Axboe
Rather than pass in the file and position separately (both of which come
from the kiocb), just take the struct kiocb itself. With the kiocb being
passed in, skip passing in the address_space as well. While doing so, move
the ki_flags checking into filemap_create_folio() too, in preparation for
actually needing the kiocb in the function.
No functional changes in this patch.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
mm/filemap.c | 17 +++++++++--------
1 file changed, 9 insertions(+), 8 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index f61cf51c2238..8b29323b15d7 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2459,15 +2459,17 @@ static int filemap_update_page(struct kiocb *iocb,
return error;
}
-static int filemap_create_folio(struct file *file,
- struct address_space *mapping, loff_t pos,
- struct folio_batch *fbatch)
+static int filemap_create_folio(struct kiocb *iocb, struct folio_batch *fbatch)
{
+ struct address_space *mapping = iocb->ki_filp->f_mapping;
struct folio *folio;
int error;
unsigned int min_order = mapping_min_folio_order(mapping);
pgoff_t index;
+ if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_WAITQ))
+ return -EAGAIN;
+
folio = filemap_alloc_folio(mapping_gfp_mask(mapping), min_order);
if (!folio)
return -ENOMEM;
@@ -2486,7 +2488,7 @@ static int filemap_create_folio(struct file *file,
* well to keep locking rules simple.
*/
filemap_invalidate_lock_shared(mapping);
- index = (pos >> (PAGE_SHIFT + min_order)) << min_order;
+ index = (iocb->ki_pos >> (PAGE_SHIFT + min_order)) << min_order;
error = filemap_add_folio(mapping, folio, index,
mapping_gfp_constraint(mapping, GFP_KERNEL));
if (error == -EEXIST)
@@ -2494,7 +2496,8 @@ static int filemap_create_folio(struct file *file,
if (error)
goto error;
- error = filemap_read_folio(file, mapping->a_ops->read_folio, folio);
+ error = filemap_read_folio(iocb->ki_filp, mapping->a_ops->read_folio,
+ folio);
if (error)
goto error;
@@ -2550,9 +2553,7 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count,
filemap_get_read_batch(mapping, index, last_index - 1, fbatch);
}
if (!folio_batch_count(fbatch)) {
- if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_WAITQ))
- return -EAGAIN;
- err = filemap_create_folio(filp, mapping, iocb->ki_pos, fbatch);
+ err = filemap_create_folio(iocb, fbatch);
if (err == AOP_TRUNCATED_PAGE)
goto retry;
return err;
--
2.45.2
* [PATCH 02/11] mm/filemap: use page_cache_sync_ra() to kick off read-ahead
From: Jens Axboe @ 2024-12-13 15:55 UTC (permalink / raw)
To: linux-mm, linux-fsdevel
Cc: hannes, clm, linux-kernel, willy, kirill, bfoster, Jens Axboe,
Christoph Hellwig
Rather than use the page_cache_sync_readahead() helper, define our own
ractl and use page_cache_sync_ra() directly. In preparation for needing
to modify ractl inside filemap_get_pages().
No functional changes in this patch.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
mm/filemap.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index 8b29323b15d7..220dc7c6e12f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2527,7 +2527,6 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count,
{
struct file *filp = iocb->ki_filp;
struct address_space *mapping = filp->f_mapping;
- struct file_ra_state *ra = &filp->f_ra;
pgoff_t index = iocb->ki_pos >> PAGE_SHIFT;
pgoff_t last_index;
struct folio *folio;
@@ -2542,12 +2541,13 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count,
filemap_get_read_batch(mapping, index, last_index - 1, fbatch);
if (!folio_batch_count(fbatch)) {
+ DEFINE_READAHEAD(ractl, filp, &filp->f_ra, mapping, index);
+
if (iocb->ki_flags & IOCB_NOIO)
return -EAGAIN;
if (iocb->ki_flags & IOCB_NOWAIT)
flags = memalloc_noio_save();
- page_cache_sync_readahead(mapping, ra, filp, index,
- last_index - index);
+ page_cache_sync_ra(&ractl, last_index - index);
if (iocb->ki_flags & IOCB_NOWAIT)
memalloc_noio_restore(flags);
filemap_get_read_batch(mapping, index, last_index - 1, fbatch);
--
2.45.2
* [PATCH 03/11] mm/readahead: add folio allocation helper
From: Jens Axboe @ 2024-12-13 15:55 UTC (permalink / raw)
To: linux-mm, linux-fsdevel
Cc: hannes, clm, linux-kernel, willy, kirill, bfoster, Jens Axboe
Just a wrapper around filemap_alloc_folio() for now, but add it in
preparation for modifying the folio based on the 'ractl' being passed
in.
No functional changes in this patch.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
mm/readahead.c | 16 +++++++++++-----
1 file changed, 11 insertions(+), 5 deletions(-)
diff --git a/mm/readahead.c b/mm/readahead.c
index ea650b8b02fb..8a62ad4106ff 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -188,6 +188,12 @@ static void read_pages(struct readahead_control *rac)
BUG_ON(readahead_count(rac));
}
+static struct folio *ractl_alloc_folio(struct readahead_control *ractl,
+ gfp_t gfp_mask, unsigned int order)
+{
+ return filemap_alloc_folio(gfp_mask, order);
+}
+
/**
* page_cache_ra_unbounded - Start unchecked readahead.
* @ractl: Readahead control.
@@ -265,8 +271,8 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
continue;
}
- folio = filemap_alloc_folio(gfp_mask,
- mapping_min_folio_order(mapping));
+ folio = ractl_alloc_folio(ractl, gfp_mask,
+ mapping_min_folio_order(mapping));
if (!folio)
break;
@@ -436,7 +442,7 @@ static inline int ra_alloc_folio(struct readahead_control *ractl, pgoff_t index,
pgoff_t mark, unsigned int order, gfp_t gfp)
{
int err;
- struct folio *folio = filemap_alloc_folio(gfp, order);
+ struct folio *folio = ractl_alloc_folio(ractl, gfp, order);
if (!folio)
return -ENOMEM;
@@ -750,7 +756,7 @@ void readahead_expand(struct readahead_control *ractl,
if (folio && !xa_is_value(folio))
return; /* Folio apparently present */
- folio = filemap_alloc_folio(gfp_mask, min_order);
+ folio = ractl_alloc_folio(ractl, gfp_mask, min_order);
if (!folio)
return;
@@ -779,7 +785,7 @@ void readahead_expand(struct readahead_control *ractl,
if (folio && !xa_is_value(folio))
return; /* Folio apparently present */
- folio = filemap_alloc_folio(gfp_mask, min_order);
+ folio = ractl_alloc_folio(ractl, gfp_mask, min_order);
if (!folio)
return;
--
2.45.2
* [PATCH 04/11] mm: add PG_dropbehind folio flag
From: Jens Axboe @ 2024-12-13 15:55 UTC (permalink / raw)
To: linux-mm, linux-fsdevel
Cc: hannes, clm, linux-kernel, willy, kirill, bfoster, Jens Axboe
Add a folio flag that file IO can use to indicate that the cached IO
being done should be dropped from the page cache upon completion.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
include/linux/page-flags.h | 5 +++++
include/trace/events/mmflags.h | 3 ++-
2 files changed, 7 insertions(+), 1 deletion(-)
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index cf46ac720802..16607f02abd0 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -110,6 +110,7 @@ enum pageflags {
PG_reclaim, /* To be reclaimed asap */
PG_swapbacked, /* Page is backed by RAM/swap */
PG_unevictable, /* Page is "unevictable" */
+ PG_dropbehind, /* drop pages on IO completion */
#ifdef CONFIG_MMU
PG_mlocked, /* Page is vma mlocked */
#endif
@@ -562,6 +563,10 @@ PAGEFLAG(Reclaim, reclaim, PF_NO_TAIL)
FOLIO_FLAG(readahead, FOLIO_HEAD_PAGE)
FOLIO_TEST_CLEAR_FLAG(readahead, FOLIO_HEAD_PAGE)
+FOLIO_FLAG(dropbehind, FOLIO_HEAD_PAGE)
+ FOLIO_TEST_CLEAR_FLAG(dropbehind, FOLIO_HEAD_PAGE)
+ __FOLIO_SET_FLAG(dropbehind, FOLIO_HEAD_PAGE)
+
#ifdef CONFIG_HIGHMEM
/*
* Must use a macro here due to header dependency issues. page_zone() is not
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index bb8a59c6caa2..3bc8656c8359 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -116,7 +116,8 @@
DEF_PAGEFLAG_NAME(head), \
DEF_PAGEFLAG_NAME(reclaim), \
DEF_PAGEFLAG_NAME(swapbacked), \
- DEF_PAGEFLAG_NAME(unevictable) \
+ DEF_PAGEFLAG_NAME(unevictable), \
+ DEF_PAGEFLAG_NAME(dropbehind) \
IF_HAVE_PG_MLOCK(mlocked) \
IF_HAVE_PG_HWPOISON(hwpoison) \
IF_HAVE_PG_IDLE(idle) \
--
2.45.2
* [PATCH 05/11] mm/readahead: add readahead_control->dropbehind member
From: Jens Axboe @ 2024-12-13 15:55 UTC (permalink / raw)
To: linux-mm, linux-fsdevel
Cc: hannes, clm, linux-kernel, willy, kirill, bfoster, Jens Axboe
If ractl->dropbehind is set to true, then folios created are marked as
dropbehind as well.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
include/linux/pagemap.h | 1 +
mm/readahead.c | 8 +++++++-
2 files changed, 8 insertions(+), 1 deletion(-)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index bcf0865a38ae..5da4b6d42fae 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -1353,6 +1353,7 @@ struct readahead_control {
pgoff_t _index;
unsigned int _nr_pages;
unsigned int _batch_count;
+ bool dropbehind;
bool _workingset;
unsigned long _pflags;
};
diff --git a/mm/readahead.c b/mm/readahead.c
index 8a62ad4106ff..c0a6dc5d5686 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -191,7 +191,13 @@ static void read_pages(struct readahead_control *rac)
static struct folio *ractl_alloc_folio(struct readahead_control *ractl,
gfp_t gfp_mask, unsigned int order)
{
- return filemap_alloc_folio(gfp_mask, order);
+ struct folio *folio;
+
+ folio = filemap_alloc_folio(gfp_mask, order);
+ if (folio && ractl->dropbehind)
+ __folio_set_dropbehind(folio);
+
+ return folio;
}
/**
--
2.45.2
* [PATCH 06/11] mm/truncate: add folio_unmap_invalidate() helper
From: Jens Axboe @ 2024-12-13 15:55 UTC (permalink / raw)
To: linux-mm, linux-fsdevel
Cc: hannes, clm, linux-kernel, willy, kirill, bfoster, Jens Axboe
Add a folio_unmap_invalidate() helper, which unmaps and invalidates a
given folio. The caller must already have locked the folio. Embed the
old invalidate_complete_folio2() helper in there as well, as nobody else
calls it.
Use this new helper in invalidate_inode_pages2_range(), rather than
duplicate the code there.
In preparation for using this elsewhere as well, have it take a gfp_t
mask rather than assume GFP_KERNEL is the right choice. This bubbles
back to invalidate_complete_folio2() as well.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
mm/internal.h | 2 ++
mm/truncate.c | 54 ++++++++++++++++++++++++++-------------------------
2 files changed, 30 insertions(+), 26 deletions(-)
diff --git a/mm/internal.h b/mm/internal.h
index cb8d8e8e3ffa..ed3c3690eb03 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -392,6 +392,8 @@ void unmap_page_range(struct mmu_gather *tlb,
struct vm_area_struct *vma,
unsigned long addr, unsigned long end,
struct zap_details *details);
+int folio_unmap_invalidate(struct address_space *mapping, struct folio *folio,
+ gfp_t gfp);
void page_cache_ra_order(struct readahead_control *, struct file_ra_state *,
unsigned int order);
diff --git a/mm/truncate.c b/mm/truncate.c
index 7c304d2f0052..7b6f4399319f 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -525,6 +525,15 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
}
EXPORT_SYMBOL(invalidate_mapping_pages);
+static int folio_launder(struct address_space *mapping, struct folio *folio)
+{
+ if (!folio_test_dirty(folio))
+ return 0;
+ if (folio->mapping != mapping || mapping->a_ops->launder_folio == NULL)
+ return 0;
+ return mapping->a_ops->launder_folio(folio);
+}
+
/*
* This is like mapping_evict_folio(), except it ignores the folio's
* refcount. We do this because invalidate_inode_pages2() needs stronger
@@ -532,14 +541,26 @@ EXPORT_SYMBOL(invalidate_mapping_pages);
* shrink_folio_list() has a temp ref on them, or because they're transiently
* sitting in the folio_add_lru() caches.
*/
-static int invalidate_complete_folio2(struct address_space *mapping,
- struct folio *folio)
+int folio_unmap_invalidate(struct address_space *mapping, struct folio *folio,
+ gfp_t gfp)
{
- if (folio->mapping != mapping)
- return 0;
+ int ret;
+
+ VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
- if (!filemap_release_folio(folio, GFP_KERNEL))
+ if (folio_test_dirty(folio))
return 0;
+ if (folio_mapped(folio))
+ unmap_mapping_folio(folio);
+ BUG_ON(folio_mapped(folio));
+
+ ret = folio_launder(mapping, folio);
+ if (ret)
+ return ret;
+ if (folio->mapping != mapping)
+ return -EBUSY;
+ if (!filemap_release_folio(folio, gfp))
+ return -EBUSY;
spin_lock(&mapping->host->i_lock);
xa_lock_irq(&mapping->i_pages);
@@ -558,16 +579,7 @@ static int invalidate_complete_folio2(struct address_space *mapping,
failed:
xa_unlock_irq(&mapping->i_pages);
spin_unlock(&mapping->host->i_lock);
- return 0;
-}
-
-static int folio_launder(struct address_space *mapping, struct folio *folio)
-{
- if (!folio_test_dirty(folio))
- return 0;
- if (folio->mapping != mapping || mapping->a_ops->launder_folio == NULL)
- return 0;
- return mapping->a_ops->launder_folio(folio);
+ return -EBUSY;
}
/**
@@ -629,18 +641,8 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
folio_unlock(folio);
continue;
}
- VM_BUG_ON_FOLIO(!folio_contains(folio, indices[i]), folio);
folio_wait_writeback(folio);
-
- if (folio_mapped(folio))
- unmap_mapping_folio(folio);
- BUG_ON(folio_mapped(folio));
-
- ret2 = folio_launder(mapping, folio);
- if (ret2 == 0) {
- if (!invalidate_complete_folio2(mapping, folio))
- ret2 = -EBUSY;
- }
+ ret2 = folio_unmap_invalidate(mapping, folio, GFP_KERNEL);
if (ret2 < 0)
ret = ret2;
folio_unlock(folio);
--
2.45.2
* [PATCH 07/11] fs: add RWF_DONTCACHE iocb and FOP_DONTCACHE file_operations flag
From: Jens Axboe @ 2024-12-13 15:55 UTC (permalink / raw)
To: linux-mm, linux-fsdevel
Cc: hannes, clm, linux-kernel, willy, kirill, bfoster, Jens Axboe
If a file system supports uncached buffered IO, it may set FOP_DONTCACHE
and enable support for RWF_DONTCACHE. If RWF_DONTCACHE is attempted
without the file system supporting it, the IO will fail with -EOPNOTSUPP.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
include/linux/fs.h | 14 +++++++++++++-
include/uapi/linux/fs.h | 6 +++++-
2 files changed, 18 insertions(+), 2 deletions(-)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 7e29433c5ecc..6a838b5479a6 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -322,6 +322,7 @@ struct readahead_control;
#define IOCB_NOWAIT (__force int) RWF_NOWAIT
#define IOCB_APPEND (__force int) RWF_APPEND
#define IOCB_ATOMIC (__force int) RWF_ATOMIC
+#define IOCB_DONTCACHE (__force int) RWF_DONTCACHE
/* non-RWF related bits - start at 16 */
#define IOCB_EVENTFD (1 << 16)
@@ -356,7 +357,8 @@ struct readahead_control;
{ IOCB_SYNC, "SYNC" }, \
{ IOCB_NOWAIT, "NOWAIT" }, \
{ IOCB_APPEND, "APPEND" }, \
- { IOCB_ATOMIC, "ATOMIC"}, \
+ { IOCB_ATOMIC, "ATOMIC" }, \
+ { IOCB_DONTCACHE, "DONTCACHE" }, \
{ IOCB_EVENTFD, "EVENTFD"}, \
{ IOCB_DIRECT, "DIRECT" }, \
{ IOCB_WRITE, "WRITE" }, \
@@ -2127,6 +2129,8 @@ struct file_operations {
#define FOP_UNSIGNED_OFFSET ((__force fop_flags_t)(1 << 5))
/* Supports asynchronous lock callbacks */
#define FOP_ASYNC_LOCK ((__force fop_flags_t)(1 << 6))
+/* File system supports uncached read/write buffered IO */
+#define FOP_DONTCACHE ((__force fop_flags_t)(1 << 7))
/* Wrap a directory iterator that needs exclusive inode access */
int wrap_directory_iterator(struct file *, struct dir_context *,
@@ -3614,6 +3618,14 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags,
if (!(ki->ki_filp->f_mode & FMODE_CAN_ATOMIC_WRITE))
return -EOPNOTSUPP;
}
+ if (flags & RWF_DONTCACHE) {
+ /* file system must support it */
+ if (!(ki->ki_filp->f_op->fop_flags & FOP_DONTCACHE))
+ return -EOPNOTSUPP;
+ /* DAX mappings not supported */
+ if (IS_DAX(ki->ki_filp->f_mapping->host))
+ return -EOPNOTSUPP;
+ }
kiocb_flags |= (__force int) (flags & RWF_SUPPORTED);
if (flags & RWF_SYNC)
kiocb_flags |= IOCB_DSYNC;
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 753971770733..56a4f93a08f4 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -332,9 +332,13 @@ typedef int __bitwise __kernel_rwf_t;
/* Atomic Write */
#define RWF_ATOMIC ((__force __kernel_rwf_t)0x00000040)
+/* buffered IO that drops the cache after reading or writing data */
+#define RWF_DONTCACHE ((__force __kernel_rwf_t)0x00000080)
+
/* mask of flags supported by the kernel */
#define RWF_SUPPORTED (RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
- RWF_APPEND | RWF_NOAPPEND | RWF_ATOMIC)
+ RWF_APPEND | RWF_NOAPPEND | RWF_ATOMIC |\
+ RWF_DONTCACHE)
#define PROCFS_IOCTL_MAGIC 'f'
--
2.45.2
* [PATCH 08/11] mm/filemap: add read support for RWF_DONTCACHE
From: Jens Axboe @ 2024-12-13 15:55 UTC (permalink / raw)
To: linux-mm, linux-fsdevel
Cc: hannes, clm, linux-kernel, willy, kirill, bfoster, Jens Axboe
Add RWF_DONTCACHE as a read operation flag, which means that any data
read will be removed from the page cache upon completion. Uses the page
cache to synchronize, and simply prunes folios that were instantiated
when the operation completes. While it would be possible to use private
pages for this, using the page cache as synchronization is handy for a
variety of reasons:
1) No special truncate magic is needed
2) Async buffered reads need some place to serialize, using the page
cache is a lot easier than writing extra code for this
3) The pruning cost is pretty reasonable
and the code to support this is much simpler as a result.
You can think of uncached buffered IO as being the much more attractive
cousin of O_DIRECT - it has none of the restrictions of O_DIRECT. Yes,
it will copy the data, but unlike regular buffered IO, it doesn't run
into the unpredictability of the page cache in terms of reclaim. As an
example, on a test box with 32 drives, reading them with buffered IO
looks as follows:
Reading bs 65536, uncached 0
1s: 145945MB/sec
2s: 158067MB/sec
3s: 157007MB/sec
4s: 148622MB/sec
5s: 118824MB/sec
6s: 70494MB/sec
7s: 41754MB/sec
8s: 90811MB/sec
9s: 92204MB/sec
10s: 95178MB/sec
11s: 95488MB/sec
12s: 95552MB/sec
13s: 96275MB/sec
where it's quite easy to see where the page cache filled up, and
performance went from good to erratic, and finally settles at a much
lower rate. Looking at top while this is ongoing, we see:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7535 root 20 0 267004 0 0 S 3199 0.0 8:40.65 uncached
3326 root 20 0 0 0 0 R 100.0 0.0 0:16.40 kswapd4
3327 root 20 0 0 0 0 R 100.0 0.0 0:17.22 kswapd5
3328 root 20 0 0 0 0 R 100.0 0.0 0:13.29 kswapd6
3332 root 20 0 0 0 0 R 100.0 0.0 0:11.11 kswapd10
3339 root 20 0 0 0 0 R 100.0 0.0 0:16.25 kswapd17
3348 root 20 0 0 0 0 R 100.0 0.0 0:16.40 kswapd26
3343 root 20 0 0 0 0 R 100.0 0.0 0:16.30 kswapd21
3344 root 20 0 0 0 0 R 100.0 0.0 0:11.92 kswapd22
3349 root 20 0 0 0 0 R 100.0 0.0 0:16.28 kswapd27
3352 root 20 0 0 0 0 R 99.7 0.0 0:11.89 kswapd30
3353 root 20 0 0 0 0 R 96.7 0.0 0:16.04 kswapd31
3329 root 20 0 0 0 0 R 96.4 0.0 0:11.41 kswapd7
3345 root 20 0 0 0 0 R 96.4 0.0 0:13.40 kswapd23
3330 root 20 0 0 0 0 S 91.1 0.0 0:08.28 kswapd8
3350 root 20 0 0 0 0 S 86.8 0.0 0:11.13 kswapd28
3325 root 20 0 0 0 0 S 76.3 0.0 0:07.43 kswapd3
3341 root 20 0 0 0 0 S 74.7 0.0 0:08.85 kswapd19
3334 root 20 0 0 0 0 S 71.7 0.0 0:10.04 kswapd12
3351 root 20 0 0 0 0 R 60.5 0.0 0:09.59 kswapd29
3323 root 20 0 0 0 0 R 57.6 0.0 0:11.50 kswapd1
[...]
which is just showing a partial list of the 32 kswapd threads that are
running mostly full tilt, burning ~28 full CPU cores.
If the same test case is run with RWF_DONTCACHE set for the buffered read,
the output looks as follows:
Reading bs 65536, uncached 1
1s: 153144MB/sec
2s: 156760MB/sec
3s: 158110MB/sec
4s: 158009MB/sec
5s: 158043MB/sec
6s: 157638MB/sec
7s: 157999MB/sec
8s: 158024MB/sec
9s: 157764MB/sec
10s: 157477MB/sec
11s: 157417MB/sec
12s: 157455MB/sec
13s: 157233MB/sec
14s: 156692MB/sec
which is just chugging along at ~155GB/sec of read performance. Looking
at top, we see:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7961 root 20 0 267004 0 0 S 3180 0.0 5:37.95 uncached
8024 axboe 20 0 14292 4096 0 R 1.0 0.0 0:00.13 top
where just the test app is using CPU, no reclaim is taking place outside
of the main thread. Not only is performance 65% better, it's also using
half the CPU to do it.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
mm/filemap.c | 28 ++++++++++++++++++++++++++--
mm/swap.c | 2 ++
2 files changed, 28 insertions(+), 2 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index 220dc7c6e12f..77290ac205dc 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2473,6 +2473,8 @@ static int filemap_create_folio(struct kiocb *iocb, struct folio_batch *fbatch)
folio = filemap_alloc_folio(mapping_gfp_mask(mapping), min_order);
if (!folio)
return -ENOMEM;
+ if (iocb->ki_flags & IOCB_DONTCACHE)
+ __folio_set_dropbehind(folio);
/*
* Protect against truncate / hole punch. Grabbing invalidate_lock
@@ -2518,6 +2520,8 @@ static int filemap_readahead(struct kiocb *iocb, struct file *file,
if (iocb->ki_flags & IOCB_NOIO)
return -EAGAIN;
+ if (iocb->ki_flags & IOCB_DONTCACHE)
+ ractl.dropbehind = 1;
page_cache_async_ra(&ractl, folio, last_index - folio->index);
return 0;
}
@@ -2547,6 +2551,8 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count,
return -EAGAIN;
if (iocb->ki_flags & IOCB_NOWAIT)
flags = memalloc_noio_save();
+ if (iocb->ki_flags & IOCB_DONTCACHE)
+ ractl.dropbehind = 1;
page_cache_sync_ra(&ractl, last_index - index);
if (iocb->ki_flags & IOCB_NOWAIT)
memalloc_noio_restore(flags);
@@ -2594,6 +2600,20 @@ static inline bool pos_same_folio(loff_t pos1, loff_t pos2, struct folio *folio)
return (pos1 >> shift == pos2 >> shift);
}
+static void filemap_uncached_read(struct address_space *mapping,
+ struct folio *folio)
+{
+ if (!folio_test_dropbehind(folio))
+ return;
+ if (folio_test_writeback(folio) || folio_test_dirty(folio))
+ return;
+ if (folio_trylock(folio)) {
+ if (folio_test_clear_dropbehind(folio))
+ folio_unmap_invalidate(mapping, folio, 0);
+ folio_unlock(folio);
+ }
+}
+
/**
* filemap_read - Read data from the page cache.
* @iocb: The iocb to read.
@@ -2707,8 +2727,12 @@ ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter,
}
}
put_folios:
- for (i = 0; i < folio_batch_count(&fbatch); i++)
- folio_put(fbatch.folios[i]);
+ for (i = 0; i < folio_batch_count(&fbatch); i++) {
+ struct folio *folio = fbatch.folios[i];
+
+ filemap_uncached_read(mapping, folio);
+ folio_put(folio);
+ }
folio_batch_init(&fbatch);
} while (iov_iter_count(iter) && iocb->ki_pos < isize && !error);
diff --git a/mm/swap.c b/mm/swap.c
index 10decd9dffa1..ba02bd5ba145 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -427,6 +427,8 @@ static void folio_inc_refs(struct folio *folio)
*/
void folio_mark_accessed(struct folio *folio)
{
+ if (folio_test_dropbehind(folio))
+ return;
if (lru_gen_enabled()) {
folio_inc_refs(folio);
return;
--
2.45.2
* [PATCH 09/11] mm/filemap: drop streaming/uncached pages when writeback completes
From: Jens Axboe @ 2024-12-13 15:55 UTC (permalink / raw)
To: linux-mm, linux-fsdevel
Cc: hannes, clm, linux-kernel, willy, kirill, bfoster, Jens Axboe
If the folio is marked as dropbehind (the streaming/uncached case), drop
its pages when writeback completes. Intended to be used with
RWF_DONTCACHE, to avoid needing sync writes for uncached IO.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
mm/filemap.c | 28 ++++++++++++++++++++++++++++
1 file changed, 28 insertions(+)
diff --git a/mm/filemap.c b/mm/filemap.c
index 77290ac205dc..ec087bab1c97 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1599,6 +1599,27 @@ int folio_wait_private_2_killable(struct folio *folio)
}
EXPORT_SYMBOL(folio_wait_private_2_killable);
+/*
+ * If folio was marked as dropbehind, then pages should be dropped when writeback
+ * completes. Do that now. If we fail, it's likely because of a big folio -
+ * just reset dropbehind for that case and later completions should invalidate.
+ */
+static void folio_end_dropbehind(struct folio *folio)
+{
+ /*
+ * Hitting !in_task() should not happen off RWF_DONTCACHE writeback,
+ * but can happen if normal writeback just happens to find dirty folios
+ * that were created as part of uncached writeback, and that writeback
+ * would otherwise not need non-IRQ handling. Just skip the
+ * invalidation in that case.
+ */
+ if (in_task() && folio_trylock(folio)) {
+ if (folio->mapping)
+ folio_unmap_invalidate(folio->mapping, folio, 0);
+ folio_unlock(folio);
+ }
+}
+
/**
* folio_end_writeback - End writeback against a folio.
* @folio: The folio.
@@ -1609,6 +1630,8 @@ EXPORT_SYMBOL(folio_wait_private_2_killable);
*/
void folio_end_writeback(struct folio *folio)
{
+ bool folio_dropbehind = false;
+
VM_BUG_ON_FOLIO(!folio_test_writeback(folio), folio);
/*
@@ -1630,9 +1653,14 @@ void folio_end_writeback(struct folio *folio)
* reused before the folio_wake_bit().
*/
folio_get(folio);
+ if (!folio_test_dirty(folio))
+ folio_dropbehind = folio_test_clear_dropbehind(folio);
if (__folio_end_writeback(folio))
folio_wake_bit(folio, PG_writeback);
acct_reclaim_writeback(folio);
+
+ if (folio_dropbehind)
+ folio_end_dropbehind(folio);
folio_put(folio);
}
EXPORT_SYMBOL(folio_end_writeback);
--
2.45.2
* [PATCH 10/11] mm/filemap: add filemap_fdatawrite_range_kick() helper
From: Jens Axboe @ 2024-12-13 15:55 UTC (permalink / raw)
To: linux-mm, linux-fsdevel
Cc: hannes, clm, linux-kernel, willy, kirill, bfoster, Jens Axboe
Works like filemap_fdatawrite_range(), except it's a non-integrity data
writeback and hence only starts writeback on the specified range. Will
help facilitate generically starting uncached writeback from
generic_write_sync(), as header dependencies preclude doing this inline
from fs.h.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
include/linux/fs.h | 2 ++
mm/filemap.c | 18 ++++++++++++++++++
2 files changed, 20 insertions(+)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6a838b5479a6..653b5efa3d3f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2878,6 +2878,8 @@ extern int __must_check file_fdatawait_range(struct file *file, loff_t lstart,
extern int __must_check file_check_and_advance_wb_err(struct file *file);
extern int __must_check file_write_and_wait_range(struct file *file,
loff_t start, loff_t end);
+int filemap_fdatawrite_range_kick(struct address_space *mapping, loff_t start,
+ loff_t end);
static inline int file_write_and_wait(struct file *file)
{
diff --git a/mm/filemap.c b/mm/filemap.c
index ec087bab1c97..7db995cb5179 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -449,6 +449,24 @@ int filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
}
EXPORT_SYMBOL(filemap_fdatawrite_range);
+/**
+ * filemap_fdatawrite_range_kick - start writeback on a range
+ * @mapping: target address_space
+ * @start: index to start writeback on
+ * @end: last (non-inclusive) index for writeback
+ *
+ * This is a non-integrity writeback helper, to start writing back folios
+ * for the indicated range.
+ *
+ * Return: %0 on success, negative error code otherwise.
+ */
+int filemap_fdatawrite_range_kick(struct address_space *mapping, loff_t start,
+ loff_t end)
+{
+ return __filemap_fdatawrite_range(mapping, start, end, WB_SYNC_NONE);
+}
+EXPORT_SYMBOL_GPL(filemap_fdatawrite_range_kick);
+
/**
* filemap_flush - mostly a non-blocking flush
* @mapping: target address_space
--
2.45.2
* [PATCH 11/11] mm: call filemap_fdatawrite_range_kick() after IOCB_DONTCACHE issue
From: Jens Axboe @ 2024-12-13 15:55 UTC (permalink / raw)
To: linux-mm, linux-fsdevel
Cc: hannes, clm, linux-kernel, willy, kirill, bfoster, Jens Axboe
When a buffered write with IOCB_DONTCACHE has been successfully
submitted, call filemap_fdatawrite_range_kick() to kick off the IO.
File systems call generic_write_sync() for any successful buffered write
submission, hence add the logic here rather than needing to modify the
file system.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
include/linux/fs.h | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 653b5efa3d3f..58a618853574 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2912,6 +2912,11 @@ static inline ssize_t generic_write_sync(struct kiocb *iocb, ssize_t count)
(iocb->ki_flags & IOCB_SYNC) ? 0 : 1);
if (ret)
return ret;
+ } else if (iocb->ki_flags & IOCB_DONTCACHE) {
+ struct address_space *mapping = iocb->ki_filp->f_mapping;
+
+ filemap_fdatawrite_range_kick(mapping, iocb->ki_pos,
+ iocb->ki_pos + count);
}
return count;
--
2.45.2
* Re: [PATCHSET v7 0/11] Uncached buffered IO
From: Jens Axboe @ 2024-12-17 15:58 UTC (permalink / raw)
To: linux-mm, linux-fsdevel; +Cc: hannes, clm, linux-kernel, willy, kirill, bfoster
Hi,
I'd like to move this forward for 6.14, but I'm getting crickets around
here.
--
Jens Axboe
* [PATCH] nfs: flag as supporting FOP_DONTCACHE
From: Mike Snitzer @ 2024-12-18 17:16 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-mm, linux-fsdevel, hannes, clm, linux-kernel, willy, kirill,
bfoster, linux-nfs
On Fri, Dec 13, 2024 at 08:55:14AM -0700, Jens Axboe wrote:
> Hi,
>
> 5 years ago I posted patches adding support for RWF_UNCACHED, as a way
> to do buffered IO that isn't page cache persistent.
>
> [...]
Hi Jens,
You may recall I tested NFS to work with UNCACHED (now DONTCACHE).
I've rebased the required small changes, feel free to append this to
your series if you like.
More work is needed to inform knfsd to selectively use DONTCACHE, but
that will require more effort and coordination amongst the NFS kernel
team.
From: Mike Snitzer <snitzer@kernel.org>
Date: Thu, 14 Nov 2024 22:09:01 +0000
Subject: [PATCH] nfs: flag as supporting FOP_DONTCACHE
Jens says: "nfs just uses generic_file_read_iter(), so read side is
fine with the flag added and generic_perform_write() for the write
side, wrapped in some nfs jazz. So you can probably just set the flag
and be done with it."
Must also update nfs_write_begin() to set FGP_DONTCACHE in fgp flags
passed to __filemap_get_folio().
Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
---
fs/nfs/file.c | 3 +++
fs/nfs/nfs4file.c | 1 +
2 files changed, 4 insertions(+)
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 1bb646752e466..ff5d4c97df494 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -354,6 +354,8 @@ static int nfs_write_begin(struct file *file, struct address_space *mapping,
file, mapping->host->i_ino, len, (long long) pos);
fgp |= fgf_set_order(len);
+ if (foliop_is_dropbehind(foliop))
+ fgp |= FGP_DONTCACHE;
start:
folio = __filemap_get_folio(mapping, pos >> PAGE_SHIFT, fgp,
mapping_gfp_mask(mapping));
@@ -909,5 +911,6 @@ const struct file_operations nfs_file_operations = {
.splice_write = iter_file_splice_write,
.check_flags = nfs_check_flags,
.setlease = simple_nosetlease,
+ .fop_flags = FOP_DONTCACHE,
};
EXPORT_SYMBOL_GPL(nfs_file_operations);
diff --git a/fs/nfs/nfs4file.c b/fs/nfs/nfs4file.c
index 1cd9652f3c280..83e2467b7c66f 100644
--- a/fs/nfs/nfs4file.c
+++ b/fs/nfs/nfs4file.c
@@ -467,4 +467,5 @@ const struct file_operations nfs4_file_operations = {
#else
.llseek = nfs_file_llseek,
#endif
+ .fop_flags = FOP_DONTCACHE,
};
--
2.44.0
* Re: [PATCH] nfs: flag as supporting FOP_DONTCACHE
From: Jens Axboe @ 2024-12-19 16:54 UTC (permalink / raw)
To: Mike Snitzer
Cc: linux-mm, linux-fsdevel, hannes, clm, linux-kernel, willy, kirill,
bfoster, linux-nfs
On 12/18/24 10:16 AM, Mike Snitzer wrote:
> On Fri, Dec 13, 2024 at 08:55:14AM -0700, Jens Axboe wrote:
>> [...]
>
> Hi Jens,
>
> You may recall I tested NFS to work with UNCACHED (now DONTCACHE).
> I've rebased the required small changes, feel free to append this to
> your series if you like.
>
> More work is needed to inform knfsd to selectively use DONTCACHE, but
> that will require more effort and coordination amongst the NFS kernel
> team.
Thanks Mike, I'll add it to the part 2 mix.
--
Jens Axboe
* Re: [PATCH 01/11] mm/filemap: change filemap_create_folio() to take a struct kiocb
From: Kirill A. Shutemov @ 2024-12-20 10:51 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-mm, linux-fsdevel, hannes, clm, linux-kernel, willy,
bfoster
On Fri, Dec 13, 2024 at 08:55:15AM -0700, Jens Axboe wrote:
> Rather than pass in the file and position separately (both of which come
> from the kiocb), just take the struct kiocb itself. With the kiocb being
> passed in, skip passing in the address_space as well. While doing so, move
> the ki_flags checking into filemap_create_folio() too, in preparation for
> actually needing the kiocb in the function.
>
> No functional changes in this patch.
>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
--
Kiryl Shutsemau / Kirill A. Shutemov
* Re: [PATCH 02/11] mm/filemap: use page_cache_sync_ra() to kick off read-ahead
2024-12-13 15:55 ` [PATCH 02/11] mm/filemap: use page_cache_sync_ra() to kick off read-ahead Jens Axboe
@ 2024-12-20 10:56 ` Kirill A. Shutemov
0 siblings, 0 replies; 30+ messages in thread
From: Kirill A. Shutemov @ 2024-12-20 10:56 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-mm, linux-fsdevel, hannes, clm, linux-kernel, willy,
bfoster, Christoph Hellwig
On Fri, Dec 13, 2024 at 08:55:16AM -0700, Jens Axboe wrote:
> Rather than use the page_cache_sync_readahead() helper, define our own
> ractl and use page_cache_sync_ra() directly, in preparation for needing
> to modify the ractl inside filemap_get_pages().
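>
> As a sketch, the result looks something like this inside
> filemap_get_pages():
>
> 	DEFINE_READAHEAD(ractl, filp, &filp->f_ra, mapping, index);
> 	page_cache_sync_ra(&ractl, last_index - index);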
>
> No functional changes in this patch.
>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
--
Kiryl Shutsemau / Kirill A. Shutemov
* Re: [PATCH 03/11] mm/readahead: add folio allocation helper
2024-12-13 15:55 ` [PATCH 03/11] mm/readahead: add folio allocation helper Jens Axboe
@ 2024-12-20 10:59 ` Kirill A. Shutemov
0 siblings, 0 replies; 30+ messages in thread
From: Kirill A. Shutemov @ 2024-12-20 10:59 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-mm, linux-fsdevel, hannes, clm, linux-kernel, willy,
bfoster
On Fri, Dec 13, 2024 at 08:55:17AM -0700, Jens Axboe wrote:
> Just a wrapper around filemap_alloc_folio() for now, but add it in
> preparation for modifying the folio based on the 'ractl' being passed
> in.
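>
> Assuming a name like ractl_alloc_folio(), the helper is just (sketch):
>
> 	static struct folio *ractl_alloc_folio(struct readahead_control *ractl,
> 					       gfp_t gfp_mask, unsigned int order)
> 	{
> 		/* 'ractl' is unused for now, until dropbehind is wired up */
> 		return filemap_alloc_folio(gfp_mask, order);
> 	}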
>
> No functional changes in this patch.
>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
--
Kiryl Shutsemau / Kirill A. Shutemov
* Re: [PATCH 04/11] mm: add PG_dropbehind folio flag
2024-12-13 15:55 ` [PATCH 04/11] mm: add PG_dropbehind folio flag Jens Axboe
@ 2024-12-20 11:08 ` Kirill A. Shutemov
2024-12-20 14:38 ` Matthew Wilcox
2024-12-20 15:03 ` David Hildenbrand
0 siblings, 2 replies; 30+ messages in thread
From: Kirill A. Shutemov @ 2024-12-20 11:08 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-mm, linux-fsdevel, hannes, clm, linux-kernel, willy,
bfoster, David Hildenbrand, Vlastimil Babka
On Fri, Dec 13, 2024 at 08:55:18AM -0700, Jens Axboe wrote:
> Add a folio flag that file IO can use to indicate that the cached IO
> being done should be dropped from the page cache upon completion.
>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
+ David, Vlastimil.
I think we should consider converting existing folio_set_reclaim() /
SetPageReclaim() users to the new flag. From a quick scan, all of them
would benefit from dropping the page after writeback is complete instead
of leaving the folio on the LRU.
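Assuming the FOLIO_FLAG() helpers that come with the new flag, the
conversion would be largely mechanical, e.g. in pageout():

	-	folio_set_reclaim(folio);
	+	folio_set_dropbehind(folio);

with writeback completion then dropping the folio outright instead of
just rotating it to the tail of the LRU.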
--
Kiryl Shutsemau / Kirill A. Shutemov
* Re: [PATCH 05/11] mm/readahead: add readahead_control->dropbehind member
2024-12-13 15:55 ` [PATCH 05/11] mm/readahead: add readahead_control->dropbehind member Jens Axboe
@ 2024-12-20 11:09 ` Kirill A. Shutemov
0 siblings, 0 replies; 30+ messages in thread
From: Kirill A. Shutemov @ 2024-12-20 11:09 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-mm, linux-fsdevel, hannes, clm, linux-kernel, willy,
bfoster
On Fri, Dec 13, 2024 at 08:55:19AM -0700, Jens Axboe wrote:
> If ractl->dropbehind is set to true, then any folios it creates are
> marked dropbehind as well.
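>
> Conceptually, in the readahead folio allocation helper (sketch, the
> exact setter may differ):
>
> 	if (ractl->dropbehind)
> 		__folio_set_dropbehind(folio);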
>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
--
Kiryl Shutsemau / Kirill A. Shutemov
* Re: [PATCH 06/11] mm/truncate: add folio_unmap_invalidate() helper
2024-12-13 15:55 ` [PATCH 06/11] mm/truncate: add folio_unmap_invalidate() helper Jens Axboe
@ 2024-12-20 11:13 ` Kirill A. Shutemov
2024-12-20 14:16 ` Jens Axboe
0 siblings, 1 reply; 30+ messages in thread
From: Kirill A. Shutemov @ 2024-12-20 11:13 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-mm, linux-fsdevel, hannes, clm, linux-kernel, willy,
bfoster
On Fri, Dec 13, 2024 at 08:55:20AM -0700, Jens Axboe wrote:
> @@ -629,18 +641,8 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
> folio_unlock(folio);
> continue;
> }
> - VM_BUG_ON_FOLIO(!folio_contains(folio, indices[i]), folio);
> folio_wait_writeback(folio);
Any particular reason you drop this VM_BUG_ON_FOLIO()?
> -
> - if (folio_mapped(folio))
> - unmap_mapping_folio(folio);
> - BUG_ON(folio_mapped(folio));
> -
> - ret2 = folio_launder(mapping, folio);
> - if (ret2 == 0) {
> - if (!invalidate_complete_folio2(mapping, folio))
> - ret2 = -EBUSY;
> - }
> + ret2 = folio_unmap_invalidate(mapping, folio, GFP_KERNEL);
> if (ret2 < 0)
> ret = ret2;
> folio_unlock(folio);
--
Kiryl Shutsemau / Kirill A. Shutemov
* Re: [PATCHSET v7 0/11] Uncached buffered IO
2024-12-13 15:55 [PATCHSET v7 0/11] Uncached buffered IO Jens Axboe
` (12 preceding siblings ...)
2024-12-18 17:16 ` [PATCH] nfs: flag as supporting FOP_DONTCACHE Mike Snitzer
@ 2024-12-20 11:25 ` Kirill A. Shutemov
2024-12-20 14:21 ` Jens Axboe
13 siblings, 1 reply; 30+ messages in thread
From: Kirill A. Shutemov @ 2024-12-20 11:25 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-mm, linux-fsdevel, hannes, clm, linux-kernel, willy,
bfoster
> Since v6
> - Rename the PG_uncached flag to PG_dropbehind
> - Shuffle patches around a bit, most notably so the foliop_uncached
> patch goes with the ext4 support
> - Get rid of foliop_uncached hack for btrfs (Christoph)
> - Get rid of passing in struct address_space to filemap_create_folio()
> - Inline invalidate_complete_folio2() in folio_unmap_invalidate() rather
> than keep it as a separate helper
> - Rebase on top of current master
Hm. v6 had a patch that cleared the PG_uncached flag if the page was
accessed via a non-uncached lookup [1]. What happened to it? I don't see
it here.
[1] https://lore.kernel.org/all/20241203153232.92224-14-axboe@kernel.dk
--
Kiryl Shutsemau / Kirill A. Shutemov
* Re: [PATCH 06/11] mm/truncate: add folio_unmap_invalidate() helper
2024-12-20 11:13 ` Kirill A. Shutemov
@ 2024-12-20 14:16 ` Jens Axboe
0 siblings, 0 replies; 30+ messages in thread
From: Jens Axboe @ 2024-12-20 14:16 UTC (permalink / raw)
To: Kirill A. Shutemov
Cc: linux-mm, linux-fsdevel, hannes, clm, linux-kernel, willy,
bfoster
On 12/20/24 4:13 AM, Kirill A. Shutemov wrote:
> On Fri, Dec 13, 2024 at 08:55:20AM -0700, Jens Axboe wrote:
>> @@ -629,18 +641,8 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
>> folio_unlock(folio);
>> continue;
>> }
>> - VM_BUG_ON_FOLIO(!folio_contains(folio, indices[i]), folio);
>> folio_wait_writeback(folio);
>
> Any particular reason you drop this VM_BUG_ON_FOLIO()?
No reason at all, I think it just slipped under the radar. I've put it back
now, thanks!
--
Jens Axboe
* Re: [PATCHSET v7 0/11] Uncached buffered IO
2024-12-20 11:25 ` [PATCHSET v7 0/11] Uncached buffered IO Kirill A. Shutemov
@ 2024-12-20 14:21 ` Jens Axboe
0 siblings, 0 replies; 30+ messages in thread
From: Jens Axboe @ 2024-12-20 14:21 UTC (permalink / raw)
To: Kirill A. Shutemov
Cc: linux-mm, linux-fsdevel, hannes, clm, linux-kernel, willy,
bfoster
On 12/20/24 4:25 AM, Kirill A. Shutemov wrote:
>> Since v6
>> - Rename the PG_uncached flag to PG_dropbehind
>> - Shuffle patches around a bit, most notably so the foliop_uncached
>> patch goes with the ext4 support
>> - Get rid of foliop_uncached hack for btrfs (Christoph)
>> - Get rid of passing in struct address_space to filemap_create_folio()
>> - Inline invalidate_complete_folio2() in folio_unmap_invalidate() rather
>> than keep it as a separate helper
>> - Rebase on top of current master
>
> Hm. v6 had a patch that cleared the PG_uncached flag if the page was
> accessed via a non-uncached lookup [1]. What happened to it? I don't see
> it here.
>
> [1] https://lore.kernel.org/all/20241203153232.92224-14-axboe@kernel.dk
Since I only needed these bits for the fs support, I didn't include the
patch in this series. However, I did move it back to the core series for
v8, to avoid having a core dependency for the patches adding support to
iomap and xfs/btrfs. It's this one:
https://git.kernel.dk/cgit/linux/commit/?h=buffered-uncached.10&id=e4b7e8f693caf84021424ebafa139f38c5599db3
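The gist of that patch: if a dropbehind-marked folio is found via a
regular (non-dontcache) lookup, the flag is cleared again. Conceptually
something like this (a sketch using the FGP_DONTCACHE flag from the
changelog below; the real check may live elsewhere):

	if (!(fgp_flags & FGP_DONTCACHE) && folio_test_dropbehind(folio))
		folio_clear_dropbehind(folio);

so a folio that's still cached doesn't get pruned behind the back of a
normal buffered IO user.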
I'll send out a new version with just slight tweaks today. So far it
has the following changelog:
- Rename filemap_uncached_read() to filemap_end_dropbehind_read()
- Rename folio_end_dropbehind() to folio_end_dropbehind_write()
- Make the "mm: add FGP_DONTCACHE folio creation flag" patch part of
the base patches series, to avoid dependencies with btrfs/xfs/iomap
- Remove now dead IOMAP_F_DONTCACHE define and setting on xfs/iomap
where moving this patch back to the core series is one of the entries.
--
Jens Axboe
* Re: [PATCH 04/11] mm: add PG_dropbehind folio flag
2024-12-20 11:08 ` Kirill A. Shutemov
@ 2024-12-20 14:38 ` Matthew Wilcox
2024-12-20 15:11 ` Kirill A. Shutemov
2024-12-20 15:03 ` David Hildenbrand
1 sibling, 1 reply; 30+ messages in thread
From: Matthew Wilcox @ 2024-12-20 14:38 UTC (permalink / raw)
To: Kirill A. Shutemov
Cc: Jens Axboe, linux-mm, linux-fsdevel, hannes, clm, linux-kernel,
bfoster, David Hildenbrand, Vlastimil Babka
On Fri, Dec 20, 2024 at 01:08:39PM +0200, Kirill A. Shutemov wrote:
> On Fri, Dec 13, 2024 at 08:55:18AM -0700, Jens Axboe wrote:
> > Add a folio flag that file IO can use to indicate that the cached IO
> > being done should be dropped from the page cache upon completion.
> >
> > Signed-off-by: Jens Axboe <axboe@kernel.dk>
>
> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>
> + David, Vlastimil.
>
> I think we should consider converting existing folio_set_reclaim() /
> SetPageReclaim() users to the new flag. From a quick scan, all of them
> would benefit from dropping the page after writeback is complete instead
> of leaving the folio on the LRU.
Ooh, that would be nice. Removes the overloading of PG_reclaim with
PG_readahead, right?
* Re: [PATCH 04/11] mm: add PG_dropbehind folio flag
2024-12-20 11:08 ` Kirill A. Shutemov
2024-12-20 14:38 ` Matthew Wilcox
@ 2024-12-20 15:03 ` David Hildenbrand
2024-12-20 15:12 ` Kirill A. Shutemov
1 sibling, 1 reply; 30+ messages in thread
From: David Hildenbrand @ 2024-12-20 15:03 UTC (permalink / raw)
To: Kirill A. Shutemov, Jens Axboe
Cc: linux-mm, linux-fsdevel, hannes, clm, linux-kernel, willy,
bfoster, Vlastimil Babka
On 20.12.24 12:08, Kirill A. Shutemov wrote:
> On Fri, Dec 13, 2024 at 08:55:18AM -0700, Jens Axboe wrote:
>> Add a folio flag that file IO can use to indicate that the cached IO
>> being done should be dropped from the page cache upon completion.
>>
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>
> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>
> + David, Vlastimil.
>
> I think we should consider converting existing folio_set_reclaim() /
> SetPageReclaim() users to the new flag. From a quick scan, all of them
> would benefit from dropping the page after writeback is complete instead
> of leaving the folio on the LRU.
I wonder if there are some use cases where we write a lot of data and
then only consume it read-only from that point on (databases? fancy AI
stuff? no idea :) ).
--
Cheers,
David / dhildenb
* Re: [PATCH 04/11] mm: add PG_dropbehind folio flag
2024-12-20 14:38 ` Matthew Wilcox
@ 2024-12-20 15:11 ` Kirill A. Shutemov
2024-12-20 15:45 ` Matthew Wilcox
0 siblings, 1 reply; 30+ messages in thread
From: Kirill A. Shutemov @ 2024-12-20 15:11 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Jens Axboe, linux-mm, linux-fsdevel, hannes, clm, linux-kernel,
bfoster, David Hildenbrand, Vlastimil Babka
On Fri, Dec 20, 2024 at 02:38:09PM +0000, Matthew Wilcox wrote:
> On Fri, Dec 20, 2024 at 01:08:39PM +0200, Kirill A. Shutemov wrote:
> > On Fri, Dec 13, 2024 at 08:55:18AM -0700, Jens Axboe wrote:
> > > Add a folio flag that file IO can use to indicate that the cached IO
> > > being done should be dropped from the page cache upon completion.
> > >
> > > Signed-off-by: Jens Axboe <axboe@kernel.dk>
> >
> > Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> >
> > + David, Vlastimil.
> >
> > I think we should consider converting existing folio_set_reclaim() /
> > SetPageReclaim() users to the new flag. From a quick scan, all of them
> > would benefit from dropping the page after writeback is complete instead
> > of leaving the folio on the LRU.
>
> Ooh, that would be nice. Removes the overloading of PG_reclaim with
> PG_readahead, right?
Yep.
--
Kiryl Shutsemau / Kirill A. Shutemov
* Re: [PATCH 04/11] mm: add PG_dropbehind folio flag
2024-12-20 15:03 ` David Hildenbrand
@ 2024-12-20 15:12 ` Kirill A. Shutemov
2024-12-20 15:20 ` David Hildenbrand
0 siblings, 1 reply; 30+ messages in thread
From: Kirill A. Shutemov @ 2024-12-20 15:12 UTC (permalink / raw)
To: David Hildenbrand
Cc: Jens Axboe, linux-mm, linux-fsdevel, hannes, clm, linux-kernel,
willy, bfoster, Vlastimil Babka
On Fri, Dec 20, 2024 at 04:03:44PM +0100, David Hildenbrand wrote:
> On 20.12.24 12:08, Kirill A. Shutemov wrote:
> > On Fri, Dec 13, 2024 at 08:55:18AM -0700, Jens Axboe wrote:
> > > Add a folio flag that file IO can use to indicate that the cached IO
> > > being done should be dropped from the page cache upon completion.
> > >
> > > Signed-off-by: Jens Axboe <axboe@kernel.dk>
> >
> > Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> >
> > + David, Vlastimil.
> >
> > I think we should consider converting existing folio_set_reclaim() /
> > SetPageReclaim() users to the new flag. From a quick scan, all of them
> > would benefit from dropping the page after writeback is complete instead
> > of leaving the folio on the LRU.
>
> I wonder if there are some use cases where we write a lot of data and then
> only consume it read-only from that point on (databases? fancy AI stuff? no
> idea :) ).
Do we use PG_reclaim for such cases?
--
Kiryl Shutsemau / Kirill A. Shutemov
* Re: [PATCH 04/11] mm: add PG_dropbehind folio flag
2024-12-20 15:12 ` Kirill A. Shutemov
@ 2024-12-20 15:20 ` David Hildenbrand
0 siblings, 0 replies; 30+ messages in thread
From: David Hildenbrand @ 2024-12-20 15:20 UTC (permalink / raw)
To: Kirill A. Shutemov
Cc: Jens Axboe, linux-mm, linux-fsdevel, hannes, clm, linux-kernel,
willy, bfoster, Vlastimil Babka
On 20.12.24 16:12, Kirill A. Shutemov wrote:
> On Fri, Dec 20, 2024 at 04:03:44PM +0100, David Hildenbrand wrote:
>> On 20.12.24 12:08, Kirill A. Shutemov wrote:
>>> On Fri, Dec 13, 2024 at 08:55:18AM -0700, Jens Axboe wrote:
>>>> Add a folio flag that file IO can use to indicate that the cached IO
>>>> being done should be dropped from the page cache upon completion.
>>>>
>>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>>
>>> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>>>
>>> + David, Vlastimil.
>>>
>>> I think we should consider converting existing folio_set_reclaim() /
>>> SetPageReclaim() users to the new flag. From a quick scan, all of them
>>> would benefit from dropping the page after writeback is complete instead
>>> of leaving the folio on the LRU.
>>
>> I wonder if there are some use cases where we write a lot of data and then
>> only consume it read-only from that point on (databases? fancy AI stuff? no
>> idea :) ).
>
> Do we use PG_reclaim for such cases?
Good point. I looked at the pageout() case, but there we really want to
page out that thing and not just write it back, so it makes sense to me.
--
Cheers,
David / dhildenb
* Re: [PATCH 04/11] mm: add PG_dropbehind folio flag
2024-12-20 15:11 ` Kirill A. Shutemov
@ 2024-12-20 15:45 ` Matthew Wilcox
0 siblings, 0 replies; 30+ messages in thread
From: Matthew Wilcox @ 2024-12-20 15:45 UTC (permalink / raw)
To: Kirill A. Shutemov
Cc: Jens Axboe, linux-mm, linux-fsdevel, hannes, clm, linux-kernel,
bfoster, David Hildenbrand, Vlastimil Babka
On Fri, Dec 20, 2024 at 05:11:52PM +0200, Kirill A. Shutemov wrote:
> On Fri, Dec 20, 2024 at 02:38:09PM +0000, Matthew Wilcox wrote:
> > On Fri, Dec 20, 2024 at 01:08:39PM +0200, Kirill A. Shutemov wrote:
> > > On Fri, Dec 13, 2024 at 08:55:18AM -0700, Jens Axboe wrote:
> > > > Add a folio flag that file IO can use to indicate that the cached IO
> > > > being done should be dropped from the page cache upon completion.
> > > >
> > > > Signed-off-by: Jens Axboe <axboe@kernel.dk>
> > >
> > > Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > >
> > > + David, Vlastimil.
> > >
> > > I think we should consider converting existing folio_set_reclaim() /
> > > SetPageReclaim() users to the new flag. From a quick scan, all of them
> > > would benefit from dropping the page after writeback is complete instead
> > > of leaving the folio on the LRU.
> >
> > Ooh, that would be nice. Removes the overloading of PG_reclaim with
> > PG_readahead, right?
>
> Yep.
Then ... maybe this series should just co-opt the PG_reclaim flag for
its purposes?
Going through the users:
lru_deactivate_file()
---------------------
Called due to mapping_try_invalidate() failing to invalidate.
Absolutely, get rid of this folio as quickly as possible.
pageout()
---------
Again, we're trying to get rid of this folio. This time due to memory
pressure / reaching the end of the LRU list. Yup, we want it gone.
shrink_folio_list()
-------------------
Again, end of the LRU (both cases in this function).
zswap_writeback_entry()
-----------------------
This is exactly the same case that Jens is adding. The swapcache is
being used as a staging location, and we want the folio gone as soon as
writeback completes.