* [PATCHSET 0/5] Support for RWF_UNCACHED
@ 2019-12-10 16:24 Jens Axboe
  2019-12-10 16:24 ` [PATCH 1/5] fs: add read support " Jens Axboe
                   ` (6 more replies)
  0 siblings, 7 replies; 20+ messages in thread
From: Jens Axboe @ 2019-12-10 16:24 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel, linux-block
Recently someone asked me how io_uring buffered IO compares to mmaped
IO in terms of performance. So I ran some tests with buffered IO, and
found the experience to be somewhat painful. The test case is pretty
basic, random reads over a dataset that's 10x the size of RAM.
Performance starts out fine, and then the page cache fills up and we
hit a throughput cliff. CPU usage of the IO threads go up, and we have
kswapd spending 100% of a core trying to keep up. Seeing that, I was
reminded of the many complaints I here about buffered IO, and the fact
that most of the folks complaining will ultimately bite the bullet and
move to O_DIRECT to just get the kernel out of the way.
But I don't think it needs to be like that. Switching to O_DIRECT isn't
always easily doable. The buffers have different life times, size and
alignment constraints, etc. On top of that, mixing buffered and O_DIRECT
can be painful.
Seems to me that we have an opportunity to provide something that sits
somewhere in between buffered and O_DIRECT, and this is where
RWF_UNCACHED enters the picture. If this flag is set on IO, we get the
following behavior:
- If the data is in cache, it remains in cache and the copy (in or out)
  is served to/from that.
- If the data is NOT in cache, we add it while performing the IO. When
  the IO is done, we remove it again.
With this, I can do 100% smooth buffered reads or writes without pushing
the kernel to the state where kswapd is sweating bullets. In fact it
doesn't even register.
Comments appreciated! Patches are against current git (ish), and can
also be found here:
https://git.kernel.dk/cgit/linux-block/log/?h=buffered-uncached
 fs/ceph/file.c          |   2 +-
 fs/dax.c                |   2 +-
 fs/ext4/file.c          |   2 +-
 fs/iomap/apply.c        |   2 +-
 fs/iomap/buffered-io.c  |  75 +++++++++++++++++------
 fs/iomap/direct-io.c    |   3 +-
 fs/iomap/fiemap.c       |   5 +-
 fs/iomap/seek.c         |   6 +-
 fs/iomap/swapfile.c     |   2 +-
 fs/nfs/file.c           |   2 +-
 include/linux/fs.h      |   9 ++-
 include/linux/iomap.h   |   6 +-
 include/uapi/linux/fs.h |   5 +-
 mm/filemap.c            | 132 ++++++++++++++++++++++++++++++++++++----
 14 files changed, 208 insertions(+), 45 deletions(-)
-- 
Jens Axboe
^ permalink raw reply	[flat|nested] 20+ messages in thread
* [PATCH 1/5] fs: add read support for RWF_UNCACHED
  2019-12-10 16:24 [PATCHSET 0/5] Support for RWF_UNCACHED Jens Axboe
@ 2019-12-10 16:24 ` Jens Axboe
  2019-12-10 16:24 ` [PATCH 2/5] mm: make generic_perform_write() take a struct kiocb Jens Axboe
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 20+ messages in thread
From: Jens Axboe @ 2019-12-10 16:24 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel, linux-block; +Cc: Jens Axboe
If RWF_UNCACHED is set for io_uring (or preadv2(2)), we'll drop the
cache for buffered reads if we are the ones instantiating it. If the
data is already cached, we leave it cached.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/fs.h      |  3 +++
 include/uapi/linux/fs.h |  5 ++++-
 mm/filemap.c            | 46 ++++++++++++++++++++++++++++++++++++-----
 3 files changed, 48 insertions(+), 6 deletions(-)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 98e0349adb52..092ea2a4319b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -314,6 +314,7 @@ enum rw_hint {
 #define IOCB_SYNC		(1 << 5)
 #define IOCB_WRITE		(1 << 6)
 #define IOCB_NOWAIT		(1 << 7)
+#define IOCB_UNCACHED		(1 << 8)
 
 struct kiocb {
 	struct file		*ki_filp;
@@ -3418,6 +3419,8 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags)
 		ki->ki_flags |= (IOCB_DSYNC | IOCB_SYNC);
 	if (flags & RWF_APPEND)
 		ki->ki_flags |= IOCB_APPEND;
+	if (flags & RWF_UNCACHED)
+		ki->ki_flags |= IOCB_UNCACHED;
 	return 0;
 }
 
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 379a612f8f1d..357ebb0e0c5d 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -299,8 +299,11 @@ typedef int __bitwise __kernel_rwf_t;
 /* per-IO O_APPEND */
 #define RWF_APPEND	((__force __kernel_rwf_t)0x00000010)
 
+/* drop cache after reading or writing data */
+#define RWF_UNCACHED	((__force __kernel_rwf_t)0x00000040)
+
 /* mask of flags supported by the kernel */
 #define RWF_SUPPORTED	(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
-			 RWF_APPEND)
+			 RWF_APPEND | RWF_UNCACHED)
 
 #endif /* _UAPI_LINUX_FS_H */
diff --git a/mm/filemap.c b/mm/filemap.c
index bf6aa30be58d..ed23a11b3e34 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -933,8 +933,8 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
 }
 EXPORT_SYMBOL(add_to_page_cache_locked);
 
-int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
-				pgoff_t offset, gfp_t gfp_mask)
+static int __add_to_page_cache(struct page *page, struct address_space *mapping,
+			       pgoff_t offset, gfp_t gfp_mask, bool lru)
 {
 	void *shadow = NULL;
 	int ret;
@@ -956,9 +956,17 @@ int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
 		WARN_ON_ONCE(PageActive(page));
 		if (!(gfp_mask & __GFP_WRITE) && shadow)
 			workingset_refault(page, shadow);
-		lru_cache_add(page);
+		if (lru)
+			lru_cache_add(page);
 	}
 	return ret;
+
+}
+
+int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
+				pgoff_t offset, gfp_t gfp_mask)
+{
+	return __add_to_page_cache(page, mapping, offset, gfp_mask, true);
 }
 EXPORT_SYMBOL_GPL(add_to_page_cache_lru);
 
@@ -2032,6 +2040,7 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 	offset = *ppos & ~PAGE_MASK;
 
 	for (;;) {
+		bool drop_page = false;
 		struct page *page;
 		pgoff_t end_index;
 		loff_t isize;
@@ -2048,6 +2057,9 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 		if (!page) {
 			if (iocb->ki_flags & IOCB_NOWAIT)
 				goto would_block;
+			/* UNCACHED implies no read-ahead */
+			if (iocb->ki_flags & IOCB_UNCACHED)
+				goto no_cached_page;
 			page_cache_sync_readahead(mapping,
 					ra, filp,
 					index, last_index - index);
@@ -2147,6 +2159,26 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 		offset &= ~PAGE_MASK;
 		prev_offset = offset;
 
+		/*
+		 * If we're dropping this page due to drop-behind, then
+		 * lock it first. Ignore errors here, we can just leave it
+		 * in the page cache. Note that we didn't add this page to
+		 * the LRU when we added it to the page cache. So if we
+		 * fail removing it, or lock it, add to the LRU.
+		 */
+		if (drop_page) {
+			bool addlru = true;
+
+			if (!lock_page_killable(page)) {
+				if (page->mapping == mapping)
+					addlru = !remove_mapping(mapping, page);
+				else
+					addlru = false;
+				unlock_page(page);
+			}
+			if (addlru)
+				lru_cache_add(page);
+		}
 		put_page(page);
 		written += ret;
 		if (!iov_iter_count(iter))
@@ -2234,8 +2266,12 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 			error = -ENOMEM;
 			goto out;
 		}
-		error = add_to_page_cache_lru(page, mapping, index,
-				mapping_gfp_constraint(mapping, GFP_KERNEL));
+		if (iocb->ki_flags & IOCB_UNCACHED)
+			drop_page = true;
+
+		error = __add_to_page_cache(page, mapping, index,
+				mapping_gfp_constraint(mapping, GFP_KERNEL),
+				!drop_page);
 		if (error) {
 			put_page(page);
 			if (error == -EEXIST) {
-- 
2.24.0
^ permalink raw reply related	[flat|nested] 20+ messages in thread
* [PATCH 2/5] mm: make generic_perform_write() take a struct kiocb
  2019-12-10 16:24 [PATCHSET 0/5] Support for RWF_UNCACHED Jens Axboe
  2019-12-10 16:24 ` [PATCH 1/5] fs: add read support " Jens Axboe
@ 2019-12-10 16:24 ` Jens Axboe
  2019-12-10 16:24 ` [PATCH 3/5] mm: make buffered writes work with RWF_UNCACHED Jens Axboe
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 20+ messages in thread
From: Jens Axboe @ 2019-12-10 16:24 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel, linux-block; +Cc: Jens Axboe
Right now all callers pass in iocb->ki_pos, just pass in the iocb. This
is in preparation for using the iocb flags in generic_perform_write().
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/ceph/file.c     | 2 +-
 fs/ext4/file.c     | 2 +-
 fs/nfs/file.c      | 2 +-
 include/linux/fs.h | 3 ++-
 mm/filemap.c       | 8 +++++---
 5 files changed, 10 insertions(+), 7 deletions(-)
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 11929d2bb594..096c009f188f 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -1538,7 +1538,7 @@ static ssize_t ceph_write_iter(struct kiocb *iocb, struct iov_iter *from)
 		 * are pending vmtruncate. So write and vmtruncate
 		 * can not run at the same time
 		 */
-		written = generic_perform_write(file, from, pos);
+		written = generic_perform_write(file, from, iocb);
 		if (likely(written >= 0))
 			iocb->ki_pos = pos + written;
 		ceph_end_io_write(inode);
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 6a7293a5cda2..9ffb857765d5 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -249,7 +249,7 @@ static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
 		goto out;
 
 	current->backing_dev_info = inode_to_bdi(inode);
-	ret = generic_perform_write(iocb->ki_filp, from, iocb->ki_pos);
+	ret = generic_perform_write(iocb->ki_filp, from, iocb);
 	current->backing_dev_info = NULL;
 
 out:
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 8eb731d9be3e..d8f51a702a4e 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -624,7 +624,7 @@ ssize_t nfs_file_write(struct kiocb *iocb, struct iov_iter *from)
 	result = generic_write_checks(iocb, from);
 	if (result > 0) {
 		current->backing_dev_info = inode_to_bdi(inode);
-		result = generic_perform_write(file, from, iocb->ki_pos);
+		result = generic_perform_write(file, from, iocb);
 		current->backing_dev_info = NULL;
 	}
 	nfs_end_io_write(inode);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 092ea2a4319b..bf58db1bc032 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3103,7 +3103,8 @@ extern ssize_t generic_file_read_iter(struct kiocb *, struct iov_iter *);
 extern ssize_t __generic_file_write_iter(struct kiocb *, struct iov_iter *);
 extern ssize_t generic_file_write_iter(struct kiocb *, struct iov_iter *);
 extern ssize_t generic_file_direct_write(struct kiocb *, struct iov_iter *);
-extern ssize_t generic_perform_write(struct file *, struct iov_iter *, loff_t);
+extern ssize_t generic_perform_write(struct file *, struct iov_iter *,
+				     struct kiocb *);
 
 ssize_t vfs_iter_read(struct file *file, struct iov_iter *iter, loff_t *ppos,
 		rwf_t flags);
diff --git a/mm/filemap.c b/mm/filemap.c
index ed23a11b3e34..fe37bd2b2630 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3302,10 +3302,11 @@ struct page *grab_cache_page_write_begin(struct address_space *mapping,
 EXPORT_SYMBOL(grab_cache_page_write_begin);
 
 ssize_t generic_perform_write(struct file *file,
-				struct iov_iter *i, loff_t pos)
+				struct iov_iter *i, struct kiocb *iocb)
 {
 	struct address_space *mapping = file->f_mapping;
 	const struct address_space_operations *a_ops = mapping->a_ops;
+	loff_t pos = iocb->ki_pos;
 	long status = 0;
 	ssize_t written = 0;
 	unsigned int flags = 0;
@@ -3439,7 +3440,8 @@ ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 		if (written < 0 || !iov_iter_count(from) || IS_DAX(inode))
 			goto out;
 
-		status = generic_perform_write(file, from, pos = iocb->ki_pos);
+		pos = iocb->ki_pos;
+		status = generic_perform_write(file, from, iocb);
 		/*
 		 * If generic_perform_write() returned a synchronous error
 		 * then we want to return the number of bytes which were
@@ -3471,7 +3473,7 @@ ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 			 */
 		}
 	} else {
-		written = generic_perform_write(file, from, iocb->ki_pos);
+		written = generic_perform_write(file, from, iocb);
 		if (likely(written > 0))
 			iocb->ki_pos += written;
 	}
-- 
2.24.0
^ permalink raw reply related	[flat|nested] 20+ messages in thread
* [PATCH 3/5] mm: make buffered writes work with RWF_UNCACHED
  2019-12-10 16:24 [PATCHSET 0/5] Support for RWF_UNCACHED Jens Axboe
  2019-12-10 16:24 ` [PATCH 1/5] fs: add read support " Jens Axboe
  2019-12-10 16:24 ` [PATCH 2/5] mm: make generic_perform_write() take a struct kiocb Jens Axboe
@ 2019-12-10 16:24 ` Jens Axboe
  2019-12-10 16:55   ` Matthew Wilcox
  2019-12-11  0:23   ` Dave Chinner
  2019-12-10 16:24 ` [PATCH 4/5] iomap: pass in the write_begin/write_end flags to iomap_actor Jens Axboe
                   ` (3 subsequent siblings)
  6 siblings, 2 replies; 20+ messages in thread
From: Jens Axboe @ 2019-12-10 16:24 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel, linux-block; +Cc: Jens Axboe
If RWF_UNCACHED is set for io_uring (or pwritev2(2)), we'll drop the
cache instantiated for buffered writes. If new pages aren't
instantiated, we leave them alone. This provides similar semantics to
reads with RWF_UNCACHED set.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/fs.h |  3 ++
 mm/filemap.c       | 78 +++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 76 insertions(+), 5 deletions(-)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index bf58db1bc032..bcf486c132a8 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -285,6 +285,7 @@ enum positive_aop_returns {
 #define AOP_FLAG_NOFS			0x0002 /* used by filesystem to direct
 						* helper code (eg buffer layer)
 						* to clear GFP_FS from alloc */
+#define AOP_FLAG_UNCACHED		0x0004
 
 /*
  * oh the beauties of C type declarations.
@@ -3105,6 +3106,8 @@ extern ssize_t generic_file_write_iter(struct kiocb *, struct iov_iter *);
 extern ssize_t generic_file_direct_write(struct kiocb *, struct iov_iter *);
 extern ssize_t generic_perform_write(struct file *, struct iov_iter *,
 				     struct kiocb *);
+extern void write_drop_cached_pages(struct page **,
+				    struct address_space *mapping, unsigned *);
 
 ssize_t vfs_iter_read(struct file *file, struct iov_iter *iter, loff_t *ppos,
 		rwf_t flags);
diff --git a/mm/filemap.c b/mm/filemap.c
index fe37bd2b2630..d6171bf705f9 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3287,10 +3287,12 @@ struct page *grab_cache_page_write_begin(struct address_space *mapping,
 					pgoff_t index, unsigned flags)
 {
 	struct page *page;
-	int fgp_flags = FGP_LOCK|FGP_WRITE|FGP_CREAT;
+	int fgp_flags = FGP_LOCK|FGP_WRITE;
 
 	if (flags & AOP_FLAG_NOFS)
 		fgp_flags |= FGP_NOFS;
+	if (!(flags & AOP_FLAG_UNCACHED))
+		fgp_flags |= FGP_CREAT;
 
 	page = pagecache_get_page(mapping, index, fgp_flags,
 			mapping_gfp_mask(mapping));
@@ -3301,21 +3303,67 @@ struct page *grab_cache_page_write_begin(struct address_space *mapping,
 }
 EXPORT_SYMBOL(grab_cache_page_write_begin);
 
+/*
+ * Start writeback on the pages in pgs[], and then try and remove those pages
+ * from the page cached. Used with RWF_UNCACHED.
+ */
+void write_drop_cached_pages(struct page **pgs, struct address_space *mapping,
+			     unsigned *nr)
+{
+	loff_t start, end;
+	int i;
+
+	end = 0;
+	start = LLONG_MAX;
+	for (i = 0; i < *nr; i++) {
+		struct page *page = pgs[i];
+		loff_t off;
+
+		off = (loff_t) page_to_index(page) << PAGE_SHIFT;
+		if (off < start)
+			start = off;
+		if (off > end)
+			end = off;
+		get_page(page);
+	}
+
+	__filemap_fdatawrite_range(mapping, start, end, WB_SYNC_NONE);
+
+	for (i = 0; i < *nr; i++) {
+		struct page *page = pgs[i];
+
+		lock_page(page);
+		if (page->mapping == mapping) {
+			wait_on_page_writeback(page);
+			if (!page_has_private(page) ||
+			    try_to_release_page(page, 0))
+				remove_mapping(mapping, page);
+		}
+		unlock_page(page);
+	}
+	*nr = 0;
+}
+EXPORT_SYMBOL_GPL(write_drop_cached_pages);
+
+#define GPW_PAGE_BATCH		16
+
 ssize_t generic_perform_write(struct file *file,
 				struct iov_iter *i, struct kiocb *iocb)
 {
 	struct address_space *mapping = file->f_mapping;
 	const struct address_space_operations *a_ops = mapping->a_ops;
+	struct page *drop_pages[GPW_PAGE_BATCH];
 	loff_t pos = iocb->ki_pos;
 	long status = 0;
 	ssize_t written = 0;
-	unsigned int flags = 0;
+	unsigned int flags = 0, drop_nr = 0;
 
 	do {
 		struct page *page;
 		unsigned long offset;	/* Offset into pagecache page */
 		unsigned long bytes;	/* Bytes to write to page */
 		size_t copied;		/* Bytes copied from user */
+		bool drop_page = false;	/* drop page after IO */
 		void *fsdata;
 
 		offset = (pos & (PAGE_SIZE - 1));
@@ -3323,6 +3371,9 @@ ssize_t generic_perform_write(struct file *file,
 						iov_iter_count(i));
 
 again:
+		if (iocb->ki_flags & IOCB_UNCACHED)
+			flags |= AOP_FLAG_UNCACHED;
+
 		/*
 		 * Bring in the user page that we will copy from _first_.
 		 * Otherwise there's a nasty deadlock on copying from the
@@ -3343,10 +3394,17 @@ ssize_t generic_perform_write(struct file *file,
 			break;
 		}
 
+retry:
 		status = a_ops->write_begin(file, mapping, pos, bytes, flags,
 						&page, &fsdata);
-		if (unlikely(status < 0))
+		if (unlikely(status < 0)) {
+			if (status == -ENOMEM && (flags & AOP_FLAG_UNCACHED)) {
+				drop_page = true;
+				flags &= ~AOP_FLAG_UNCACHED;
+				goto retry;
+			}
 			break;
+		}
 
 		if (mapping_writably_mapped(mapping))
 			flush_dcache_page(page);
@@ -3376,12 +3434,22 @@ ssize_t generic_perform_write(struct file *file,
 						iov_iter_single_seg_count(i));
 			goto again;
 		}
+		if (drop_page &&
+		    ((pos >> PAGE_SHIFT) != ((pos + copied) >> PAGE_SHIFT))) {
+			drop_pages[drop_nr] = page;
+			if (++drop_nr == GPW_PAGE_BATCH)
+				write_drop_cached_pages(drop_pages, mapping,
+								&drop_nr);
+		} else
+			balance_dirty_pages_ratelimited(mapping);
+
 		pos += copied;
 		written += copied;
-
-		balance_dirty_pages_ratelimited(mapping);
 	} while (iov_iter_count(i));
 
+	if (drop_nr)
+		write_drop_cached_pages(drop_pages, mapping, &drop_nr);
+
 	return written ? written : status;
 }
 EXPORT_SYMBOL(generic_perform_write);
-- 
2.24.0
^ permalink raw reply related	[flat|nested] 20+ messages in thread
* [PATCH 4/5] iomap: pass in the write_begin/write_end flags to iomap_actor
  2019-12-10 16:24 [PATCHSET 0/5] Support for RWF_UNCACHED Jens Axboe
                   ` (2 preceding siblings ...)
  2019-12-10 16:24 ` [PATCH 3/5] mm: make buffered writes work with RWF_UNCACHED Jens Axboe
@ 2019-12-10 16:24 ` Jens Axboe
  2019-12-10 16:24 ` [PATCH 5/5] iomap: support RWF_UNCACHED for buffered writes Jens Axboe
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 20+ messages in thread
From: Jens Axboe @ 2019-12-10 16:24 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel, linux-block; +Cc: Jens Axboe
This is in preparation for passing in a flag to the iomap_actor, which
currently doesn't support that.
No functional changes in this patch.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/dax.c               |  2 +-
 fs/iomap/apply.c       |  2 +-
 fs/iomap/buffered-io.c | 17 ++++++++++-------
 fs/iomap/direct-io.c   |  3 ++-
 fs/iomap/fiemap.c      |  5 +++--
 fs/iomap/seek.c        |  6 ++++--
 fs/iomap/swapfile.c    |  2 +-
 include/linux/iomap.h  |  5 +++--
 8 files changed, 25 insertions(+), 17 deletions(-)
diff --git a/fs/dax.c b/fs/dax.c
index 1f1f0201cad1..30a20b994140 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1091,7 +1091,7 @@ EXPORT_SYMBOL_GPL(__dax_zero_page_range);
 
 static loff_t
 dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
-		struct iomap *iomap, struct iomap *srcmap)
+		unsigned flags, struct iomap *iomap, struct iomap *srcmap)
 {
 	struct block_device *bdev = iomap->bdev;
 	struct dax_device *dax_dev = iomap->dax_dev;
diff --git a/fs/iomap/apply.c b/fs/iomap/apply.c
index 76925b40b5fd..562536da8a13 100644
--- a/fs/iomap/apply.c
+++ b/fs/iomap/apply.c
@@ -77,7 +77,7 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags,
 	 * iomap into the actors so that they don't need to have special
 	 * handling for the two cases.
 	 */
-	written = actor(inode, pos, length, data, &iomap,
+	written = actor(inode, pos, length, data, flags, &iomap,
 			srcmap.type != IOMAP_HOLE ? &srcmap : &iomap);
 
 	/*
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 828444e14d09..9b5b770ca4c7 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -249,7 +249,7 @@ static inline bool iomap_block_needs_zeroing(struct inode *inode,
 
 static loff_t
 iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
-		struct iomap *iomap, struct iomap *srcmap)
+		unsigned flags, struct iomap *iomap, struct iomap *srcmap)
 {
 	struct iomap_readpage_ctx *ctx = data;
 	struct page *page = ctx->cur_page;
@@ -397,7 +397,8 @@ iomap_next_page(struct inode *inode, struct list_head *pages, loff_t pos,
 
 static loff_t
 iomap_readpages_actor(struct inode *inode, loff_t pos, loff_t length,
-		void *data, struct iomap *iomap, struct iomap *srcmap)
+		void *data, unsigned flags, struct iomap *iomap,
+		struct iomap *srcmap)
 {
 	struct iomap_readpage_ctx *ctx = data;
 	loff_t done, ret;
@@ -417,7 +418,7 @@ iomap_readpages_actor(struct inode *inode, loff_t pos, loff_t length,
 			ctx->cur_page_in_bio = false;
 		}
 		ret = iomap_readpage_actor(inode, pos + done, length - done,
-				ctx, iomap, srcmap);
+				ctx, 0, iomap, srcmap);
 	}
 
 	return done;
@@ -797,7 +798,7 @@ iomap_write_end(struct inode *inode, loff_t pos, unsigned len, unsigned copied,
 
 static loff_t
 iomap_write_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
-		struct iomap *iomap, struct iomap *srcmap)
+		unsigned flags, struct iomap *iomap, struct iomap *srcmap)
 {
 	struct iov_iter *i = data;
 	long status = 0;
@@ -897,7 +898,7 @@ EXPORT_SYMBOL_GPL(iomap_file_buffered_write);
 
 static loff_t
 iomap_unshare_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
-		struct iomap *iomap, struct iomap *srcmap)
+		unsigned flags, struct iomap *iomap, struct iomap *srcmap)
 {
 	long status = 0;
 	ssize_t written = 0;
@@ -983,7 +984,8 @@ static int iomap_dax_zero(loff_t pos, unsigned offset, unsigned bytes,
 
 static loff_t
 iomap_zero_range_actor(struct inode *inode, loff_t pos, loff_t count,
-		void *data, struct iomap *iomap, struct iomap *srcmap)
+		void *data, unsigned flags, struct iomap *iomap,
+		struct iomap *srcmap)
 {
 	bool *did_zero = data;
 	loff_t written = 0;
@@ -1053,7 +1055,8 @@ EXPORT_SYMBOL_GPL(iomap_truncate_page);
 
 static loff_t
 iomap_page_mkwrite_actor(struct inode *inode, loff_t pos, loff_t length,
-		void *data, struct iomap *iomap, struct iomap *srcmap)
+		void *data, unsigned flags, struct iomap *iomap,
+		struct iomap *srcmap)
 {
 	struct page *page = data;
 	int ret;
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 23837926c0c5..2525997b09aa 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -365,7 +365,8 @@ iomap_dio_inline_actor(struct inode *inode, loff_t pos, loff_t length,
 
 static loff_t
 iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
-		void *data, struct iomap *iomap, struct iomap *srcmap)
+		void *data, unsigned flags, struct iomap *iomap,
+		struct iomap *srcmap)
 {
 	struct iomap_dio *dio = data;
 
diff --git a/fs/iomap/fiemap.c b/fs/iomap/fiemap.c
index bccf305ea9ce..04de960259d0 100644
--- a/fs/iomap/fiemap.c
+++ b/fs/iomap/fiemap.c
@@ -44,7 +44,7 @@ static int iomap_to_fiemap(struct fiemap_extent_info *fi,
 
 static loff_t
 iomap_fiemap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
-		struct iomap *iomap, struct iomap *srcmap)
+		unsigned flags, struct iomap *iomap, struct iomap *srcmap)
 {
 	struct fiemap_ctx *ctx = data;
 	loff_t ret = length;
@@ -111,7 +111,8 @@ EXPORT_SYMBOL_GPL(iomap_fiemap);
 
 static loff_t
 iomap_bmap_actor(struct inode *inode, loff_t pos, loff_t length,
-		void *data, struct iomap *iomap, struct iomap *srcmap)
+		void *data, unsigned flags, struct iomap *iomap,
+		struct iomap *srcmap)
 {
 	sector_t *bno = data, addr;
 
diff --git a/fs/iomap/seek.c b/fs/iomap/seek.c
index 89f61d93c0bc..a5cbf04e8cb3 100644
--- a/fs/iomap/seek.c
+++ b/fs/iomap/seek.c
@@ -119,7 +119,8 @@ page_cache_seek_hole_data(struct inode *inode, loff_t offset, loff_t length,
 
 static loff_t
 iomap_seek_hole_actor(struct inode *inode, loff_t offset, loff_t length,
-		      void *data, struct iomap *iomap, struct iomap *srcmap)
+		      void *data, unsigned flags, struct iomap *iomap,
+		      struct iomap *srcmap)
 {
 	switch (iomap->type) {
 	case IOMAP_UNWRITTEN:
@@ -165,7 +166,8 @@ EXPORT_SYMBOL_GPL(iomap_seek_hole);
 
 static loff_t
 iomap_seek_data_actor(struct inode *inode, loff_t offset, loff_t length,
-		      void *data, struct iomap *iomap, struct iomap *srcmap)
+		      void *data, unsigned flags, struct iomap *iomap,
+		      struct iomap *srcmap)
 {
 	switch (iomap->type) {
 	case IOMAP_HOLE:
diff --git a/fs/iomap/swapfile.c b/fs/iomap/swapfile.c
index a648dbf6991e..774bfc3e59e1 100644
--- a/fs/iomap/swapfile.c
+++ b/fs/iomap/swapfile.c
@@ -76,7 +76,7 @@ static int iomap_swapfile_add_extent(struct iomap_swapfile_info *isi)
  * distinction between written and unwritten extents.
  */
 static loff_t iomap_swapfile_activate_actor(struct inode *inode, loff_t pos,
-		loff_t count, void *data, struct iomap *iomap,
+		loff_t count, void *data, unsigned flags, struct iomap *iomap,
 		struct iomap *srcmap)
 {
 	struct iomap_swapfile_info *isi = data;
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 8b09463dae0d..61fcaa3904d4 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -113,7 +113,7 @@ struct iomap_page_ops {
 };
 
 /*
- * Flags for iomap_begin / iomap_end.  No flag implies a read.
+ * Flags for iomap_begin / iomap_end / factor.  No flag implies a read.
  */
 #define IOMAP_WRITE		(1 << 0) /* writing, must allocate blocks */
 #define IOMAP_ZERO		(1 << 1) /* zeroing operation, may skip holes */
@@ -146,7 +146,8 @@ struct iomap_ops {
  * Main iomap iterator function.
  */
 typedef loff_t (*iomap_actor_t)(struct inode *inode, loff_t pos, loff_t len,
-		void *data, struct iomap *iomap, struct iomap *srcmap);
+		void *data, unsigned flags, struct iomap *iomap,
+		struct iomap *srcmap);
 
 loff_t iomap_apply(struct inode *inode, loff_t pos, loff_t length,
 		unsigned flags, const struct iomap_ops *ops, void *data,
-- 
2.24.0
^ permalink raw reply related	[flat|nested] 20+ messages in thread
* [PATCH 5/5] iomap: support RWF_UNCACHED for buffered writes
  2019-12-10 16:24 [PATCHSET 0/5] Support for RWF_UNCACHED Jens Axboe
                   ` (3 preceding siblings ...)
  2019-12-10 16:24 ` [PATCH 4/5] iomap: pass in the write_begin/write_end flags to iomap_actor Jens Axboe
@ 2019-12-10 16:24 ` Jens Axboe
  2019-12-10 21:17 ` [PATCHSET 0/5] Support for RWF_UNCACHED Andreas Dilger
  2019-12-12 15:47 ` Christoph Hellwig
  6 siblings, 0 replies; 20+ messages in thread
From: Jens Axboe @ 2019-12-10 16:24 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel, linux-block; +Cc: Jens Axboe
This adds support for RWF_UNCACHED for file systems using iomap to
perform buffered writes. We use the generic infrastructure for this,
by tracking pages we created and calling write_drop_cached_pages()
to issue writeback and prune those pages.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/iomap/buffered-io.c | 58 ++++++++++++++++++++++++++++++++++--------
 include/linux/iomap.h  |  1 +
 2 files changed, 49 insertions(+), 10 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 9b5b770ca4c7..c8d36b280ff2 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -566,6 +566,7 @@ EXPORT_SYMBOL_GPL(iomap_migrate_page);
 
 enum {
 	IOMAP_WRITE_F_UNSHARE		= (1 << 0),
+	IOMAP_WRITE_F_UNCACHED		= (1 << 1),
 };
 
 static void
@@ -643,6 +644,7 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
 		struct page **pagep, struct iomap *iomap, struct iomap *srcmap)
 {
 	const struct iomap_page_ops *page_ops = iomap->page_ops;
+	unsigned aop_flags;
 	struct page *page;
 	int status = 0;
 
@@ -659,8 +661,11 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
 			return status;
 	}
 
+	aop_flags = AOP_FLAG_NOFS;
+	if (flags & IOMAP_UNCACHED)
+		aop_flags |= AOP_FLAG_UNCACHED;
 	page = grab_cache_page_write_begin(inode->i_mapping, pos >> PAGE_SHIFT,
-			AOP_FLAG_NOFS);
+						aop_flags);
 	if (!page) {
 		status = -ENOMEM;
 		goto out_no_page;
@@ -670,9 +675,14 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
 		iomap_read_inline_data(inode, page, srcmap);
 	else if (iomap->flags & IOMAP_F_BUFFER_HEAD)
 		status = __block_write_begin_int(page, pos, len, NULL, srcmap);
-	else
-		status = __iomap_write_begin(inode, pos, len, flags, page,
+	else {
+		unsigned wb_flags = 0;
+
+		if (flags & IOMAP_UNCACHED)
+			wb_flags = IOMAP_WRITE_F_UNCACHED;
+		status = __iomap_write_begin(inode, pos, len, wb_flags, page,
 				srcmap);
+	}
 
 	if (unlikely(status))
 		goto out_unlock;
@@ -796,19 +806,25 @@ iomap_write_end(struct inode *inode, loff_t pos, unsigned len, unsigned copied,
 	return ret;
 }
 
+#define GPW_PAGE_BATCH		16
+
 static loff_t
 iomap_write_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		unsigned flags, struct iomap *iomap, struct iomap *srcmap)
 {
+	struct address_space *mapping = inode->i_mapping;
+	struct page *drop_pages[GPW_PAGE_BATCH];
 	struct iov_iter *i = data;
 	long status = 0;
 	ssize_t written = 0;
+	unsigned drop_nr = 0;
 
 	do {
 		struct page *page;
 		unsigned long offset;	/* Offset into pagecache page */
 		unsigned long bytes;	/* Bytes to write to page */
 		size_t copied;		/* Bytes copied from user */
+		bool drop_page = false;	/* drop page after IO */
 
 		offset = offset_in_page(pos);
 		bytes = min_t(unsigned long, PAGE_SIZE - offset,
@@ -832,10 +848,17 @@ iomap_write_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 			break;
 		}
 
-		status = iomap_write_begin(inode, pos, bytes, 0, &page, iomap,
-				srcmap);
-		if (unlikely(status))
+retry:
+		status = iomap_write_begin(inode, pos, bytes, flags, &page,
+						iomap, srcmap);
+		if (unlikely(status)) {
+			if (status == -ENOMEM && (flags & IOMAP_UNCACHED)) {
+				drop_page = true;
+				flags &= ~IOMAP_UNCACHED;
+				goto retry;
+			}
 			break;
+		}
 
 		if (mapping_writably_mapped(inode->i_mapping))
 			flush_dcache_page(page);
@@ -866,13 +889,24 @@ iomap_write_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 						iov_iter_single_seg_count(i));
 			goto again;
 		}
+
+		if (drop_page &&
+		    ((pos >> PAGE_SHIFT) != ((pos + copied) >> PAGE_SHIFT))) {
+			drop_pages[drop_nr] = page;
+			if (++drop_nr == GPW_PAGE_BATCH)
+				write_drop_cached_pages(drop_pages, mapping,
+								&drop_nr);
+		} else
+			balance_dirty_pages_ratelimited(inode->i_mapping);
+
 		pos += copied;
 		written += copied;
 		length -= copied;
-
-		balance_dirty_pages_ratelimited(inode->i_mapping);
 	} while (iov_iter_count(i) && length);
 
+	if (drop_nr)
+		write_drop_cached_pages(drop_pages, mapping, &drop_nr);
+
 	return written ? written : status;
 }
 
@@ -882,10 +916,14 @@ iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *iter,
 {
 	struct inode *inode = iocb->ki_filp->f_mapping->host;
 	loff_t pos = iocb->ki_pos, ret = 0, written = 0;
+	unsigned flags = IOMAP_WRITE;
+
+	if (iocb->ki_flags & IOCB_UNCACHED)
+		flags |= IOMAP_UNCACHED;
 
 	while (iov_iter_count(iter)) {
-		ret = iomap_apply(inode, pos, iov_iter_count(iter),
-				IOMAP_WRITE, ops, iter, iomap_write_actor);
+		ret = iomap_apply(inode, pos, iov_iter_count(iter), flags,
+					ops, iter, iomap_write_actor);
 		if (ret <= 0)
 			break;
 		pos += ret;
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 61fcaa3904d4..833dd43507ac 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -121,6 +121,7 @@ struct iomap_page_ops {
 #define IOMAP_FAULT		(1 << 3) /* mapping for page fault */
 #define IOMAP_DIRECT		(1 << 4) /* direct I/O */
 #define IOMAP_NOWAIT		(1 << 5) /* do not block */
+#define IOMAP_UNCACHED		(1 << 6)
 
 struct iomap_ops {
 	/*
-- 
2.24.0
^ permalink raw reply related	[flat|nested] 20+ messages in thread
* Re: [PATCH 3/5] mm: make buffered writes work with RWF_UNCACHED
  2019-12-10 16:24 ` [PATCH 3/5] mm: make buffered writes work with RWF_UNCACHED Jens Axboe
@ 2019-12-10 16:55   ` Matthew Wilcox
  2019-12-10 17:02     ` Jens Axboe
  2019-12-11  0:23   ` Dave Chinner
  1 sibling, 1 reply; 20+ messages in thread
From: Matthew Wilcox @ 2019-12-10 16:55 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-mm, linux-fsdevel, linux-block
On Tue, Dec 10, 2019 at 09:24:52AM -0700, Jens Axboe wrote:
> +/*
> + * Start writeback on the pages in pgs[], and then try and remove those pages
> + * from the page cached. Used with RWF_UNCACHED.
> + */
> +void write_drop_cached_pages(struct page **pgs, struct address_space *mapping,
> +			     unsigned *nr)
It would seem more natural to use a pagevec instead of pgs/nr.
> +{
> +	loff_t start, end;
> +	int i;
> +
> +	end = 0;
> +	start = LLONG_MAX;
> +	for (i = 0; i < *nr; i++) {
> +		struct page *page = pgs[i];
> +		loff_t off;
> +
> +		off = (loff_t) page_to_index(page) << PAGE_SHIFT;
Isn't that page_offset()?
> +	__filemap_fdatawrite_range(mapping, start, end, WB_SYNC_NONE);
> +
> +	for (i = 0; i < *nr; i++) {
> +		struct page *page = pgs[i];
> +
> +		lock_page(page);
> +		if (page->mapping == mapping) {
So you're protecting against the page being freed and reallocated to a
different file, but not against the page being freed and reallocated to
a location in the same file which is outside (start, end)?
^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [PATCH 3/5] mm: make buffered writes work with RWF_UNCACHED
  2019-12-10 16:55   ` Matthew Wilcox
@ 2019-12-10 17:02     ` Jens Axboe
  2019-12-10 18:35       ` Chris Mason
  2019-12-10 18:58       ` Matthew Wilcox
  0 siblings, 2 replies; 20+ messages in thread
From: Jens Axboe @ 2019-12-10 17:02 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-mm, linux-fsdevel, linux-block
On 12/10/19 9:55 AM, Matthew Wilcox wrote:
> On Tue, Dec 10, 2019 at 09:24:52AM -0700, Jens Axboe wrote:
>> +/*
>> + * Start writeback on the pages in pgs[], and then try and remove those pages
>> + * from the page cached. Used with RWF_UNCACHED.
>> + */
>> +void write_drop_cached_pages(struct page **pgs, struct address_space *mapping,
>> +			     unsigned *nr)
> 
> It would seem more natural to use a pagevec instead of pgs/nr.
I did look into that, but they are intertwined with LRU etc. I
deliberately avoided the LRU on the read side, as it adds noticeable
overhead and gains us nothing since the pages will be dropped agian.
>> +{
>> +	loff_t start, end;
>> +	int i;
>> +
>> +	end = 0;
>> +	start = LLONG_MAX;
>> +	for (i = 0; i < *nr; i++) {
>> +		struct page *page = pgs[i];
>> +		loff_t off;
>> +
>> +		off = (loff_t) page_to_index(page) << PAGE_SHIFT;
> 
> Isn't that page_offset()?
I guess it is! I'll make that change.
>> +	__filemap_fdatawrite_range(mapping, start, end, WB_SYNC_NONE);
>> +
>> +	for (i = 0; i < *nr; i++) {
>> +		struct page *page = pgs[i];
>> +
>> +		lock_page(page);
>> +		if (page->mapping == mapping) {
> 
> So you're protecting against the page being freed and reallocated to a
> different file, but not against the page being freed and reallocated
> to a location in the same file which is outside (start, end)?
I guess so, we can add that too, probably just check if the index is
still the same. More of a behavioral thing, shouldn't be any
correctness issues there.
-- 
Jens Axboe
^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [PATCH 3/5] mm: make buffered writes work with RWF_UNCACHED
  2019-12-10 17:02     ` Jens Axboe
@ 2019-12-10 18:35       ` Chris Mason
  2019-12-10 18:58       ` Matthew Wilcox
  1 sibling, 0 replies; 20+ messages in thread
From: Chris Mason @ 2019-12-10 18:35 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Matthew Wilcox, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
	linux-block@vger.kernel.org
On 10 Dec 2019, at 12:02, Jens Axboe wrote:
> On 12/10/19 9:55 AM, Matthew Wilcox wrote:
>> On Tue, Dec 10, 2019 at 09:24:52AM -0700, Jens Axboe wrote:
>>> +/*
>>> + * Start writeback on the pages in pgs[], and then try and remove 
>>> those pages
>>> + * from the page cached. Used with RWF_UNCACHED.
>>> + */
>>> +void write_drop_cached_pages(struct page **pgs, struct 
>>> address_space *mapping,
>>> +			     unsigned *nr)
>>
>> It would seem more natural to use a pagevec instead of pgs/nr.
>
> I did look into that, but they are intertwined with LRU etc. I
> deliberately avoided the LRU on the read side, as it adds noticeable
> overhead and gains us nothing since the pages will be dropped agian.
>
>>> +{
>>> +	loff_t start, end;
>>> +	int i;
>>> +
>>> +	end = 0;
>>> +	start = LLONG_MAX;
>>> +	for (i = 0; i < *nr; i++) {
>>> +		struct page *page = pgs[i];
>>> +		loff_t off;
>>> +
>>> +		off = (loff_t) page_to_index(page) << PAGE_SHIFT;
>>
>> Isn't that page_offset()?
>
> I guess it is! I'll make that change.
>
>>> +	__filemap_fdatawrite_range(mapping, start, end, WB_SYNC_NONE);
>>> +
>>> +	for (i = 0; i < *nr; i++) {
>>> +		struct page *page = pgs[i];
>>> +
>>> +		lock_page(page);
>>> +		if (page->mapping == mapping) {
>>
>> So you're protecting against the page being freed and reallocated to 
>> a
>> different file, but not against the page being freed and reallocated
>> to a location in the same file which is outside (start, end)?
>
> I guess so, we can add that too, probably just check if the index is
> still the same. More of a behavioral thing, shouldn't be any
> correctness issues there.
Since we have a reference on the page, the mapping can go to NULL but 
otherwise it should stay in the same mapping at the same offset.
But, Jens and I both just realized he needs to take the reference on the 
page before write_end is called.
-chris
^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [PATCH 3/5] mm: make buffered writes work with RWF_UNCACHED
  2019-12-10 17:02     ` Jens Axboe
  2019-12-10 18:35       ` Chris Mason
@ 2019-12-10 18:58       ` Matthew Wilcox
  2019-12-10 19:10         ` Jens Axboe
  1 sibling, 1 reply; 20+ messages in thread
From: Matthew Wilcox @ 2019-12-10 18:58 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-mm, linux-fsdevel, linux-block
On Tue, Dec 10, 2019 at 10:02:18AM -0700, Jens Axboe wrote:
> On 12/10/19 9:55 AM, Matthew Wilcox wrote:
> > On Tue, Dec 10, 2019 at 09:24:52AM -0700, Jens Axboe wrote:
> >> +/*
> >> + * Start writeback on the pages in pgs[], and then try and remove those pages
> >> + * from the page cached. Used with RWF_UNCACHED.
> >> + */
> >> +void write_drop_cached_pages(struct page **pgs, struct address_space *mapping,
> >> +			     unsigned *nr)
> > 
> > It would seem more natural to use a pagevec instead of pgs/nr.
> 
> I did look into that, but they are intertwined with LRU etc. I
> deliberately avoided the LRU on the read side, as it adds noticeable
> overhead and gains us nothing since the pages will be dropped agian.
I agree the LRU uses them, but they're used in all kinds of places where
we need to batch pages, eg truncate, munlock.
^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [PATCH 3/5] mm: make buffered writes work with RWF_UNCACHED
  2019-12-10 18:58       ` Matthew Wilcox
@ 2019-12-10 19:10         ` Jens Axboe
  0 siblings, 0 replies; 20+ messages in thread
From: Jens Axboe @ 2019-12-10 19:10 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-mm, linux-fsdevel, linux-block
On 12/10/19 11:58 AM, Matthew Wilcox wrote:
> On Tue, Dec 10, 2019 at 10:02:18AM -0700, Jens Axboe wrote:
>> On 12/10/19 9:55 AM, Matthew Wilcox wrote:
>>> On Tue, Dec 10, 2019 at 09:24:52AM -0700, Jens Axboe wrote:
>>>> +/*
>>>> + * Start writeback on the pages in pgs[], and then try and remove those pages
>>>> + * from the page cached. Used with RWF_UNCACHED.
>>>> + */
>>>> +void write_drop_cached_pages(struct page **pgs, struct address_space *mapping,
>>>> +			     unsigned *nr)
>>>
>>> It would seem more natural to use a pagevec instead of pgs/nr.
>>
>> I did look into that, but they are intertwined with LRU etc. I
>> deliberately avoided the LRU on the read side, as it adds noticeable
>> overhead and gains us nothing since the pages will be dropped agian.
> 
> I agree the LRU uses them, but they're used in all kinds of places where
> we need to batch pages, eg truncate, munlock.
I think I just had to convince myself that the release business works
the way it should for my use case, and I think it does. I'll test it.
release_pages() isn't exactly the prettiest thing out there.
-- 
Jens Axboe
^ permalink raw reply	[flat|nested] 20+ messages in thread
* [PATCH 4/5] iomap: pass in the write_begin/write_end flags to iomap_actor
  2019-12-10 20:42 [PATCHSET v2 " Jens Axboe
@ 2019-12-10 20:43 ` Jens Axboe
  0 siblings, 0 replies; 20+ messages in thread
From: Jens Axboe @ 2019-12-10 20:43 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel, linux-block; +Cc: willy, clm, Jens Axboe
This is in preparation for passing in a flag to the iomap_actor, which
currently doesn't support that.
No functional changes in this patch.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/dax.c               |  2 +-
 fs/iomap/apply.c       |  2 +-
 fs/iomap/buffered-io.c | 17 ++++++++++-------
 fs/iomap/direct-io.c   |  3 ++-
 fs/iomap/fiemap.c      |  5 +++--
 fs/iomap/seek.c        |  6 ++++--
 fs/iomap/swapfile.c    |  2 +-
 include/linux/iomap.h  |  5 +++--
 8 files changed, 25 insertions(+), 17 deletions(-)
diff --git a/fs/dax.c b/fs/dax.c
index 1f1f0201cad1..30a20b994140 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1091,7 +1091,7 @@ EXPORT_SYMBOL_GPL(__dax_zero_page_range);
 
 static loff_t
 dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
-		struct iomap *iomap, struct iomap *srcmap)
+		unsigned flags, struct iomap *iomap, struct iomap *srcmap)
 {
 	struct block_device *bdev = iomap->bdev;
 	struct dax_device *dax_dev = iomap->dax_dev;
diff --git a/fs/iomap/apply.c b/fs/iomap/apply.c
index 76925b40b5fd..562536da8a13 100644
--- a/fs/iomap/apply.c
+++ b/fs/iomap/apply.c
@@ -77,7 +77,7 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags,
 	 * iomap into the actors so that they don't need to have special
 	 * handling for the two cases.
 	 */
-	written = actor(inode, pos, length, data, &iomap,
+	written = actor(inode, pos, length, data, flags, &iomap,
 			srcmap.type != IOMAP_HOLE ? &srcmap : &iomap);
 
 	/*
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 828444e14d09..9b5b770ca4c7 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -249,7 +249,7 @@ static inline bool iomap_block_needs_zeroing(struct inode *inode,
 
 static loff_t
 iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
-		struct iomap *iomap, struct iomap *srcmap)
+		unsigned flags, struct iomap *iomap, struct iomap *srcmap)
 {
 	struct iomap_readpage_ctx *ctx = data;
 	struct page *page = ctx->cur_page;
@@ -397,7 +397,8 @@ iomap_next_page(struct inode *inode, struct list_head *pages, loff_t pos,
 
 static loff_t
 iomap_readpages_actor(struct inode *inode, loff_t pos, loff_t length,
-		void *data, struct iomap *iomap, struct iomap *srcmap)
+		void *data, unsigned flags, struct iomap *iomap,
+		struct iomap *srcmap)
 {
 	struct iomap_readpage_ctx *ctx = data;
 	loff_t done, ret;
@@ -417,7 +418,7 @@ iomap_readpages_actor(struct inode *inode, loff_t pos, loff_t length,
 			ctx->cur_page_in_bio = false;
 		}
 		ret = iomap_readpage_actor(inode, pos + done, length - done,
-				ctx, iomap, srcmap);
+				ctx, 0, iomap, srcmap);
 	}
 
 	return done;
@@ -797,7 +798,7 @@ iomap_write_end(struct inode *inode, loff_t pos, unsigned len, unsigned copied,
 
 static loff_t
 iomap_write_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
-		struct iomap *iomap, struct iomap *srcmap)
+		unsigned flags, struct iomap *iomap, struct iomap *srcmap)
 {
 	struct iov_iter *i = data;
 	long status = 0;
@@ -897,7 +898,7 @@ EXPORT_SYMBOL_GPL(iomap_file_buffered_write);
 
 static loff_t
 iomap_unshare_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
-		struct iomap *iomap, struct iomap *srcmap)
+		unsigned flags, struct iomap *iomap, struct iomap *srcmap)
 {
 	long status = 0;
 	ssize_t written = 0;
@@ -983,7 +984,8 @@ static int iomap_dax_zero(loff_t pos, unsigned offset, unsigned bytes,
 
 static loff_t
 iomap_zero_range_actor(struct inode *inode, loff_t pos, loff_t count,
-		void *data, struct iomap *iomap, struct iomap *srcmap)
+		void *data, unsigned flags, struct iomap *iomap,
+		struct iomap *srcmap)
 {
 	bool *did_zero = data;
 	loff_t written = 0;
@@ -1053,7 +1055,8 @@ EXPORT_SYMBOL_GPL(iomap_truncate_page);
 
 static loff_t
 iomap_page_mkwrite_actor(struct inode *inode, loff_t pos, loff_t length,
-		void *data, struct iomap *iomap, struct iomap *srcmap)
+		void *data, unsigned flags, struct iomap *iomap,
+		struct iomap *srcmap)
 {
 	struct page *page = data;
 	int ret;
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 23837926c0c5..2525997b09aa 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -365,7 +365,8 @@ iomap_dio_inline_actor(struct inode *inode, loff_t pos, loff_t length,
 
 static loff_t
 iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
-		void *data, struct iomap *iomap, struct iomap *srcmap)
+		void *data, unsigned flags, struct iomap *iomap,
+		struct iomap *srcmap)
 {
 	struct iomap_dio *dio = data;
 
diff --git a/fs/iomap/fiemap.c b/fs/iomap/fiemap.c
index bccf305ea9ce..04de960259d0 100644
--- a/fs/iomap/fiemap.c
+++ b/fs/iomap/fiemap.c
@@ -44,7 +44,7 @@ static int iomap_to_fiemap(struct fiemap_extent_info *fi,
 
 static loff_t
 iomap_fiemap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
-		struct iomap *iomap, struct iomap *srcmap)
+		unsigned flags, struct iomap *iomap, struct iomap *srcmap)
 {
 	struct fiemap_ctx *ctx = data;
 	loff_t ret = length;
@@ -111,7 +111,8 @@ EXPORT_SYMBOL_GPL(iomap_fiemap);
 
 static loff_t
 iomap_bmap_actor(struct inode *inode, loff_t pos, loff_t length,
-		void *data, struct iomap *iomap, struct iomap *srcmap)
+		void *data, unsigned flags, struct iomap *iomap,
+		struct iomap *srcmap)
 {
 	sector_t *bno = data, addr;
 
diff --git a/fs/iomap/seek.c b/fs/iomap/seek.c
index 89f61d93c0bc..a5cbf04e8cb3 100644
--- a/fs/iomap/seek.c
+++ b/fs/iomap/seek.c
@@ -119,7 +119,8 @@ page_cache_seek_hole_data(struct inode *inode, loff_t offset, loff_t length,
 
 static loff_t
 iomap_seek_hole_actor(struct inode *inode, loff_t offset, loff_t length,
-		      void *data, struct iomap *iomap, struct iomap *srcmap)
+		      void *data, unsigned flags, struct iomap *iomap,
+		      struct iomap *srcmap)
 {
 	switch (iomap->type) {
 	case IOMAP_UNWRITTEN:
@@ -165,7 +166,8 @@ EXPORT_SYMBOL_GPL(iomap_seek_hole);
 
 static loff_t
 iomap_seek_data_actor(struct inode *inode, loff_t offset, loff_t length,
-		      void *data, struct iomap *iomap, struct iomap *srcmap)
+		      void *data, unsigned flags, struct iomap *iomap,
+		      struct iomap *srcmap)
 {
 	switch (iomap->type) {
 	case IOMAP_HOLE:
diff --git a/fs/iomap/swapfile.c b/fs/iomap/swapfile.c
index a648dbf6991e..774bfc3e59e1 100644
--- a/fs/iomap/swapfile.c
+++ b/fs/iomap/swapfile.c
@@ -76,7 +76,7 @@ static int iomap_swapfile_add_extent(struct iomap_swapfile_info *isi)
  * distinction between written and unwritten extents.
  */
 static loff_t iomap_swapfile_activate_actor(struct inode *inode, loff_t pos,
-		loff_t count, void *data, struct iomap *iomap,
+		loff_t count, void *data, unsigned flags, struct iomap *iomap,
 		struct iomap *srcmap)
 {
 	struct iomap_swapfile_info *isi = data;
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 8b09463dae0d..61fcaa3904d4 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -113,7 +113,7 @@ struct iomap_page_ops {
 };
 
 /*
- * Flags for iomap_begin / iomap_end.  No flag implies a read.
+ * Flags for iomap_begin / iomap_end / factor.  No flag implies a read.
  */
 #define IOMAP_WRITE		(1 << 0) /* writing, must allocate blocks */
 #define IOMAP_ZERO		(1 << 1) /* zeroing operation, may skip holes */
@@ -146,7 +146,8 @@ struct iomap_ops {
  * Main iomap iterator function.
  */
 typedef loff_t (*iomap_actor_t)(struct inode *inode, loff_t pos, loff_t len,
-		void *data, struct iomap *iomap, struct iomap *srcmap);
+		void *data, unsigned flags, struct iomap *iomap,
+		struct iomap *srcmap);
 
 loff_t iomap_apply(struct inode *inode, loff_t pos, loff_t length,
 		unsigned flags, const struct iomap_ops *ops, void *data,
-- 
2.24.0
^ permalink raw reply related	[flat|nested] 20+ messages in thread
* Re: [PATCHSET 0/5] Support for RWF_UNCACHED
  2019-12-10 16:24 [PATCHSET 0/5] Support for RWF_UNCACHED Jens Axboe
                   ` (4 preceding siblings ...)
  2019-12-10 16:24 ` [PATCH 5/5] iomap: support RWF_UNCACHED for buffered writes Jens Axboe
@ 2019-12-10 21:17 ` Andreas Dilger
  2019-12-12 15:47 ` Christoph Hellwig
  6 siblings, 0 replies; 20+ messages in thread
From: Andreas Dilger @ 2019-12-10 21:17 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-mm, Linux FS-devel Mailing List, linux-block
[-- Attachment #1: Type: text/plain, Size: 3011 bytes --]
On Dec 10, 2019, at 9:24 AM, Jens Axboe <axboe@kernel.dk> wrote:
> 
> Recently someone asked me how io_uring buffered IO compares to mmaped
> IO in terms of performance. So I ran some tests with buffered IO, and
> found the experience to be somewhat painful. The test case is pretty
> basic, random reads over a dataset that's 10x the size of RAM.
> Performance starts out fine, and then the page cache fills up and we
> hit a throughput cliff. CPU usage of the IO threads go up, and we have
> kswapd spending 100% of a core trying to keep up. Seeing that, I was
> reminded of the many complaints I here about buffered IO, and the fact
> that most of the folks complaining will ultimately bite the bullet and
> move to O_DIRECT to just get the kernel out of the way.
> 
> But I don't think it needs to be like that. Switching to O_DIRECT isn't
> always easily doable. The buffers have different life times, size and
> alignment constraints, etc. On top of that, mixing buffered and O_DIRECT
> can be painful.
> 
> Seems to me that we have an opportunity to provide something that sits
> somewhere in between buffered and O_DIRECT, and this is where
> RWF_UNCACHED enters the picture. If this flag is set on IO, we get the
> following behavior:
> 
> - If the data is in cache, it remains in cache and the copy (in or out)
>  is served to/from that.
> 
> - If the data is NOT in cache, we add it while performing the IO. When
>  the IO is done, we remove it again.
> 
> With this, I can do 100% smooth buffered reads or writes without pushing
> the kernel to the state where kswapd is sweating bullets. In fact it
> doesn't even register.
> 
> Comments appreciated!
I think this is a definite win for e.g. NVMe/Optane devices where the
underlying storage is fast enough to avoid the need for page cache.
In our testing of Lustre on NVMe, it was faster to avoid the page cache
entirely - just inserting and removing the pages from cache took a
considerable amount of CPU for workloads where we knew it was not
beneficial (e.g. IO that was large enough that the storage was as fast
as the network).
This also makes it easier to keep other data in cache (e.g. filesystem
metadata, small IOs, etc.).
Cheers, Andreas
> Patches are against current git (ish), and can also be found here:
> 
> https://git.kernel.dk/cgit/linux-block/log/?h=buffered-uncached
> 
> fs/ceph/file.c          |   2 +-
> fs/dax.c                |   2 +-
> fs/ext4/file.c          |   2 +-
> fs/iomap/apply.c        |   2 +-
> fs/iomap/buffered-io.c  |  75 +++++++++++++++++------
> fs/iomap/direct-io.c    |   3 +-
> fs/iomap/fiemap.c       |   5 +-
> fs/iomap/seek.c         |   6 +-
> fs/iomap/swapfile.c     |   2 +-
> fs/nfs/file.c           |   2 +-
> include/linux/fs.h      |   9 ++-
> include/linux/iomap.h   |   6 +-
> include/uapi/linux/fs.h |   5 +-
> mm/filemap.c            | 132 ++++++++++++++++++++++++++++++++++++----
> 14 files changed, 208 insertions(+), 45 deletions(-)
> 
> --
> Jens Axboe
> 
> 
Cheers, Andreas
[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 873 bytes --]
^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [PATCH 3/5] mm: make buffered writes work with RWF_UNCACHED
  2019-12-10 16:24 ` [PATCH 3/5] mm: make buffered writes work with RWF_UNCACHED Jens Axboe
  2019-12-10 16:55   ` Matthew Wilcox
@ 2019-12-11  0:23   ` Dave Chinner
  2019-12-11  0:28     ` Dave Chinner
  2019-12-11 14:39     ` Jens Axboe
  1 sibling, 2 replies; 20+ messages in thread
From: Dave Chinner @ 2019-12-11  0:23 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-mm, linux-fsdevel, linux-block
On Tue, Dec 10, 2019 at 09:24:52AM -0700, Jens Axboe wrote:
> If RWF_UNCACHED is set for io_uring (or pwritev2(2)), we'll drop the
> cache instantiated for buffered writes. If new pages aren't
> instantiated, we leave them alone. This provides similar semantics to
> reads with RWF_UNCACHED set.
So what about filesystems that don't use generic_perform_write()?
i.e. Anything that uses the iomap infrastructure (i.e.
iomap_file_buffered_write()) instead of generic_file_write_iter())
will currently ignore RWF_UNCACHED. That's XFS and gfs2 right now,
but there are likely to be more in the near future as more
filesystems are ported to the iomap infrastructure.
I'd also really like to see extensive fsx and fstress testing of
this new IO mode before it is committed - this is going to exercise page
cache coherency across different operations in new and unique
ways. that means we need patches to fstests to detect and use this
functionality when available, and new tests that explicitly exercise
combinations of buffered, mmap, dio and uncached for a range of
different IO size and alignments (e.g. mixing sector sized uncached
IO with page sized buffered/mmap/dio and vice versa).
We are not going to have a repeat of the copy_file_range() data
corruption fuckups because no testing was done and no test
infrastructure was written before the new API was committed.
> +void write_drop_cached_pages(struct page **pgs, struct address_space *mapping,
> +			     unsigned *nr)
> +{
> +	loff_t start, end;
> +	int i;
> +
> +	end = 0;
> +	start = LLONG_MAX;
> +	for (i = 0; i < *nr; i++) {
> +		struct page *page = pgs[i];
> +		loff_t off;
> +
> +		off = (loff_t) page_to_index(page) << PAGE_SHIFT;
> +		if (off < start)
> +			start = off;
> +		if (off > end)
> +			end = off;
> +		get_page(page);
> +	}
> +
> +	__filemap_fdatawrite_range(mapping, start, end, WB_SYNC_NONE);
> +
> +	for (i = 0; i < *nr; i++) {
> +		struct page *page = pgs[i];
> +
> +		lock_page(page);
> +		if (page->mapping == mapping) {
> +			wait_on_page_writeback(page);
> +			if (!page_has_private(page) ||
> +			    try_to_release_page(page, 0))
> +				remove_mapping(mapping, page);
> +		}
> +		unlock_page(page);
> +	}
> +	*nr = 0;
> +}
> +EXPORT_SYMBOL_GPL(write_drop_cached_pages);
> +
> +#define GPW_PAGE_BATCH		16
In terms of performance, file fragmentation and premature filesystem
aging, this is also going to suck *really badly* for filesystems
that use delayed allocation because it is going to force conversion
of delayed allocation extents during the write() call. IOWs,
it adds all the overheads of doing delayed allocation, but it reaps
none of the benefits because it doesn't allow large contiguous
extents to build up in memory before physical allocation occurs.
i.e. there is no "delayed" in this allocation....
So it might work fine on a pristine, empty filesystem where it is
easy to find contiguous free space accross multiple allocations, but
it's going to suck after a few months of production usage has
fragmented all the free space into tiny pieces...
Cheers,
Dave.
-- 
Dave Chinner
david@fromorbit.com
^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [PATCH 3/5] mm: make buffered writes work with RWF_UNCACHED
  2019-12-11  0:23   ` Dave Chinner
@ 2019-12-11  0:28     ` Dave Chinner
  2019-12-11 14:39     ` Jens Axboe
  1 sibling, 0 replies; 20+ messages in thread
From: Dave Chinner @ 2019-12-11  0:28 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-mm, linux-fsdevel, linux-block
On Wed, Dec 11, 2019 at 11:23:49AM +1100, Dave Chinner wrote:
> On Tue, Dec 10, 2019 at 09:24:52AM -0700, Jens Axboe wrote:
> > If RWF_UNCACHED is set for io_uring (or pwritev2(2)), we'll drop the
> > cache instantiated for buffered writes. If new pages aren't
> > instantiated, we leave them alone. This provides similar semantics to
> > reads with RWF_UNCACHED set.
> 
> So what about filesystems that don't use generic_perform_write()?
> i.e. Anything that uses the iomap infrastructure (i.e.
> iomap_file_buffered_write()) instead of generic_file_write_iter())
> will currently ignore RWF_UNCACHED. That's XFS and gfs2 right now,
> but there are likely to be more in the near future as more
> filesystems are ported to the iomap infrastructure.
Hmmm - I'm missing part of this patchset, and I see a second version
has been posted that has iomap stuff in it. I'll go look at that
now...
Cheers,
Dave.
-- 
Dave Chinner
david@fromorbit.com
^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [PATCH 3/5] mm: make buffered writes work with RWF_UNCACHED
  2019-12-11  0:23   ` Dave Chinner
  2019-12-11  0:28     ` Dave Chinner
@ 2019-12-11 14:39     ` Jens Axboe
  1 sibling, 0 replies; 20+ messages in thread
From: Jens Axboe @ 2019-12-11 14:39 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-mm, linux-fsdevel, linux-block
On 12/10/19 5:23 PM, Dave Chinner wrote:
> On Tue, Dec 10, 2019 at 09:24:52AM -0700, Jens Axboe wrote:
>> If RWF_UNCACHED is set for io_uring (or pwritev2(2)), we'll drop the
>> cache instantiated for buffered writes. If new pages aren't
>> instantiated, we leave them alone. This provides similar semantics to
>> reads with RWF_UNCACHED set.
> 
> So what about filesystems that don't use generic_perform_write()?
> i.e. Anything that uses the iomap infrastructure (i.e.
> iomap_file_buffered_write()) instead of generic_file_write_iter())
> will currently ignore RWF_UNCACHED. That's XFS and gfs2 right now,
> but there are likely to be more in the near future as more
> filesystems are ported to the iomap infrastructure.
I'll skip this one as you found it.
> I'd also really like to see extensive fsx and fstress testing of
> this new IO mode before it is committed - this is going to exercise page
> cache coherency across different operations in new and unique
> ways. that means we need patches to fstests to detect and use this
> functionality when available, and new tests that explicitly exercise
> combinations of buffered, mmap, dio and uncached for a range of
> different IO size and alignments (e.g. mixing sector sized uncached
> IO with page sized buffered/mmap/dio and vice versa).
> 
> We are not going to have a repeat of the copy_file_range() data
> corruption fuckups because no testing was done and no test
> infrastructure was written before the new API was committed.
Oh I totally agree, and there's no push from my end on this. I just
think it's a cool feature and could be very useful, but it obviously
needs a healthy dose of testing and test cases written. I'll be doing
that as well.
>> +void write_drop_cached_pages(struct page **pgs, struct address_space *mapping,
>> +			     unsigned *nr)
>> +{
>> +	loff_t start, end;
>> +	int i;
>> +
>> +	end = 0;
>> +	start = LLONG_MAX;
>> +	for (i = 0; i < *nr; i++) {
>> +		struct page *page = pgs[i];
>> +		loff_t off;
>> +
>> +		off = (loff_t) page_to_index(page) << PAGE_SHIFT;
>> +		if (off < start)
>> +			start = off;
>> +		if (off > end)
>> +			end = off;
>> +		get_page(page);
>> +	}
>> +
>> +	__filemap_fdatawrite_range(mapping, start, end, WB_SYNC_NONE);
>> +
>> +	for (i = 0; i < *nr; i++) {
>> +		struct page *page = pgs[i];
>> +
>> +		lock_page(page);
>> +		if (page->mapping == mapping) {
>> +			wait_on_page_writeback(page);
>> +			if (!page_has_private(page) ||
>> +			    try_to_release_page(page, 0))
>> +				remove_mapping(mapping, page);
>> +		}
>> +		unlock_page(page);
>> +	}
>> +	*nr = 0;
>> +}
>> +EXPORT_SYMBOL_GPL(write_drop_cached_pages);
>> +
>> +#define GPW_PAGE_BATCH		16
> 
> In terms of performance, file fragmentation and premature filesystem
> aging, this is also going to suck *really badly* for filesystems
> that use delayed allocation because it is going to force conversion
> of delayed allocation extents during the write() call. IOWs,
> it adds all the overheads of doing delayed allocation, but it reaps
> none of the benefits because it doesn't allow large contiguous
> extents to build up in memory before physical allocation occurs.
> i.e. there is no "delayed" in this allocation....
> 
> So it might work fine on a pristine, empty filesystem where it is
> easy to find contiguous free space accross multiple allocations, but
> it's going to suck after a few months of production usage has
> fragmented all the free space into tiny pieces...
I totally agree on this one, and I'm not a huge fan of it. But
considering your suggestion in the other email, I think we just need to
move this up a notch and do it per-write instead. If we can pass back
information about the state of the page cache for the range we care
about, then there's no reason to do it per-page for the write case.
Reads are still best done that way, and we can avoid the LRU overhead by
doing it that way.
-- 
Jens Axboe
^ permalink raw reply	[flat|nested] 20+ messages in thread
* [PATCH 4/5] iomap: pass in the write_begin/write_end flags to iomap_actor
  2019-12-11 15:29 [PATCHSET v3 0/5] Support for RWF_UNCACHED Jens Axboe
@ 2019-12-11 15:29 ` Jens Axboe
  2019-12-11 17:19   ` Linus Torvalds
  0 siblings, 1 reply; 20+ messages in thread
From: Jens Axboe @ 2019-12-11 15:29 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel, linux-block
  Cc: willy, clm, torvalds, david, Jens Axboe
This is in preparation for passing in a flag to the iomap_actor, which
currently doesn't support that.
No functional changes in this patch.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/dax.c               |  2 +-
 fs/iomap/apply.c       |  2 +-
 fs/iomap/buffered-io.c | 17 ++++++++++-------
 fs/iomap/direct-io.c   |  3 ++-
 fs/iomap/fiemap.c      |  5 +++--
 fs/iomap/seek.c        |  6 ++++--
 fs/iomap/swapfile.c    |  2 +-
 include/linux/iomap.h  |  5 +++--
 8 files changed, 25 insertions(+), 17 deletions(-)
diff --git a/fs/dax.c b/fs/dax.c
index 1f1f0201cad1..30a20b994140 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1091,7 +1091,7 @@ EXPORT_SYMBOL_GPL(__dax_zero_page_range);
 
 static loff_t
 dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
-		struct iomap *iomap, struct iomap *srcmap)
+		unsigned flags, struct iomap *iomap, struct iomap *srcmap)
 {
 	struct block_device *bdev = iomap->bdev;
 	struct dax_device *dax_dev = iomap->dax_dev;
diff --git a/fs/iomap/apply.c b/fs/iomap/apply.c
index 76925b40b5fd..562536da8a13 100644
--- a/fs/iomap/apply.c
+++ b/fs/iomap/apply.c
@@ -77,7 +77,7 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags,
 	 * iomap into the actors so that they don't need to have special
 	 * handling for the two cases.
 	 */
-	written = actor(inode, pos, length, data, &iomap,
+	written = actor(inode, pos, length, data, flags, &iomap,
 			srcmap.type != IOMAP_HOLE ? &srcmap : &iomap);
 
 	/*
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 828444e14d09..9b5b770ca4c7 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -249,7 +249,7 @@ static inline bool iomap_block_needs_zeroing(struct inode *inode,
 
 static loff_t
 iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
-		struct iomap *iomap, struct iomap *srcmap)
+		unsigned flags, struct iomap *iomap, struct iomap *srcmap)
 {
 	struct iomap_readpage_ctx *ctx = data;
 	struct page *page = ctx->cur_page;
@@ -397,7 +397,8 @@ iomap_next_page(struct inode *inode, struct list_head *pages, loff_t pos,
 
 static loff_t
 iomap_readpages_actor(struct inode *inode, loff_t pos, loff_t length,
-		void *data, struct iomap *iomap, struct iomap *srcmap)
+		void *data, unsigned flags, struct iomap *iomap,
+		struct iomap *srcmap)
 {
 	struct iomap_readpage_ctx *ctx = data;
 	loff_t done, ret;
@@ -417,7 +418,7 @@ iomap_readpages_actor(struct inode *inode, loff_t pos, loff_t length,
 			ctx->cur_page_in_bio = false;
 		}
 		ret = iomap_readpage_actor(inode, pos + done, length - done,
-				ctx, iomap, srcmap);
+				ctx, 0, iomap, srcmap);
 	}
 
 	return done;
@@ -797,7 +798,7 @@ iomap_write_end(struct inode *inode, loff_t pos, unsigned len, unsigned copied,
 
 static loff_t
 iomap_write_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
-		struct iomap *iomap, struct iomap *srcmap)
+		unsigned flags, struct iomap *iomap, struct iomap *srcmap)
 {
 	struct iov_iter *i = data;
 	long status = 0;
@@ -897,7 +898,7 @@ EXPORT_SYMBOL_GPL(iomap_file_buffered_write);
 
 static loff_t
 iomap_unshare_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
-		struct iomap *iomap, struct iomap *srcmap)
+		unsigned flags, struct iomap *iomap, struct iomap *srcmap)
 {
 	long status = 0;
 	ssize_t written = 0;
@@ -983,7 +984,8 @@ static int iomap_dax_zero(loff_t pos, unsigned offset, unsigned bytes,
 
 static loff_t
 iomap_zero_range_actor(struct inode *inode, loff_t pos, loff_t count,
-		void *data, struct iomap *iomap, struct iomap *srcmap)
+		void *data, unsigned flags, struct iomap *iomap,
+		struct iomap *srcmap)
 {
 	bool *did_zero = data;
 	loff_t written = 0;
@@ -1053,7 +1055,8 @@ EXPORT_SYMBOL_GPL(iomap_truncate_page);
 
 static loff_t
 iomap_page_mkwrite_actor(struct inode *inode, loff_t pos, loff_t length,
-		void *data, struct iomap *iomap, struct iomap *srcmap)
+		void *data, unsigned flags, struct iomap *iomap,
+		struct iomap *srcmap)
 {
 	struct page *page = data;
 	int ret;
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 23837926c0c5..2525997b09aa 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -365,7 +365,8 @@ iomap_dio_inline_actor(struct inode *inode, loff_t pos, loff_t length,
 
 static loff_t
 iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
-		void *data, struct iomap *iomap, struct iomap *srcmap)
+		void *data, unsigned flags, struct iomap *iomap,
+		struct iomap *srcmap)
 {
 	struct iomap_dio *dio = data;
 
diff --git a/fs/iomap/fiemap.c b/fs/iomap/fiemap.c
index bccf305ea9ce..04de960259d0 100644
--- a/fs/iomap/fiemap.c
+++ b/fs/iomap/fiemap.c
@@ -44,7 +44,7 @@ static int iomap_to_fiemap(struct fiemap_extent_info *fi,
 
 static loff_t
 iomap_fiemap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
-		struct iomap *iomap, struct iomap *srcmap)
+		unsigned flags, struct iomap *iomap, struct iomap *srcmap)
 {
 	struct fiemap_ctx *ctx = data;
 	loff_t ret = length;
@@ -111,7 +111,8 @@ EXPORT_SYMBOL_GPL(iomap_fiemap);
 
 static loff_t
 iomap_bmap_actor(struct inode *inode, loff_t pos, loff_t length,
-		void *data, struct iomap *iomap, struct iomap *srcmap)
+		void *data, unsigned flags, struct iomap *iomap,
+		struct iomap *srcmap)
 {
 	sector_t *bno = data, addr;
 
diff --git a/fs/iomap/seek.c b/fs/iomap/seek.c
index 89f61d93c0bc..a5cbf04e8cb3 100644
--- a/fs/iomap/seek.c
+++ b/fs/iomap/seek.c
@@ -119,7 +119,8 @@ page_cache_seek_hole_data(struct inode *inode, loff_t offset, loff_t length,
 
 static loff_t
 iomap_seek_hole_actor(struct inode *inode, loff_t offset, loff_t length,
-		      void *data, struct iomap *iomap, struct iomap *srcmap)
+		      void *data, unsigned flags, struct iomap *iomap,
+		      struct iomap *srcmap)
 {
 	switch (iomap->type) {
 	case IOMAP_UNWRITTEN:
@@ -165,7 +166,8 @@ EXPORT_SYMBOL_GPL(iomap_seek_hole);
 
 static loff_t
 iomap_seek_data_actor(struct inode *inode, loff_t offset, loff_t length,
-		      void *data, struct iomap *iomap, struct iomap *srcmap)
+		      void *data, unsigned flags, struct iomap *iomap,
+		      struct iomap *srcmap)
 {
 	switch (iomap->type) {
 	case IOMAP_HOLE:
diff --git a/fs/iomap/swapfile.c b/fs/iomap/swapfile.c
index a648dbf6991e..774bfc3e59e1 100644
--- a/fs/iomap/swapfile.c
+++ b/fs/iomap/swapfile.c
@@ -76,7 +76,7 @@ static int iomap_swapfile_add_extent(struct iomap_swapfile_info *isi)
  * distinction between written and unwritten extents.
  */
 static loff_t iomap_swapfile_activate_actor(struct inode *inode, loff_t pos,
-		loff_t count, void *data, struct iomap *iomap,
+		loff_t count, void *data, unsigned flags, struct iomap *iomap,
 		struct iomap *srcmap)
 {
 	struct iomap_swapfile_info *isi = data;
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 8b09463dae0d..61fcaa3904d4 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -113,7 +113,7 @@ struct iomap_page_ops {
 };
 
 /*
- * Flags for iomap_begin / iomap_end.  No flag implies a read.
+ * Flags for iomap_begin / iomap_end / factor.  No flag implies a read.
  */
 #define IOMAP_WRITE		(1 << 0) /* writing, must allocate blocks */
 #define IOMAP_ZERO		(1 << 1) /* zeroing operation, may skip holes */
@@ -146,7 +146,8 @@ struct iomap_ops {
  * Main iomap iterator function.
  */
 typedef loff_t (*iomap_actor_t)(struct inode *inode, loff_t pos, loff_t len,
-		void *data, struct iomap *iomap, struct iomap *srcmap);
+		void *data, unsigned flags, struct iomap *iomap,
+		struct iomap *srcmap);
 
 loff_t iomap_apply(struct inode *inode, loff_t pos, loff_t length,
 		unsigned flags, const struct iomap_ops *ops, void *data,
-- 
2.24.0
^ permalink raw reply related	[flat|nested] 20+ messages in thread
* Re: [PATCH 4/5] iomap: pass in the write_begin/write_end flags to iomap_actor
  2019-12-11 15:29 ` [PATCH 4/5] iomap: pass in the write_begin/write_end flags to iomap_actor Jens Axboe
@ 2019-12-11 17:19   ` Linus Torvalds
  0 siblings, 0 replies; 20+ messages in thread
From: Linus Torvalds @ 2019-12-11 17:19 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Linux-MM, linux-fsdevel, linux-block, Matthew Wilcox, Chris Mason,
	Dave Chinner
On Wed, Dec 11, 2019 at 7:29 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> This is in preparation for passing in a flag to the iomap_actor, which
> currently doesn't support that.
This really looks like we should use a struct for passing the arguments, no?
Now on 64-bit, you the iomap_actor() has seven arguments, which
already means that it's passing some of them on the stack on most
architectures.
On 32-bit, it's even worse, because two of the arguments are "loff_t",
which means that they are 2 words each, so you have 9 words of
arguments. I don't know a single architecture that does register
passing for things like that.
If you were to change the calling convention _first_ to do a "struct
iomap_actor" or whatever, then adding the "flags" field would be a
trivial addition.
               Linus
^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [PATCHSET 0/5] Support for RWF_UNCACHED
  2019-12-10 16:24 [PATCHSET 0/5] Support for RWF_UNCACHED Jens Axboe
                   ` (5 preceding siblings ...)
  2019-12-10 21:17 ` [PATCHSET 0/5] Support for RWF_UNCACHED Andreas Dilger
@ 2019-12-12 15:47 ` Christoph Hellwig
  2019-12-12 15:52   ` Jens Axboe
  6 siblings, 1 reply; 20+ messages in thread
From: Christoph Hellwig @ 2019-12-12 15:47 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-mm, linux-fsdevel, linux-block
On Tue, Dec 10, 2019 at 09:24:49AM -0700, Jens Axboe wrote:
> Seems to me that we have an opportunity to provide something that sits
> somewhere in between buffered and O_DIRECT, and this is where
> RWF_UNCACHED enters the picture. If this flag is set on IO, we get the
> following behavior:
> 
> - If the data is in cache, it remains in cache and the copy (in or out)
>   is served to/from that.
> 
> - If the data is NOT in cache, we add it while performing the IO. When
>   the IO is done, we remove it again.
> 
> With this, I can do 100% smooth buffered reads or writes without pushing
> the kernel to the state where kswapd is sweating bullets. In fact it
> doesn't even register.
> 
> Comments appreciated! Patches are against current git (ish), and can
> also be found here:
I can't say I particularly like the model, as it still has all the
page cache overhead.  Direct I/O with bounce buffers for unaligned I/O
sounds simpler and faster to me.
^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [PATCHSET 0/5] Support for RWF_UNCACHED
  2019-12-12 15:47 ` Christoph Hellwig
@ 2019-12-12 15:52   ` Jens Axboe
  0 siblings, 0 replies; 20+ messages in thread
From: Jens Axboe @ 2019-12-12 15:52 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-mm, linux-fsdevel, linux-block
On 12/12/19 8:47 AM, Christoph Hellwig wrote:
> On Tue, Dec 10, 2019 at 09:24:49AM -0700, Jens Axboe wrote:
>> Seems to me that we have an opportunity to provide something that sits
>> somewhere in between buffered and O_DIRECT, and this is where
>> RWF_UNCACHED enters the picture. If this flag is set on IO, we get the
>> following behavior:
>>
>> - If the data is in cache, it remains in cache and the copy (in or out)
>>   is served to/from that.
>>
>> - If the data is NOT in cache, we add it while performing the IO. When
>>   the IO is done, we remove it again.
>>
>> With this, I can do 100% smooth buffered reads or writes without pushing
>> the kernel to the state where kswapd is sweating bullets. In fact it
>> doesn't even register.
>>
>> Comments appreciated! Patches are against current git (ish), and can
>> also be found here:
> 
> I can't say I particularly like the model, as it still has all the
> page cache overhead.  Direct I/O with bounce buffers for unaligned I/O
> sounds simpler and faster to me.
The current patchset read side does not, hopefully the same can be done
on the write side. No page cache usage for reads, because it did indeed
turn out to have too much overhead.
-- 
Jens Axboe
^ permalink raw reply	[flat|nested] 20+ messages in thread
end of thread, other threads:[~2019-12-12 15:52 UTC | newest]
Thread overview: 20+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2019-12-10 16:24 [PATCHSET 0/5] Support for RWF_UNCACHED Jens Axboe
2019-12-10 16:24 ` [PATCH 1/5] fs: add read support " Jens Axboe
2019-12-10 16:24 ` [PATCH 2/5] mm: make generic_perform_write() take a struct kiocb Jens Axboe
2019-12-10 16:24 ` [PATCH 3/5] mm: make buffered writes work with RWF_UNCACHED Jens Axboe
2019-12-10 16:55   ` Matthew Wilcox
2019-12-10 17:02     ` Jens Axboe
2019-12-10 18:35       ` Chris Mason
2019-12-10 18:58       ` Matthew Wilcox
2019-12-10 19:10         ` Jens Axboe
2019-12-11  0:23   ` Dave Chinner
2019-12-11  0:28     ` Dave Chinner
2019-12-11 14:39     ` Jens Axboe
2019-12-10 16:24 ` [PATCH 4/5] iomap: pass in the write_begin/write_end flags to iomap_actor Jens Axboe
2019-12-10 16:24 ` [PATCH 5/5] iomap: support RWF_UNCACHED for buffered writes Jens Axboe
2019-12-10 21:17 ` [PATCHSET 0/5] Support for RWF_UNCACHED Andreas Dilger
2019-12-12 15:47 ` Christoph Hellwig
2019-12-12 15:52   ` Jens Axboe
  -- strict thread matches above, loose matches on Subject: below --
2019-12-10 20:42 [PATCHSET v2 " Jens Axboe
2019-12-10 20:43 ` [PATCH 4/5] iomap: pass in the write_begin/write_end flags to iomap_actor Jens Axboe
2019-12-11 15:29 [PATCHSET v3 0/5] Support for RWF_UNCACHED Jens Axboe
2019-12-11 15:29 ` [PATCH 4/5] iomap: pass in the write_begin/write_end flags to iomap_actor Jens Axboe
2019-12-11 17:19   ` Linus Torvalds
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).