linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v13 00/12] iov_iter: Improve page extraction (pin or just list)
@ 2023-02-09 10:29 David Howells
  2023-02-09 10:29 ` [PATCH v13 01/12] splice: Fix O_DIRECT file read splice to avoid reversion of ITER_PIPE David Howells
                   ` (12 more replies)
  0 siblings, 13 replies; 35+ messages in thread
From: David Howells @ 2023-02-09 10:29 UTC (permalink / raw)
  To: Jens Axboe, Al Viro, Christoph Hellwig
  Cc: David Howells, Matthew Wilcox, Jan Kara, Jeff Layton,
	David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe, Hillf Danton,
	linux-fsdevel, linux-block, linux-kernel, linux-mm

Hi Jens, Al, Christoph,

Here are patches to provide support for extracting pages from an iov_iter
and to use this in the extraction functions in the block layer bio code.

The patches make the following changes:

 (1) Change generic_file_splice_read() to no longer use ITER_PIPE for doing
     a read from an O_DIRECT file fd, but rather load up an ITER_BVEC
     iterator with sufficient pages and use that rather than using an
     ITER_PIPE.  This avoids a problem[2] when __iomap_dio_rw() calls
     iov_iter_revert() to shorten an iterator when it races with
     truncation.  The reversion causes the pipe iterator to prematurely
     release the pages it was retaining - despite the read still being in
     progress.  This caused memory corruption.

 (2) Change generic_file_splice_read() to no longer use ITER_PIPE for doing
     a read from a buffered file fd, but rather get pages directly from the
     pagecache using filemap_get_pages() do all the readahead, reading,
     waiting and extraction, and then feed the pages directly into the
     pipe.

 (3) filemap_get_pages() is altered so that it doesn't take an iterator
     (which we don't have in (2)), but rather the count and a flag
     indicating if we can handle partially uptodate pages are passed in and
     down to its subsidiary functions.

 (4) Remove ITER_PIPE and its paraphernalia as generic_file_splice_read()
     was the only user.

 (5) Add a function, iov_iter_extract_pages() to replace
     iov_iter_get_pages*() that gets refs, pins or just lists the pages as
     appropriate to the iterator type.

     Add a function, iov_iter_extract_will_pin() that will indicate from
     the iterator type how the cleanup is to be performed, returning true
     if the pages will need unpinning, false otherwise.

 (6) Make the bio struct carry a pair of flags to indicate the cleanup
     mode.  BIO_NO_PAGE_REF is replaced with BIO_PAGE_REFFED (indicating
     FOLL_GET was used) and BIO_PAGE_PINNED (indicating FOLL_PIN was used)
     is added.

     BIO_PAGE_REFFED will go away, but at the moment fs/direct-io.c sets it
     and this series does not fully address that file.

 (7) Add a function, bio_release_page(), to release a page appropriately to
     the cleanup mode indicated by the BIO_PAGE_* flags.

 (8) Make the iter-to-bio code use iov_iter_extract_pages() to retain the
     pages appropriately and clean them up later.

 (9) Fix bio_flagged() so that it doesn't prevent a gcc optimisation.

I've pushed the patches here also:

	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=iov-extract

David

Changes:
========
ver #13)
 - Only use allocation in advance and ITER_BVEC for DIO read-splice.
 - Make buffered read-splice get pages directly from the pagecache.
 - Alter filemap_get_pages() & co. so that it doesn't need an iterator.

ver #12)
 - Added the missing __bitwise on the iov_iter_extraction_t typedef.
 - Rebased on -rc7.
 - Don't specify FOLL_PIN to pin_user_pages_fast().
 - Inserted patch at front to fix race between DIO read and truncation that
   caused memory corruption when iov_iter_revert() got called on an
   ITER_PIPE iterator[2].
 - Inserted a patch after that to remove the now-unused ITER_PIPE and its
   helper functions.
 - Removed the ITER_PIPE bits from iov_iter_extract_pages().

ver #11)
 - Fix iov_iter_extract_kvec_pages() to include the offset into the page in
   the returned starting offset.
 - Use __bitwise for the extraction flags

ver #10)
 - Fix use of i->kvec in iov_iter_extract_bvec_pages() to be i->bvec.
 - Drop bio_set_cleanup_mode(), open coding it instead.

ver #9)
 - It's now not permitted to use FOLL_PIN outside of mm/, so:
 - Change iov_iter_extract_mode() into iov_iter_extract_will_pin() and
   return true/false instead of FOLL_PIN/0.
 - Drop of folio_put_unpin() and page_put_unpin() and instead call
   unpin_user_page() (and put_page()) directly as necessary.
 - Make __bio_release_pages() call bio_release_page() instead of
   unpin_user_page() as there's no BIO_* -> FOLL_* translation to do.
 - Drop the FOLL_* renumbering patch.
 - Change extract_flags to extraction_flags.

ver #8)
 - Import Christoph Hellwig's changes.
   - Split the conversion-to-extraction patch.
   - Drop the extract_flags arg from iov_iter_extract_mode().
   - Don't default bios to BIO_PAGE_REFFED, but set explicitly.
 - Switch FOLL_PIN and FOLL_GET when renumbering so PIN is at bit 0.
 - Switch BIO_PAGE_PINNED and BIO_PAGE_REFFED so PINNED is at bit 0.
 - We should always be using FOLL_PIN (not FOLL_GET) for DIO, so adjust the
   patches for that.

ver #7)
 - For now, drop the parts to pass the I/O direction to iov_iter_*pages*()
   as it turned out to be a lot more complicated, with places not setting
   IOCB_WRITE when they should, for example.
 - Drop all the patches that changed things other then the block layer's
   bio handling.  The netfslib and cifs changes can go into a separate
   patchset.
 - Add support for extracting pages from KVEC-type iterators.
 - When extracting from BVEC/KVEC, skip over empty vecs at the front.

ver #6)
 - Fix write() syscall and co. not setting IOCB_WRITE.
 - Added iocb_is_read() and iocb_is_write() to check IOCB_WRITE.
 - Use op_is_write() in bio_copy_user_iov().
 - Drop the iterator direction checks from smbd_recv().
 - Define FOLL_SOURCE_BUF and FOLL_DEST_BUF and pass them in as part of
   gup_flags to iov_iter_get/extract_pages*().
 - Replace iov_iter_get_pages*2() with iov_iter_get_pages*() and remove.
 - Add back the function to indicate the cleanup mode.
 - Drop the cleanup_mode return arg to iov_iter_extract_pages().
 - Provide a helper to clean up a page.
 - Renumbered FOLL_GET and FOLL_PIN and made BIO_PAGE_REFFED/PINNED have
   the same numerical values, enforced with an assertion.
 - Converted AF_ALG, SCSI vhost, generic DIO, FUSE, splice to pipe, 9P and
   NFS.
 - Added in the patches to make CIFS do top-to-bottom iterators and use
   various of the added extraction functions.
 - Added a pair of work-in-progess patches to make sk_buff fragments store
   FOLL_GET and FOLL_PIN.

ver #5)
 - Replace BIO_NO_PAGE_REF with BIO_PAGE_REFFED and split into own patch.
 - Transcribe FOLL_GET/PIN into BIO_PAGE_REFFED/PINNED flags.
 - Add patch to allow bio_flagged() to be combined by gcc.

ver #4)
 - Drop the patch to move the FOLL_* flags to linux/mm_types.h as they're
   no longer referenced by linux/uio.h.
 - Add ITER_SOURCE/DEST cleanup patches.
 - Make iov_iter/netfslib iter extraction patches use ITER_SOURCE/DEST.
 - Allow additional gup_flags to be passed into iov_iter_extract_pages().
 - Add struct bio patch.

ver #3)
 - Switch to using EXPORT_SYMBOL_GPL to prevent indirect 3rd-party access
   to get/pin_user_pages_fast()[1].

ver #2)
 - Rolled the extraction cleanup mode query function into the extraction
   function, returning the indication through the argument list.
 - Fixed patch 4 (extract to scatterlist) to actually use the new
   extraction API.

Link: https://lore.kernel.org/r/Y3zFzdWnWlEJ8X8/@infradead.org/ [1]
Link: https://lore.kernel.org/r/000000000000b0b3c005f3a09383@google.com/ [2]
Link: https://lore.kernel.org/r/166697254399.61150.1256557652599252121.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166722777223.2555743.162508599131141451.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166732024173.3186319.18204305072070871546.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166869687556.3723671.10061142538708346995.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166920902005.1461876.2786264600108839814.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/166997419665.9475.15014699817597102032.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/167305160937.1521586.133299343565358971.stgit@warthog.procyon.org.uk/ # v4
Link: https://lore.kernel.org/r/167344725490.2425628.13771289553670112965.stgit@warthog.procyon.org.uk/ # v5
Link: https://lore.kernel.org/r/167391047703.2311931.8115712773222260073.stgit@warthog.procyon.org.uk/ # v6
Link: https://lore.kernel.org/r/20230120175556.3556978-1-dhowells@redhat.com/ # v7
Link: https://lore.kernel.org/r/20230123173007.325544-1-dhowells@redhat.com/ # v8
Link: https://lore.kernel.org/r/20230124170108.1070389-1-dhowells@redhat.com/ # v9
Link: https://lore.kernel.org/r/20230125210657.2335748-1-dhowells@redhat.com/ # v10
Link: https://lore.kernel.org/r/20230126141626.2809643-1-dhowells@redhat.com/ # v11
Link: https://lore.kernel.org/r/20230207171305.3716974-1-dhowells@redhat.com/ # v12

Christoph Hellwig (1):
  block: Replace BIO_NO_PAGE_REF with BIO_PAGE_REFFED with inverted
    logic

David Howells (11):
  splice: Fix O_DIRECT file read splice to avoid reversion of ITER_PIPE
  mm: Pass info, not iter, into filemap_get_pages() and unstatic it
  splice: Do splice read from a buffered file without using ITER_PIPE
  iov_iter: Kill ITER_PIPE
  iov_iter: Define flags to qualify page extraction.
  iov_iter: Add a function to extract a page list from an iterator
  iomap: Don't get an reference on ZERO_PAGE for direct I/O block
    zeroing
  block: Fix bio_flagged() so that gcc can better optimise it
  block: Add BIO_PAGE_PINNED and associated infrastructure
  block: Convert bio_iov_iter_get_pages to use iov_iter_extract_pages
  block: convert bio_map_user_iov to use iov_iter_extract_pages

 block/bio.c               |  33 +-
 block/blk-map.c           |  26 +-
 block/blk.h               |  12 +
 fs/cifs/file.c            |   8 +-
 fs/direct-io.c            |   2 +
 fs/iomap/direct-io.c      |   1 -
 fs/splice.c               | 245 ++++++++++++-
 include/linux/bio.h       |   5 +-
 include/linux/blk_types.h |   3 +-
 include/linux/pagemap.h   |   2 +
 include/linux/uio.h       |  49 ++-
 lib/iov_iter.c            | 713 +++++++++++++++-----------------------
 mm/filemap.c              |  30 +-
 13 files changed, 603 insertions(+), 526 deletions(-)


^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH v13 01/12] splice: Fix O_DIRECT file read splice to avoid reversion of ITER_PIPE
  2023-02-09 10:29 [PATCH v13 00/12] iov_iter: Improve page extraction (pin or just list) David Howells
@ 2023-02-09 10:29 ` David Howells
  2023-02-09 14:53   ` Matthew Wilcox
                     ` (2 more replies)
  2023-02-09 10:29 ` [PATCH v13 02/12] mm: Pass info, not iter, into filemap_get_pages() and unstatic it David Howells
                   ` (11 subsequent siblings)
  12 siblings, 3 replies; 35+ messages in thread
From: David Howells @ 2023-02-09 10:29 UTC (permalink / raw)
  To: Jens Axboe, Al Viro, Christoph Hellwig
  Cc: David Howells, Matthew Wilcox, Jan Kara, Jeff Layton,
	David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe, Hillf Danton,
	linux-fsdevel, linux-block, linux-kernel, linux-mm,
	syzbot+a440341a59e3b7142895, Christoph Hellwig, John Hubbard

With the upcoming iov_iter_extract_pages() function, pages extracted from a
non-user-backed iterator such as ITER_PIPE aren't pinned.
__iomap_dio_rw(), however, calls iov_iter_revert() to shorten the iterator
to just the bufferage it is going to use - which has the side-effect of
freeing the excess pipe buffers, even though they're attached to a bio and
may get written to by DMA (thanks to Hillf Danton for spotting this[1]).

This then causes memory corruption that is particularly noticable when the
syzbot test[2] is run.  The test boils down to:

	out = creat(argv[1], 0666);
	ftruncate(out, 0x800);
	lseek(out, 0x200, SEEK_SET);
	in = open(argv[1], O_RDONLY | O_DIRECT | O_NOFOLLOW);
	sendfile(out, in, NULL, 0x1dd00);

run repeatedly in parallel.  What I think is happening is that ftruncate()
occasionally shortens the DIO read that's about to be made by sendfile's
splice core by reducing i_size.

Fix this by splitting the handling of a splice from an O_DIRECT file fd off
from that of non-DIO and in this case, replacing the use of an ITER_PIPE
iterator with an ITER_BVEC iterator for which reversion won't free the
buffers.  The DIO-specific code bulk allocates all the buffers it thinks it
is going to use in advance, does the read synchronously and only then trims
the buffer down.  The pages we did use get pushed into the pipe.

This should be more efficient for DIO read by virtue of doing a bulk page
allocation, but slightly less efficient by ignoring any partial page in the
pipe.

Fixes: 920756a3306a ("block: Convert bio_iov_iter_get_pages to use iov_iter_extract_pages")
Reported-by: syzbot+a440341a59e3b7142895@syzkaller.appspotmail.com
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Christoph Hellwig <hch@lst.de>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: David Hildenbrand <david@redhat.com>
cc: John Hubbard <jhubbard@nvidia.com>
cc: linux-mm@kvack.org
cc: linux-block@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20230207094731.1390-1-hdanton@sina.com/ [1]
Link: https://lore.kernel.org/r/000000000000b0b3c005f3a09383@google.com/ [2]
---

Notes:
    ver #13)
     - Don't completely replace generic_file_splice_read(), but rather only use
       this if we're doing a splicing from an O_DIRECT file fd.

 fs/splice.c | 96 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 96 insertions(+)

diff --git a/fs/splice.c b/fs/splice.c
index 5969b7a1d353..b4be6fc314a1 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -282,6 +282,99 @@ void splice_shrink_spd(struct splice_pipe_desc *spd)
 	kfree(spd->partial);
 }
 
+/*
+ * Splice data from an O_DIRECT file into pages and then add them to the output
+ * pipe.
+ */
+static ssize_t generic_file_direct_splice_read(struct file *in, loff_t *ppos,
+					       struct pipe_inode_info *pipe,
+					       size_t len, unsigned int flags)
+{
+	LIST_HEAD(pages);
+	struct iov_iter to;
+	struct bio_vec *bv;
+	struct kiocb kiocb;
+	struct page *page;
+	unsigned int head;
+	ssize_t ret;
+	size_t used, npages, chunk, remain, reclaim;
+	int i;
+
+	/* Work out how much data we can actually add into the pipe */
+	used = pipe_occupancy(pipe->head, pipe->tail);
+	npages = max_t(ssize_t, pipe->max_usage - used, 0);
+	len = min_t(size_t, len, npages * PAGE_SIZE);
+	npages = DIV_ROUND_UP(len, PAGE_SIZE);
+
+	bv = kmalloc(array_size(npages, sizeof(bv[0])), GFP_KERNEL);
+	if (!bv)
+		return -ENOMEM;
+
+	npages = alloc_pages_bulk_list(GFP_USER, npages, &pages);
+	if (!npages) {
+		kfree(bv);
+		return -ENOMEM;
+	}
+
+	remain = len = min_t(size_t, len, npages * PAGE_SIZE);
+
+	for (i = 0; i < npages; i++) {
+		chunk = min_t(size_t, PAGE_SIZE, remain);
+		page = list_first_entry(&pages, struct page, lru);
+		list_del_init(&page->lru);
+		bv[i].bv_page = page;
+		bv[i].bv_offset = 0;
+		bv[i].bv_len = chunk;
+		remain -= chunk;
+	}
+
+	/* Do the I/O */
+	iov_iter_bvec(&to, ITER_DEST, bv, npages, len);
+	init_sync_kiocb(&kiocb, in);
+	kiocb.ki_pos = *ppos;
+	ret = call_read_iter(in, &kiocb, &to);
+
+	reclaim = npages * PAGE_SIZE;
+	remain = 0;
+	if (ret > 0) {
+		reclaim -= ret;
+		remain = ret;
+		*ppos = kiocb.ki_pos;
+		file_accessed(in);
+	} else if (ret < 0) {
+		/*
+		 * callers of ->splice_read() expect -EAGAIN on
+		 * "can't put anything in there", rather than -EFAULT.
+		 */
+		if (ret == -EFAULT)
+			ret = -EAGAIN;
+	}
+
+	/* Free any pages that didn't get touched at all. */
+	for (; reclaim >= PAGE_SIZE; reclaim -= PAGE_SIZE)
+		__free_page(bv[--npages].bv_page);
+
+	/* Push the remaining pages into the pipe. */
+	head = pipe->head;
+	for (i = 0; i < npages; i++) {
+		struct pipe_buffer *buf = &pipe->bufs[head & (pipe->ring_size - 1)];
+
+		chunk = min_t(size_t, remain, PAGE_SIZE);
+		*buf = (struct pipe_buffer) {
+			.ops	= &default_pipe_buf_ops,
+			.page	= bv[i].bv_page,
+			.offset	= 0,
+			.len	= chunk,
+		};
+		head++;
+		remain -= chunk;
+	}
+	pipe->head = head;
+
+	kfree(bv);
+	return ret;
+}
+
 /**
  * generic_file_splice_read - splice data from file to a pipe
  * @in:		file to splice from
@@ -303,6 +396,9 @@ ssize_t generic_file_splice_read(struct file *in, loff_t *ppos,
 	struct kiocb kiocb;
 	int ret;
 
+	if (in->f_flags & O_DIRECT)
+		return generic_file_direct_splice_read(in, ppos, pipe, len, flags);
+
 	iov_iter_pipe(&to, ITER_DEST, pipe, len);
 	init_sync_kiocb(&kiocb, in);
 	kiocb.ki_pos = *ppos;


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v13 02/12] mm: Pass info, not iter, into filemap_get_pages() and unstatic it
  2023-02-09 10:29 [PATCH v13 00/12] iov_iter: Improve page extraction (pin or just list) David Howells
  2023-02-09 10:29 ` [PATCH v13 01/12] splice: Fix O_DIRECT file read splice to avoid reversion of ITER_PIPE David Howells
@ 2023-02-09 10:29 ` David Howells
  2023-02-13  8:22   ` Christoph Hellwig
  2023-02-09 10:29 ` [PATCH v13 03/12] splice: Do splice read from a buffered file without using ITER_PIPE David Howells
                   ` (10 subsequent siblings)
  12 siblings, 1 reply; 35+ messages in thread
From: David Howells @ 2023-02-09 10:29 UTC (permalink / raw)
  To: Jens Axboe, Al Viro, Christoph Hellwig
  Cc: David Howells, Matthew Wilcox, Jan Kara, Jeff Layton,
	David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe, Hillf Danton,
	linux-fsdevel, linux-block, linux-kernel, linux-mm,
	Christoph Hellwig, John Hubbard

filemap_get_pages() and a number of functions that it calls take an
iterator to provide two things: the number of bytes to be got from the file
specified and whether partially uptodate pages are allowed.  Change these
functions so that this information is passed in directly.  This allows it
to be called without having an iterator to hand.

Also make filemap_get_pages() available so that it can be used by a later
patch to fix splicing from a buffered file.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Christoph Hellwig <hch@lst.de>
cc: Matthew Wilcox <willy@infradead.org>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: David Hildenbrand <david@redhat.com>
cc: John Hubbard <jhubbard@nvidia.com>
cc: linux-mm@kvack.org
cc: linux-block@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
---
 include/linux/pagemap.h |  2 ++
 mm/filemap.c            | 31 ++++++++++++++++++-------------
 2 files changed, 20 insertions(+), 13 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 29e1f9e76eb6..3a7bdb35acff 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -748,6 +748,8 @@ struct page *read_cache_page(struct address_space *, pgoff_t index,
 		filler_t *filler, struct file *file);
 extern struct page * read_cache_page_gfp(struct address_space *mapping,
 				pgoff_t index, gfp_t gfp_mask);
+int filemap_get_pages(struct kiocb *iocb, size_t count,
+		      struct folio_batch *fbatch, bool need_uptodate);
 
 static inline struct page *read_mapping_page(struct address_space *mapping,
 				pgoff_t index, struct file *file)
diff --git a/mm/filemap.c b/mm/filemap.c
index c4d4ace9cc70..b31168a9bafd 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2440,21 +2440,19 @@ static int filemap_read_folio(struct file *file, filler_t filler,
 }
 
 static bool filemap_range_uptodate(struct address_space *mapping,
-		loff_t pos, struct iov_iter *iter, struct folio *folio)
+		loff_t pos, size_t count, struct folio *folio,
+		bool need_uptodate)
 {
-	int count;
-
 	if (folio_test_uptodate(folio))
 		return true;
 	/* pipes can't handle partially uptodate pages */
-	if (iov_iter_is_pipe(iter))
+	if (need_uptodate)
 		return false;
 	if (!mapping->a_ops->is_partially_uptodate)
 		return false;
 	if (mapping->host->i_blkbits >= folio_shift(folio))
 		return false;
 
-	count = iter->count;
 	if (folio_pos(folio) > pos) {
 		count -= folio_pos(folio) - pos;
 		pos = 0;
@@ -2466,8 +2464,8 @@ static bool filemap_range_uptodate(struct address_space *mapping,
 }
 
 static int filemap_update_page(struct kiocb *iocb,
-		struct address_space *mapping, struct iov_iter *iter,
-		struct folio *folio)
+		struct address_space *mapping, size_t count,
+		struct folio *folio, bool need_uptodate)
 {
 	int error;
 
@@ -2501,7 +2499,8 @@ static int filemap_update_page(struct kiocb *iocb,
 		goto unlock;
 
 	error = 0;
-	if (filemap_range_uptodate(mapping, iocb->ki_pos, iter, folio))
+	if (filemap_range_uptodate(mapping, iocb->ki_pos, count, folio,
+				   need_uptodate))
 		goto unlock;
 
 	error = -EAGAIN;
@@ -2577,8 +2576,12 @@ static int filemap_readahead(struct kiocb *iocb, struct file *file,
 	return 0;
 }
 
-static int filemap_get_pages(struct kiocb *iocb, struct iov_iter *iter,
-		struct folio_batch *fbatch)
+/*
+ * Extract some folios from the pagecache of a file, reading those pages from
+ * the backing store if necessary and waiting for them.
+ */
+int filemap_get_pages(struct kiocb *iocb, size_t count,
+		      struct folio_batch *fbatch, bool need_uptodate)
 {
 	struct file *filp = iocb->ki_filp;
 	struct address_space *mapping = filp->f_mapping;
@@ -2588,7 +2591,7 @@ static int filemap_get_pages(struct kiocb *iocb, struct iov_iter *iter,
 	struct folio *folio;
 	int err = 0;
 
-	last_index = DIV_ROUND_UP(iocb->ki_pos + iter->count, PAGE_SIZE);
+	last_index = DIV_ROUND_UP(iocb->ki_pos + count, PAGE_SIZE);
 retry:
 	if (fatal_signal_pending(current))
 		return -EINTR;
@@ -2621,7 +2624,8 @@ static int filemap_get_pages(struct kiocb *iocb, struct iov_iter *iter,
 		if ((iocb->ki_flags & IOCB_WAITQ) &&
 		    folio_batch_count(fbatch) > 1)
 			iocb->ki_flags |= IOCB_NOWAIT;
-		err = filemap_update_page(iocb, mapping, iter, folio);
+		err = filemap_update_page(iocb, mapping, count, folio,
+					  need_uptodate);
 		if (err)
 			goto err;
 	}
@@ -2691,7 +2695,8 @@ ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter,
 		if (unlikely(iocb->ki_pos >= i_size_read(inode)))
 			break;
 
-		error = filemap_get_pages(iocb, iter, &fbatch);
+		error = filemap_get_pages(iocb, iter->count, &fbatch,
+					  iov_iter_is_pipe(iter));
 		if (error < 0)
 			break;
 


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v13 03/12] splice: Do splice read from a buffered file without using ITER_PIPE
  2023-02-09 10:29 [PATCH v13 00/12] iov_iter: Improve page extraction (pin or just list) David Howells
  2023-02-09 10:29 ` [PATCH v13 01/12] splice: Fix O_DIRECT file read splice to avoid reversion of ITER_PIPE David Howells
  2023-02-09 10:29 ` [PATCH v13 02/12] mm: Pass info, not iter, into filemap_get_pages() and unstatic it David Howells
@ 2023-02-09 10:29 ` David Howells
  2023-02-13  8:28   ` Christoph Hellwig
                     ` (3 more replies)
  2023-02-09 10:29 ` [PATCH v13 04/12] iov_iter: Kill ITER_PIPE David Howells
                   ` (9 subsequent siblings)
  12 siblings, 4 replies; 35+ messages in thread
From: David Howells @ 2023-02-09 10:29 UTC (permalink / raw)
  To: Jens Axboe, Al Viro, Christoph Hellwig
  Cc: David Howells, Matthew Wilcox, Jan Kara, Jeff Layton,
	David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe, Hillf Danton,
	linux-fsdevel, linux-block, linux-kernel, linux-mm,
	Christoph Hellwig, John Hubbard

Provide a function to do splice read from a buffered file, pulling the
folios out of the pagecache directly by calling filemap_get_pages() to do
any required reading and then pasting the returned folios into the pipe.

A helper function is provided to do the actual folio pasting and will
handle multipage folios by splicing as many of the relevant subpages as
will fit into the pipe.

The ITER_BVEC-based splicing previously added is then only used for
splicing from O_DIRECT files.

The code is loosely based on filemap_read() and might belong in
mm/filemap.c with that as it needs to use filemap_get_pages().

With this, ITER_PIPE is no longer used.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Christoph Hellwig <hch@lst.de>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: David Hildenbrand <david@redhat.com>
cc: John Hubbard <jhubbard@nvidia.com>
cc: linux-mm@kvack.org
cc: linux-block@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
---
 fs/splice.c | 159 ++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 135 insertions(+), 24 deletions(-)

diff --git a/fs/splice.c b/fs/splice.c
index b4be6fc314a1..963cbf20abc8 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -22,6 +22,7 @@
 #include <linux/fs.h>
 #include <linux/file.h>
 #include <linux/pagemap.h>
+#include <linux/pagevec.h>
 #include <linux/splice.h>
 #include <linux/memcontrol.h>
 #include <linux/mm_inline.h>
@@ -375,6 +376,135 @@ static ssize_t generic_file_direct_splice_read(struct file *in, loff_t *ppos,
 	return ret;
 }
 
+/*
+ * Splice subpages from a folio into a pipe.
+ */
+static size_t splice_folio_into_pipe(struct pipe_inode_info *pipe,
+				     struct folio *folio,
+				     loff_t fpos, size_t size)
+{
+	struct page *page;
+	size_t spliced = 0, offset = offset_in_folio(folio, fpos);
+
+	page = folio_page(folio, offset / PAGE_SIZE);
+	size = min(size, folio_size(folio) - offset);
+	offset %= PAGE_SIZE;
+
+	while (spliced < size &&
+	       !pipe_full(pipe->head, pipe->tail, pipe->max_usage)) {
+		struct pipe_buffer *buf = &pipe->bufs[pipe->head & (pipe->ring_size - 1)];
+		size_t part = min_t(size_t, PAGE_SIZE - offset, size - spliced);
+
+		*buf = (struct pipe_buffer) {
+			.ops	= &page_cache_pipe_buf_ops,
+			.page	= page,
+			.offset	= offset,
+			.len	= part,
+		};
+		folio_get(folio);
+		pipe->head++;
+		page++;
+		spliced += part;
+		offset = 0;
+	}
+
+	return spliced;
+}
+
+/*
+ * Splice folios from the pagecache of a buffered (ie. non-O_DIRECT) file into
+ * a pipe.
+ */
+static ssize_t generic_file_buffered_splice_read(struct file *in, loff_t *ppos,
+						 struct pipe_inode_info *pipe,
+						 size_t len,
+						 unsigned int flags)
+{
+	struct folio_batch fbatch;
+	size_t total_spliced = 0, used, npages;
+	loff_t isize, end_offset;
+	bool writably_mapped;
+	int i, error = 0;
+
+	struct kiocb iocb = {
+		.ki_filp	= in,
+		.ki_pos		= *ppos,
+	};
+
+	/* Work out how much data we can actually add into the pipe */
+	used = pipe_occupancy(pipe->head, pipe->tail);
+	npages = max_t(ssize_t, pipe->max_usage - used, 0);
+	len = min_t(size_t, len, npages * PAGE_SIZE);
+
+	folio_batch_init(&fbatch);
+
+	do {
+		cond_resched();
+
+		if (*ppos >= i_size_read(file_inode(in)))
+			break;
+
+		iocb.ki_pos = *ppos;
+		error = filemap_get_pages(&iocb, len, &fbatch, true);
+		if (error < 0)
+			break;
+
+		/*
+		 * i_size must be checked after we know the pages are Uptodate.
+		 *
+		 * Checking i_size after the check allows us to calculate
+		 * the correct value for "nr", which means the zero-filled
+		 * part of the page is not copied back to userspace (unless
+		 * another truncate extends the file - this is desired though).
+		 */
+		isize = i_size_read(file_inode(in));
+		if (unlikely(*ppos >= isize))
+			break;
+		end_offset = min_t(loff_t, isize, *ppos + len);
+
+		/*
+		 * Once we start copying data, we don't want to be touching any
+		 * cachelines that might be contended:
+		 */
+		writably_mapped = mapping_writably_mapped(in->f_mapping);
+
+		for (i = 0; i < folio_batch_count(&fbatch); i++) {
+			struct folio *folio = fbatch.folios[i];
+			size_t n;
+
+			if (folio_pos(folio) >= end_offset)
+				goto out;
+			folio_mark_accessed(folio);
+
+			/*
+			 * If users can be writing to this folio using arbitrary
+			 * virtual addresses, take care of potential aliasing
+			 * before reading the folio on the kernel side.
+			 */
+			if (writably_mapped)
+				flush_dcache_folio(folio);
+
+			n = splice_folio_into_pipe(pipe, folio, *ppos, len);
+			if (!n)
+				goto out;
+			len -= n;
+			total_spliced += n;
+			*ppos += n;
+			in->f_ra.prev_pos = *ppos;
+			if (pipe_full(pipe->head, pipe->tail, pipe->max_usage))
+				goto out;
+		}
+
+		folio_batch_release(&fbatch);
+	} while (len);
+
+out:
+	folio_batch_release(&fbatch);
+	file_accessed(in);
+
+	return total_spliced ? total_spliced : error;
+}
+
 /**
  * generic_file_splice_read - splice data from file to a pipe
  * @in:		file to splice from
@@ -392,32 +522,13 @@ ssize_t generic_file_splice_read(struct file *in, loff_t *ppos,
 				 struct pipe_inode_info *pipe, size_t len,
 				 unsigned int flags)
 {
-	struct iov_iter to;
-	struct kiocb kiocb;
-	int ret;
-
+	if (unlikely(*ppos >= file_inode(in)->i_sb->s_maxbytes))
+		return 0;
+	if (unlikely(!len))
+		return 0;
 	if (in->f_flags & O_DIRECT)
 		return generic_file_direct_splice_read(in, ppos, pipe, len, flags);
-
-	iov_iter_pipe(&to, ITER_DEST, pipe, len);
-	init_sync_kiocb(&kiocb, in);
-	kiocb.ki_pos = *ppos;
-	ret = call_read_iter(in, &kiocb, &to);
-	if (ret > 0) {
-		*ppos = kiocb.ki_pos;
-		file_accessed(in);
-	} else if (ret < 0) {
-		/* free what was emitted */
-		pipe_discard_from(pipe, to.start_head);
-		/*
-		 * callers of ->splice_read() expect -EAGAIN on
-		 * "can't put anything in there", rather than -EFAULT.
-		 */
-		if (ret == -EFAULT)
-			ret = -EAGAIN;
-	}
-
-	return ret;
+	return generic_file_buffered_splice_read(in, ppos, pipe, len, flags);
 }
 EXPORT_SYMBOL(generic_file_splice_read);
 


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v13 04/12] iov_iter: Kill ITER_PIPE
  2023-02-09 10:29 [PATCH v13 00/12] iov_iter: Improve page extraction (pin or just list) David Howells
                   ` (2 preceding siblings ...)
  2023-02-09 10:29 ` [PATCH v13 03/12] splice: Do splice read from a buffered file without using ITER_PIPE David Howells
@ 2023-02-09 10:29 ` David Howells
  2023-02-13  8:28   ` Christoph Hellwig
  2023-02-09 10:29 ` [PATCH v13 05/12] iov_iter: Define flags to qualify page extraction David Howells
                   ` (8 subsequent siblings)
  12 siblings, 1 reply; 35+ messages in thread
From: David Howells @ 2023-02-09 10:29 UTC (permalink / raw)
  To: Jens Axboe, Al Viro, Christoph Hellwig
  Cc: David Howells, Matthew Wilcox, Jan Kara, Jeff Layton,
	David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe, Hillf Danton,
	linux-fsdevel, linux-block, linux-kernel, linux-mm,
	Christoph Hellwig, John Hubbard

The ITER_PIPE-type iterator was only used for generic_file_splice_read(),
but that has now been switched to either pull pages directly from the
pagecache for buffered file splice-reads or to use ITER_BVEC instead for
O_DIRECT file splice-reads.  This leaves ITER_PIPE unused - so remove it.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Christoph Hellwig <hch@lst.de>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: David Hildenbrand <david@redhat.com>
cc: John Hubbard <jhubbard@nvidia.com>
cc: linux-mm@kvack.org
cc: linux-block@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
---
 fs/cifs/file.c      |   8 +-
 include/linux/uio.h |  14 --
 lib/iov_iter.c      | 435 +-------------------------------------------
 mm/filemap.c        |   3 +-
 4 files changed, 5 insertions(+), 455 deletions(-)

diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 22dfc1f8b4f1..57ca4eea69dd 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -3806,13 +3806,7 @@ cifs_readdata_to_iov(struct cifs_readdata *rdata, struct iov_iter *iter)
 		size_t copy = min_t(size_t, remaining, PAGE_SIZE);
 		size_t written;
 
-		if (unlikely(iov_iter_is_pipe(iter))) {
-			void *addr = kmap_atomic(page);
-
-			written = copy_to_iter(addr, copy, iter);
-			kunmap_atomic(addr);
-		} else
-			written = copy_page_to_iter(page, 0, copy, iter);
+		written = copy_page_to_iter(page, 0, copy, iter);
 		remaining -= written;
 		if (written < copy && iov_iter_count(iter) > 0)
 			break;
diff --git a/include/linux/uio.h b/include/linux/uio.h
index 9f158238edba..dcc0ca5ef491 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -11,7 +11,6 @@
 #include <uapi/linux/uio.h>
 
 struct page;
-struct pipe_inode_info;
 
 struct kvec {
 	void *iov_base; /* and that should *never* hold a userland pointer */
@@ -23,7 +22,6 @@ enum iter_type {
 	ITER_IOVEC,
 	ITER_KVEC,
 	ITER_BVEC,
-	ITER_PIPE,
 	ITER_XARRAY,
 	ITER_DISCARD,
 	ITER_UBUF,
@@ -53,15 +51,10 @@ struct iov_iter {
 		const struct kvec *kvec;
 		const struct bio_vec *bvec;
 		struct xarray *xarray;
-		struct pipe_inode_info *pipe;
 		void __user *ubuf;
 	};
 	union {
 		unsigned long nr_segs;
-		struct {
-			unsigned int head;
-			unsigned int start_head;
-		};
 		loff_t xarray_start;
 	};
 };
@@ -99,11 +92,6 @@ static inline bool iov_iter_is_bvec(const struct iov_iter *i)
 	return iov_iter_type(i) == ITER_BVEC;
 }
 
-static inline bool iov_iter_is_pipe(const struct iov_iter *i)
-{
-	return iov_iter_type(i) == ITER_PIPE;
-}
-
 static inline bool iov_iter_is_discard(const struct iov_iter *i)
 {
 	return iov_iter_type(i) == ITER_DISCARD;
@@ -245,8 +233,6 @@ void iov_iter_kvec(struct iov_iter *i, unsigned int direction, const struct kvec
 			unsigned long nr_segs, size_t count);
 void iov_iter_bvec(struct iov_iter *i, unsigned int direction, const struct bio_vec *bvec,
 			unsigned long nr_segs, size_t count);
-void iov_iter_pipe(struct iov_iter *i, unsigned int direction, struct pipe_inode_info *pipe,
-			size_t count);
 void iov_iter_discard(struct iov_iter *i, unsigned int direction, size_t count);
 void iov_iter_xarray(struct iov_iter *i, unsigned int direction, struct xarray *xarray,
 		     loff_t start, size_t count);
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index f9a3ff37ecd1..adc5e8aa8ae8 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -14,8 +14,6 @@
 #include <linux/scatterlist.h>
 #include <linux/instrumented.h>
 
-#define PIPE_PARANOIA /* for now */
-
 /* covers ubuf and kbuf alike */
 #define iterate_buf(i, n, base, len, off, __p, STEP) {		\
 	size_t __maybe_unused off = 0;				\
@@ -186,156 +184,6 @@ static int copyin(void *to, const void __user *from, size_t n)
 	return res;
 }
 
-static inline struct pipe_buffer *pipe_buf(const struct pipe_inode_info *pipe,
-					   unsigned int slot)
-{
-	return &pipe->bufs[slot & (pipe->ring_size - 1)];
-}
-
-#ifdef PIPE_PARANOIA
-static bool sanity(const struct iov_iter *i)
-{
-	struct pipe_inode_info *pipe = i->pipe;
-	unsigned int p_head = pipe->head;
-	unsigned int p_tail = pipe->tail;
-	unsigned int p_occupancy = pipe_occupancy(p_head, p_tail);
-	unsigned int i_head = i->head;
-	unsigned int idx;
-
-	if (i->last_offset) {
-		struct pipe_buffer *p;
-		if (unlikely(p_occupancy == 0))
-			goto Bad;	// pipe must be non-empty
-		if (unlikely(i_head != p_head - 1))
-			goto Bad;	// must be at the last buffer...
-
-		p = pipe_buf(pipe, i_head);
-		if (unlikely(p->offset + p->len != abs(i->last_offset)))
-			goto Bad;	// ... at the end of segment
-	} else {
-		if (i_head != p_head)
-			goto Bad;	// must be right after the last buffer
-	}
-	return true;
-Bad:
-	printk(KERN_ERR "idx = %d, offset = %d\n", i_head, i->last_offset);
-	printk(KERN_ERR "head = %d, tail = %d, buffers = %d\n",
-			p_head, p_tail, pipe->ring_size);
-	for (idx = 0; idx < pipe->ring_size; idx++)
-		printk(KERN_ERR "[%p %p %d %d]\n",
-			pipe->bufs[idx].ops,
-			pipe->bufs[idx].page,
-			pipe->bufs[idx].offset,
-			pipe->bufs[idx].len);
-	WARN_ON(1);
-	return false;
-}
-#else
-#define sanity(i) true
-#endif
-
-static struct page *push_anon(struct pipe_inode_info *pipe, unsigned size)
-{
-	struct page *page = alloc_page(GFP_USER);
-	if (page) {
-		struct pipe_buffer *buf = pipe_buf(pipe, pipe->head++);
-		*buf = (struct pipe_buffer) {
-			.ops = &default_pipe_buf_ops,
-			.page = page,
-			.offset = 0,
-			.len = size
-		};
-	}
-	return page;
-}
-
-static void push_page(struct pipe_inode_info *pipe, struct page *page,
-			unsigned int offset, unsigned int size)
-{
-	struct pipe_buffer *buf = pipe_buf(pipe, pipe->head++);
-	*buf = (struct pipe_buffer) {
-		.ops = &page_cache_pipe_buf_ops,
-		.page = page,
-		.offset = offset,
-		.len = size
-	};
-	get_page(page);
-}
-
-static inline int last_offset(const struct pipe_buffer *buf)
-{
-	if (buf->ops == &default_pipe_buf_ops)
-		return buf->len;	// buf->offset is 0 for those
-	else
-		return -(buf->offset + buf->len);
-}
-
-static struct page *append_pipe(struct iov_iter *i, size_t size,
-				unsigned int *off)
-{
-	struct pipe_inode_info *pipe = i->pipe;
-	int offset = i->last_offset;
-	struct pipe_buffer *buf;
-	struct page *page;
-
-	if (offset > 0 && offset < PAGE_SIZE) {
-		// some space in the last buffer; add to it
-		buf = pipe_buf(pipe, pipe->head - 1);
-		size = min_t(size_t, size, PAGE_SIZE - offset);
-		buf->len += size;
-		i->last_offset += size;
-		i->count -= size;
-		*off = offset;
-		return buf->page;
-	}
-	// OK, we need a new buffer
-	*off = 0;
-	size = min_t(size_t, size, PAGE_SIZE);
-	if (pipe_full(pipe->head, pipe->tail, pipe->max_usage))
-		return NULL;
-	page = push_anon(pipe, size);
-	if (!page)
-		return NULL;
-	i->head = pipe->head - 1;
-	i->last_offset = size;
-	i->count -= size;
-	return page;
-}
-
-static size_t copy_page_to_iter_pipe(struct page *page, size_t offset, size_t bytes,
-			 struct iov_iter *i)
-{
-	struct pipe_inode_info *pipe = i->pipe;
-	unsigned int head = pipe->head;
-
-	if (unlikely(bytes > i->count))
-		bytes = i->count;
-
-	if (unlikely(!bytes))
-		return 0;
-
-	if (!sanity(i))
-		return 0;
-
-	if (offset && i->last_offset == -offset) { // could we merge it?
-		struct pipe_buffer *buf = pipe_buf(pipe, head - 1);
-		if (buf->page == page) {
-			buf->len += bytes;
-			i->last_offset -= bytes;
-			i->count -= bytes;
-			return bytes;
-		}
-	}
-	if (pipe_full(pipe->head, pipe->tail, pipe->max_usage))
-		return 0;
-
-	push_page(pipe, page, offset, bytes);
-	i->last_offset = -(offset + bytes);
-	i->head = head;
-	i->count -= bytes;
-	return bytes;
-}
-
 /*
  * fault_in_iov_iter_readable - fault in iov iterator for reading
  * @i: iterator
@@ -439,46 +287,6 @@ void iov_iter_init(struct iov_iter *i, unsigned int direction,
 }
 EXPORT_SYMBOL(iov_iter_init);
 
-// returns the offset in partial buffer (if any)
-static inline unsigned int pipe_npages(const struct iov_iter *i, int *npages)
-{
-	struct pipe_inode_info *pipe = i->pipe;
-	int used = pipe->head - pipe->tail;
-	int off = i->last_offset;
-
-	*npages = max((int)pipe->max_usage - used, 0);
-
-	if (off > 0 && off < PAGE_SIZE) { // anon and not full
-		(*npages)++;
-		return off;
-	}
-	return 0;
-}
-
-static size_t copy_pipe_to_iter(const void *addr, size_t bytes,
-				struct iov_iter *i)
-{
-	unsigned int off, chunk;
-
-	if (unlikely(bytes > i->count))
-		bytes = i->count;
-	if (unlikely(!bytes))
-		return 0;
-
-	if (!sanity(i))
-		return 0;
-
-	for (size_t n = bytes; n; n -= chunk) {
-		struct page *page = append_pipe(i, n, &off);
-		chunk = min_t(size_t, n, PAGE_SIZE - off);
-		if (!page)
-			return bytes - n;
-		memcpy_to_page(page, off, addr, chunk);
-		addr += chunk;
-	}
-	return bytes;
-}
-
 static __wsum csum_and_memcpy(void *to, const void *from, size_t len,
 			      __wsum sum, size_t off)
 {
@@ -486,44 +294,10 @@ static __wsum csum_and_memcpy(void *to, const void *from, size_t len,
 	return csum_block_add(sum, next, off);
 }
 
-static size_t csum_and_copy_to_pipe_iter(const void *addr, size_t bytes,
-					 struct iov_iter *i, __wsum *sump)
-{
-	__wsum sum = *sump;
-	size_t off = 0;
-	unsigned int chunk, r;
-
-	if (unlikely(bytes > i->count))
-		bytes = i->count;
-	if (unlikely(!bytes))
-		return 0;
-
-	if (!sanity(i))
-		return 0;
-
-	while (bytes) {
-		struct page *page = append_pipe(i, bytes, &r);
-		char *p;
-
-		if (!page)
-			break;
-		chunk = min_t(size_t, bytes, PAGE_SIZE - r);
-		p = kmap_local_page(page);
-		sum = csum_and_memcpy(p + r, addr + off, chunk, sum, off);
-		kunmap_local(p);
-		off += chunk;
-		bytes -= chunk;
-	}
-	*sump = sum;
-	return off;
-}
-
 size_t _copy_to_iter(const void *addr, size_t bytes, struct iov_iter *i)
 {
 	if (WARN_ON_ONCE(i->data_source))
 		return 0;
-	if (unlikely(iov_iter_is_pipe(i)))
-		return copy_pipe_to_iter(addr, bytes, i);
 	if (user_backed_iter(i))
 		might_fault();
 	iterate_and_advance(i, bytes, base, len, off,
@@ -545,42 +319,6 @@ static int copyout_mc(void __user *to, const void *from, size_t n)
 	return n;
 }
 
-static size_t copy_mc_pipe_to_iter(const void *addr, size_t bytes,
-				struct iov_iter *i)
-{
-	size_t xfer = 0;
-	unsigned int off, chunk;
-
-	if (unlikely(bytes > i->count))
-		bytes = i->count;
-	if (unlikely(!bytes))
-		return 0;
-
-	if (!sanity(i))
-		return 0;
-
-	while (bytes) {
-		struct page *page = append_pipe(i, bytes, &off);
-		unsigned long rem;
-		char *p;
-
-		if (!page)
-			break;
-		chunk = min_t(size_t, bytes, PAGE_SIZE - off);
-		p = kmap_local_page(page);
-		rem = copy_mc_to_kernel(p + off, addr + xfer, chunk);
-		chunk -= rem;
-		kunmap_local(p);
-		xfer += chunk;
-		bytes -= chunk;
-		if (rem) {
-			iov_iter_revert(i, rem);
-			break;
-		}
-	}
-	return xfer;
-}
-
 /**
  * _copy_mc_to_iter - copy to iter with source memory error exception handling
  * @addr: source kernel address
@@ -600,9 +338,8 @@ static size_t copy_mc_pipe_to_iter(const void *addr, size_t bytes,
  *   alignment and poison alignment assumptions to avoid re-triggering
  *   hardware exceptions.
  *
- * * ITER_KVEC, ITER_PIPE, and ITER_BVEC can return short copies.
- *   Compare to copy_to_iter() where only ITER_IOVEC attempts might return
- *   a short copy.
+ * * ITER_KVEC and ITER_BVEC can return short copies.  Compare to
+ *   copy_to_iter() where only ITER_IOVEC attempts might return a short copy.
  *
  * Return: number of bytes copied (may be %0)
  */
@@ -610,8 +347,6 @@ size_t _copy_mc_to_iter(const void *addr, size_t bytes, struct iov_iter *i)
 {
 	if (WARN_ON_ONCE(i->data_source))
 		return 0;
-	if (unlikely(iov_iter_is_pipe(i)))
-		return copy_mc_pipe_to_iter(addr, bytes, i);
 	if (user_backed_iter(i))
 		might_fault();
 	__iterate_and_advance(i, bytes, base, len, off,
@@ -717,8 +452,6 @@ size_t copy_page_to_iter(struct page *page, size_t offset, size_t bytes,
 		return 0;
 	if (WARN_ON_ONCE(i->data_source))
 		return 0;
-	if (unlikely(iov_iter_is_pipe(i)))
-		return copy_page_to_iter_pipe(page, offset, bytes, i);
 	page += offset / PAGE_SIZE; // first subpage
 	offset %= PAGE_SIZE;
 	while (1) {
@@ -767,36 +500,8 @@ size_t copy_page_from_iter(struct page *page, size_t offset, size_t bytes,
 }
 EXPORT_SYMBOL(copy_page_from_iter);
 
-static size_t pipe_zero(size_t bytes, struct iov_iter *i)
-{
-	unsigned int chunk, off;
-
-	if (unlikely(bytes > i->count))
-		bytes = i->count;
-	if (unlikely(!bytes))
-		return 0;
-
-	if (!sanity(i))
-		return 0;
-
-	for (size_t n = bytes; n; n -= chunk) {
-		struct page *page = append_pipe(i, n, &off);
-		char *p;
-
-		if (!page)
-			return bytes - n;
-		chunk = min_t(size_t, n, PAGE_SIZE - off);
-		p = kmap_local_page(page);
-		memset(p + off, 0, chunk);
-		kunmap_local(p);
-	}
-	return bytes;
-}
-
 size_t iov_iter_zero(size_t bytes, struct iov_iter *i)
 {
-	if (unlikely(iov_iter_is_pipe(i)))
-		return pipe_zero(bytes, i);
 	iterate_and_advance(i, bytes, base, len, count,
 		clear_user(base, len),
 		memset(base, 0, len)
@@ -827,32 +532,6 @@ size_t copy_page_from_iter_atomic(struct page *page, unsigned offset, size_t byt
 }
 EXPORT_SYMBOL(copy_page_from_iter_atomic);
 
-static void pipe_advance(struct iov_iter *i, size_t size)
-{
-	struct pipe_inode_info *pipe = i->pipe;
-	int off = i->last_offset;
-
-	if (!off && !size) {
-		pipe_discard_from(pipe, i->start_head); // discard everything
-		return;
-	}
-	i->count -= size;
-	while (1) {
-		struct pipe_buffer *buf = pipe_buf(pipe, i->head);
-		if (off) /* make it relative to the beginning of buffer */
-			size += abs(off) - buf->offset;
-		if (size <= buf->len) {
-			buf->len = size;
-			i->last_offset = last_offset(buf);
-			break;
-		}
-		size -= buf->len;
-		i->head++;
-		off = 0;
-	}
-	pipe_discard_from(pipe, i->head + 1); // discard everything past this one
-}
-
 static void iov_iter_bvec_advance(struct iov_iter *i, size_t size)
 {
 	const struct bio_vec *bvec, *end;
@@ -904,8 +583,6 @@ void iov_iter_advance(struct iov_iter *i, size_t size)
 		iov_iter_iovec_advance(i, size);
 	} else if (iov_iter_is_bvec(i)) {
 		iov_iter_bvec_advance(i, size);
-	} else if (iov_iter_is_pipe(i)) {
-		pipe_advance(i, size);
 	} else if (iov_iter_is_discard(i)) {
 		i->count -= size;
 	}
@@ -919,26 +596,6 @@ void iov_iter_revert(struct iov_iter *i, size_t unroll)
 	if (WARN_ON(unroll > MAX_RW_COUNT))
 		return;
 	i->count += unroll;
-	if (unlikely(iov_iter_is_pipe(i))) {
-		struct pipe_inode_info *pipe = i->pipe;
-		unsigned int head = pipe->head;
-
-		while (head > i->start_head) {
-			struct pipe_buffer *b = pipe_buf(pipe, --head);
-			if (unroll < b->len) {
-				b->len -= unroll;
-				i->last_offset = last_offset(b);
-				i->head = head;
-				return;
-			}
-			unroll -= b->len;
-			pipe_buf_release(pipe, b);
-			pipe->head--;
-		}
-		i->last_offset = 0;
-		i->head = head;
-		return;
-	}
 	if (unlikely(iov_iter_is_discard(i)))
 		return;
 	if (unroll <= i->iov_offset) {
@@ -1026,24 +683,6 @@ void iov_iter_bvec(struct iov_iter *i, unsigned int direction,
 }
 EXPORT_SYMBOL(iov_iter_bvec);
 
-void iov_iter_pipe(struct iov_iter *i, unsigned int direction,
-			struct pipe_inode_info *pipe,
-			size_t count)
-{
-	BUG_ON(direction != READ);
-	WARN_ON(pipe_full(pipe->head, pipe->tail, pipe->ring_size));
-	*i = (struct iov_iter){
-		.iter_type = ITER_PIPE,
-		.data_source = false,
-		.pipe = pipe,
-		.head = pipe->head,
-		.start_head = pipe->head,
-		.last_offset = 0,
-		.count = count
-	};
-}
-EXPORT_SYMBOL(iov_iter_pipe);
-
 /**
  * iov_iter_xarray - Initialise an I/O iterator to use the pages in an xarray
  * @i: The iterator to initialise.
@@ -1168,19 +807,6 @@ bool iov_iter_is_aligned(const struct iov_iter *i, unsigned addr_mask,
 	if (iov_iter_is_bvec(i))
 		return iov_iter_aligned_bvec(i, addr_mask, len_mask);
 
-	if (iov_iter_is_pipe(i)) {
-		size_t size = i->count;
-
-		if (size & len_mask)
-			return false;
-		if (size && i->last_offset > 0) {
-			if (i->last_offset & addr_mask)
-				return false;
-		}
-
-		return true;
-	}
-
 	if (iov_iter_is_xarray(i)) {
 		if (i->count & len_mask)
 			return false;
@@ -1250,14 +876,6 @@ unsigned long iov_iter_alignment(const struct iov_iter *i)
 	if (iov_iter_is_bvec(i))
 		return iov_iter_alignment_bvec(i);
 
-	if (iov_iter_is_pipe(i)) {
-		size_t size = i->count;
-
-		if (size && i->last_offset > 0)
-			return size | i->last_offset;
-		return size;
-	}
-
 	if (iov_iter_is_xarray(i))
 		return (i->xarray_start + i->iov_offset) | i->count;
 
@@ -1309,36 +927,6 @@ static int want_pages_array(struct page ***res, size_t size,
 	return count;
 }
 
-static ssize_t pipe_get_pages(struct iov_iter *i,
-		   struct page ***pages, size_t maxsize, unsigned maxpages,
-		   size_t *start)
-{
-	unsigned int npages, count, off, chunk;
-	struct page **p;
-	size_t left;
-
-	if (!sanity(i))
-		return -EFAULT;
-
-	*start = off = pipe_npages(i, &npages);
-	if (!npages)
-		return -EFAULT;
-	count = want_pages_array(pages, maxsize, off, min(npages, maxpages));
-	if (!count)
-		return -ENOMEM;
-	p = *pages;
-	for (npages = 0, left = maxsize ; npages < count; npages++, left -= chunk) {
-		struct page *page = append_pipe(i, left, &off);
-		if (!page)
-			break;
-		chunk = min_t(size_t, left, PAGE_SIZE - off);
-		get_page(*p++ = page);
-	}
-	if (!npages)
-		return -EFAULT;
-	return maxsize - left;
-}
-
 static ssize_t iter_xarray_populate_pages(struct page **pages, struct xarray *xa,
 					  pgoff_t index, unsigned int nr_pages)
 {
@@ -1486,8 +1074,6 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
 		}
 		return maxsize;
 	}
-	if (iov_iter_is_pipe(i))
-		return pipe_get_pages(i, pages, maxsize, maxpages, start);
 	if (iov_iter_is_xarray(i))
 		return iter_xarray_get_pages(i, pages, maxsize, maxpages, start);
 	return -EFAULT;
@@ -1577,9 +1163,7 @@ size_t csum_and_copy_to_iter(const void *addr, size_t bytes, void *_csstate,
 	}
 
 	sum = csum_shift(csstate->csum, csstate->off);
-	if (unlikely(iov_iter_is_pipe(i)))
-		bytes = csum_and_copy_to_pipe_iter(addr, bytes, i, &sum);
-	else iterate_and_advance(i, bytes, base, len, off, ({
+	iterate_and_advance(i, bytes, base, len, off, ({
 		next = csum_and_copy_to_user(addr + off, base, len);
 		sum = csum_block_add(sum, next, off);
 		next ? 0 : len;
@@ -1664,15 +1248,6 @@ int iov_iter_npages(const struct iov_iter *i, int maxpages)
 		return iov_npages(i, maxpages);
 	if (iov_iter_is_bvec(i))
 		return bvec_npages(i, maxpages);
-	if (iov_iter_is_pipe(i)) {
-		int npages;
-
-		if (!sanity(i))
-			return 0;
-
-		pipe_npages(i, &npages);
-		return min(npages, maxpages);
-	}
 	if (iov_iter_is_xarray(i)) {
 		unsigned offset = (i->xarray_start + i->iov_offset) % PAGE_SIZE;
 		int npages = DIV_ROUND_UP(offset + i->count, PAGE_SIZE);
@@ -1685,10 +1260,6 @@ EXPORT_SYMBOL(iov_iter_npages);
 const void *dup_iter(struct iov_iter *new, struct iov_iter *old, gfp_t flags)
 {
 	*new = *old;
-	if (unlikely(iov_iter_is_pipe(new))) {
-		WARN_ON(1);
-		return NULL;
-	}
 	if (iov_iter_is_bvec(new))
 		return new->bvec = kmemdup(new->bvec,
 				    new->nr_segs * sizeof(struct bio_vec),
diff --git a/mm/filemap.c b/mm/filemap.c
index b31168a9bafd..6970be64a3e0 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2695,8 +2695,7 @@ ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter,
 		if (unlikely(iocb->ki_pos >= i_size_read(inode)))
 			break;
 
-		error = filemap_get_pages(iocb, iter->count, &fbatch,
-					  iov_iter_is_pipe(iter));
+		error = filemap_get_pages(iocb, iter->count, &fbatch, false);
 		if (error < 0)
 			break;
 


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v13 05/12] iov_iter: Define flags to qualify page extraction.
  2023-02-09 10:29 [PATCH v13 00/12] iov_iter: Improve page extraction (pin or just list) David Howells
                   ` (3 preceding siblings ...)
  2023-02-09 10:29 ` [PATCH v13 04/12] iov_iter: Kill ITER_PIPE David Howells
@ 2023-02-09 10:29 ` David Howells
  2023-02-09 10:29 ` [PATCH v13 06/12] iov_iter: Add a function to extract a page list from an iterator David Howells
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 35+ messages in thread
From: David Howells @ 2023-02-09 10:29 UTC (permalink / raw)
  To: Jens Axboe, Al Viro, Christoph Hellwig
  Cc: David Howells, Matthew Wilcox, Jan Kara, Jeff Layton,
	David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe, Hillf Danton,
	linux-fsdevel, linux-block, linux-kernel, linux-mm,
	Christoph Hellwig, John Hubbard

Define flags to qualify page extraction to pass into iov_iter_*_pages*()
rather than passing in FOLL_* flags.

For now only a flag to allow peer-to-peer DMA is supported.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Logan Gunthorpe <logang@deltatee.com>
cc: linux-fsdevel@vger.kernel.org
cc: linux-block@vger.kernel.org
---

Notes:
    ver #12)
     - Use __bitwise for the extraction flags typedef.
    
    ver #11)
     - Use __bitwise for the extraction flags.
    
    ver #9)
     - Change extract_flags to extraction_flags.
    
    ver #7)
     - Don't use FOLL_* as a parameter, but rather define constants
       specifically to use with iov_iter_*_pages*().
     - Drop the I/O direction constants for now.

 block/bio.c         |  6 +++---
 block/blk-map.c     |  8 ++++----
 include/linux/uio.h | 10 ++++++++--
 lib/iov_iter.c      | 14 ++++++++------
 4 files changed, 23 insertions(+), 15 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index ab59a491a883..b97f3991c904 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1245,11 +1245,11 @@ static int bio_iov_add_zone_append_page(struct bio *bio, struct page *page,
  */
 static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 {
+	iov_iter_extraction_t extraction_flags = 0;
 	unsigned short nr_pages = bio->bi_max_vecs - bio->bi_vcnt;
 	unsigned short entries_left = bio->bi_max_vecs - bio->bi_vcnt;
 	struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt;
 	struct page **pages = (struct page **)bv;
-	unsigned int gup_flags = 0;
 	ssize_t size, left;
 	unsigned len, i = 0;
 	size_t offset, trim;
@@ -1264,7 +1264,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	pages += entries_left * (PAGE_PTRS_PER_BVEC - 1);
 
 	if (bio->bi_bdev && blk_queue_pci_p2pdma(bio->bi_bdev->bd_disk->queue))
-		gup_flags |= FOLL_PCI_P2PDMA;
+		extraction_flags |= ITER_ALLOW_P2PDMA;
 
 	/*
 	 * Each segment in the iov is required to be a block size multiple.
@@ -1275,7 +1275,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	 */
 	size = iov_iter_get_pages(iter, pages,
 				  UINT_MAX - bio->bi_iter.bi_size,
-				  nr_pages, &offset, gup_flags);
+				  nr_pages, &offset, extraction_flags);
 	if (unlikely(size <= 0))
 		return size ? size : -EFAULT;
 
diff --git a/block/blk-map.c b/block/blk-map.c
index 19940c978c73..080dd60485be 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -265,9 +265,9 @@ static struct bio *blk_rq_map_bio_alloc(struct request *rq,
 static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
 		gfp_t gfp_mask)
 {
+	iov_iter_extraction_t extraction_flags = 0;
 	unsigned int max_sectors = queue_max_hw_sectors(rq->q);
 	unsigned int nr_vecs = iov_iter_npages(iter, BIO_MAX_VECS);
-	unsigned int gup_flags = 0;
 	struct bio *bio;
 	int ret;
 	int j;
@@ -280,7 +280,7 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
 		return -ENOMEM;
 
 	if (blk_queue_pci_p2pdma(rq->q))
-		gup_flags |= FOLL_PCI_P2PDMA;
+		extraction_flags |= ITER_ALLOW_P2PDMA;
 
 	while (iov_iter_count(iter)) {
 		struct page **pages, *stack_pages[UIO_FASTIOV];
@@ -291,10 +291,10 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
 		if (nr_vecs <= ARRAY_SIZE(stack_pages)) {
 			pages = stack_pages;
 			bytes = iov_iter_get_pages(iter, pages, LONG_MAX,
-						   nr_vecs, &offs, gup_flags);
+						   nr_vecs, &offs, extraction_flags);
 		} else {
 			bytes = iov_iter_get_pages_alloc(iter, &pages,
-						LONG_MAX, &offs, gup_flags);
+						LONG_MAX, &offs, extraction_flags);
 		}
 		if (unlikely(bytes <= 0)) {
 			ret = bytes ? bytes : -EFAULT;
diff --git a/include/linux/uio.h b/include/linux/uio.h
index dcc0ca5ef491..af70e4c9ea27 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -12,6 +12,8 @@
 
 struct page;
 
+typedef unsigned int __bitwise iov_iter_extraction_t;
+
 struct kvec {
 	void *iov_base; /* and that should *never* hold a userland pointer */
 	size_t iov_len;
@@ -238,12 +240,12 @@ void iov_iter_xarray(struct iov_iter *i, unsigned int direction, struct xarray *
 		     loff_t start, size_t count);
 ssize_t iov_iter_get_pages(struct iov_iter *i, struct page **pages,
 		size_t maxsize, unsigned maxpages, size_t *start,
-		unsigned gup_flags);
+		iov_iter_extraction_t extraction_flags);
 ssize_t iov_iter_get_pages2(struct iov_iter *i, struct page **pages,
 			size_t maxsize, unsigned maxpages, size_t *start);
 ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
 		struct page ***pages, size_t maxsize, size_t *start,
-		unsigned gup_flags);
+		iov_iter_extraction_t extraction_flags);
 ssize_t iov_iter_get_pages_alloc2(struct iov_iter *i, struct page ***pages,
 			size_t maxsize, size_t *start);
 int iov_iter_npages(const struct iov_iter *i, int maxpages);
@@ -346,4 +348,8 @@ static inline void iov_iter_ubuf(struct iov_iter *i, unsigned int direction,
 	};
 }
 
+/* Flags for iov_iter_get/extract_pages*() */
+/* Allow P2PDMA on the extracted pages */
+#define ITER_ALLOW_P2PDMA	((__force iov_iter_extraction_t)0x01)
+
 #endif
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index adc5e8aa8ae8..34ee3764d0fa 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1020,9 +1020,9 @@ static struct page *first_bvec_segment(const struct iov_iter *i,
 static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
 		   struct page ***pages, size_t maxsize,
 		   unsigned int maxpages, size_t *start,
-		   unsigned int gup_flags)
+		   iov_iter_extraction_t extraction_flags)
 {
-	unsigned int n;
+	unsigned int n, gup_flags = 0;
 
 	if (maxsize > i->count)
 		maxsize = i->count;
@@ -1030,6 +1030,8 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
 		return 0;
 	if (maxsize > MAX_RW_COUNT)
 		maxsize = MAX_RW_COUNT;
+	if (extraction_flags & ITER_ALLOW_P2PDMA)
+		gup_flags |= FOLL_PCI_P2PDMA;
 
 	if (likely(user_backed_iter(i))) {
 		unsigned long addr;
@@ -1081,14 +1083,14 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
 
 ssize_t iov_iter_get_pages(struct iov_iter *i,
 		   struct page **pages, size_t maxsize, unsigned maxpages,
-		   size_t *start, unsigned gup_flags)
+		   size_t *start, iov_iter_extraction_t extraction_flags)
 {
 	if (!maxpages)
 		return 0;
 	BUG_ON(!pages);
 
 	return __iov_iter_get_pages_alloc(i, &pages, maxsize, maxpages,
-					  start, gup_flags);
+					  start, extraction_flags);
 }
 EXPORT_SYMBOL_GPL(iov_iter_get_pages);
 
@@ -1101,14 +1103,14 @@ EXPORT_SYMBOL(iov_iter_get_pages2);
 
 ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
 		   struct page ***pages, size_t maxsize,
-		   size_t *start, unsigned gup_flags)
+		   size_t *start, iov_iter_extraction_t extraction_flags)
 {
 	ssize_t len;
 
 	*pages = NULL;
 
 	len = __iov_iter_get_pages_alloc(i, pages, maxsize, ~0U, start,
-					 gup_flags);
+					 extraction_flags);
 	if (len <= 0) {
 		kvfree(*pages);
 		*pages = NULL;


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v13 06/12] iov_iter: Add a function to extract a page list from an iterator
  2023-02-09 10:29 [PATCH v13 00/12] iov_iter: Improve page extraction (pin or just list) David Howells
                   ` (4 preceding siblings ...)
  2023-02-09 10:29 ` [PATCH v13 05/12] iov_iter: Define flags to qualify page extraction David Howells
@ 2023-02-09 10:29 ` David Howells
  2023-02-09 10:29 ` [PATCH v13 07/12] iomap: Don't get an reference on ZERO_PAGE for direct I/O block zeroing David Howells
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 35+ messages in thread
From: David Howells @ 2023-02-09 10:29 UTC (permalink / raw)
  To: Jens Axboe, Al Viro, Christoph Hellwig
  Cc: David Howells, Matthew Wilcox, Jan Kara, Jeff Layton,
	David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe, Hillf Danton,
	linux-fsdevel, linux-block, linux-kernel, linux-mm,
	Christoph Hellwig, John Hubbard

Add a function, iov_iter_extract_pages(), to extract a list of pages from
an iterator.  The pages may be returned with a pin added or nothing,
depending on the type of iterator.

Add a second function, iov_iter_extract_will_pin(), to determine how the
cleanup should be done.

There are two cases:

 (1) ITER_IOVEC or ITER_UBUF iterator.

     Extracted pages will have pins (FOLL_PIN) obtained on them so that a
     concurrent fork() will forcibly copy the page so that DMA is done
     to/from the parent's buffer and is unavailable to/unaffected by the
     child process.

     iov_iter_extract_will_pin() will return true for this case.  The
     caller should use something like unpin_user_page() to dispose of the
     page.

 (2) Any other sort of iterator.

     No refs or pins are obtained on the page, the assumption is made that
     the caller will manage page retention.

     iov_iter_extract_will_pin() will return false.  The pages don't need
     additional disposal.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: John Hubbard <jhubbard@nvidia.com>
cc: David Hildenbrand <david@redhat.com>
cc: Matthew Wilcox <willy@infradead.org>
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
---

Notes:
    ver #12)
     - ITER_PIPE is gone, so drop related bits.
     - Don't specify FOLL_PIN as that's implied by pin_user_pages_fast().
    
    ver #11)
     - Fix iov_iter_extract_kvec_pages() to include the offset into the page in
       the returned starting offset.
     - Use __bitwise for the extraction flags
    
    ver #10)
     - Fix use of i->kvec in iov_iter_extract_bvec_pages() to be i->bvec.
    
    ver #9)
     - Rename iov_iter_extract_mode() to iov_iter_extract_will_pin() and make
       it return true/false not FOLL_PIN/0 as FOLL_PIN is going to be made
       private to mm/.
     - Change extract_flags to extraction_flags.
    
    ver #8)
     - It seems that all DIO is supposed to be done under FOLL_PIN now, and not
       FOLL_GET, so switch to only using pin_user_pages() for user-backed
       iters.
     - Wrap an argument in brackets in the iov_iter_extract_mode() macro.
     - Drop the extract_flags argument to iov_iter_extract_mode() for now
       [hch].
    
    ver #7)
     - Switch to passing in iter-specific flags rather than FOLL_* flags.
     - Drop the direction flags for now.
     - Use ITER_ALLOW_P2PDMA to request FOLL_PCI_P2PDMA.
     - Disallow use of ITER_ALLOW_P2PDMA with non-user-backed iter.
     - Add support for extraction from KVEC-type iters.
     - Use iov_iter_advance() rather than open-coding it.
     - Make BVEC- and KVEC-type skip over initial empty vectors.
    
    ver #6)
     - Add back the function to indicate the cleanup mode.
     - Drop the cleanup_mode return arg to iov_iter_extract_pages().
     - Pass FOLL_SOURCE/DEST_BUF in gup_flags.  Check this against the iter
       data_source.
    
    ver #4)
     - Use ITER_SOURCE/DEST instead of WRITE/READ.
     - Allow additional FOLL_* flags, such as FOLL_PCI_P2PDMA to be passed in.
    
    ver #3)
     - Switch to using EXPORT_SYMBOL_GPL to prevent indirect 3rd-party access
       to get/pin_user_pages_fast()[1].

 include/linux/uio.h |  27 ++++-
 lib/iov_iter.c      | 264 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 290 insertions(+), 1 deletion(-)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index af70e4c9ea27..cf6658066736 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -347,9 +347,34 @@ static inline void iov_iter_ubuf(struct iov_iter *i, unsigned int direction,
 		.count = count
 	};
 }
-
 /* Flags for iov_iter_get/extract_pages*() */
 /* Allow P2PDMA on the extracted pages */
 #define ITER_ALLOW_P2PDMA	((__force iov_iter_extraction_t)0x01)
 
+ssize_t iov_iter_extract_pages(struct iov_iter *i, struct page ***pages,
+			       size_t maxsize, unsigned int maxpages,
+			       iov_iter_extraction_t extraction_flags,
+			       size_t *offset0);
+
+/**
+ * iov_iter_extract_will_pin - Indicate how pages from the iterator will be retained
+ * @iter: The iterator
+ *
+ * Examine the iterator and indicate by returning true or false as to how, if
+ * at all, pages extracted from the iterator will be retained by the extraction
+ * function.
+ *
+ * %true indicates that the pages will have a pin placed in them that the
+ * caller must unpin.  This is must be done for DMA/async DIO to force fork()
+ * to forcibly copy a page for the child (the parent must retain the original
+ * page).
+ *
+ * %false indicates that no measures are taken and that it's up to the caller
+ * to retain the pages.
+ */
+static inline bool iov_iter_extract_will_pin(const struct iov_iter *iter)
+{
+	return user_backed_iter(iter);
+}
+
 #endif
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 34ee3764d0fa..8d34b6552179 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1487,3 +1487,267 @@ void iov_iter_restore(struct iov_iter *i, struct iov_iter_state *state)
 		i->iov -= state->nr_segs - i->nr_segs;
 	i->nr_segs = state->nr_segs;
 }
+
+/*
+ * Extract a list of contiguous pages from an ITER_XARRAY iterator.  This does not
+ * get references on the pages, nor does it get a pin on them.
+ */
+static ssize_t iov_iter_extract_xarray_pages(struct iov_iter *i,
+					     struct page ***pages, size_t maxsize,
+					     unsigned int maxpages,
+					     iov_iter_extraction_t extraction_flags,
+					     size_t *offset0)
+{
+	struct page *page, **p;
+	unsigned int nr = 0, offset;
+	loff_t pos = i->xarray_start + i->iov_offset;
+	pgoff_t index = pos >> PAGE_SHIFT;
+	XA_STATE(xas, i->xarray, index);
+
+	offset = pos & ~PAGE_MASK;
+	*offset0 = offset;
+
+	maxpages = want_pages_array(pages, maxsize, offset, maxpages);
+	if (!maxpages)
+		return -ENOMEM;
+	p = *pages;
+
+	rcu_read_lock();
+	for (page = xas_load(&xas); page; page = xas_next(&xas)) {
+		if (xas_retry(&xas, page))
+			continue;
+
+		/* Has the page moved or been split? */
+		if (unlikely(page != xas_reload(&xas))) {
+			xas_reset(&xas);
+			continue;
+		}
+
+		p[nr++] = find_subpage(page, xas.xa_index);
+		if (nr == maxpages)
+			break;
+	}
+	rcu_read_unlock();
+
+	maxsize = min_t(size_t, nr * PAGE_SIZE - offset, maxsize);
+	iov_iter_advance(i, maxsize);
+	return maxsize;
+}
+
+/*
+ * Extract a list of contiguous pages from an ITER_BVEC iterator.  This does
+ * not get references on the pages, nor does it get a pin on them.
+ */
+static ssize_t iov_iter_extract_bvec_pages(struct iov_iter *i,
+					   struct page ***pages, size_t maxsize,
+					   unsigned int maxpages,
+					   iov_iter_extraction_t extraction_flags,
+					   size_t *offset0)
+{
+	struct page **p, *page;
+	size_t skip = i->iov_offset, offset;
+	int k;
+
+	for (;;) {
+		if (i->nr_segs == 0)
+			return 0;
+		maxsize = min(maxsize, i->bvec->bv_len - skip);
+		if (maxsize)
+			break;
+		i->iov_offset = 0;
+		i->nr_segs--;
+		i->bvec++;
+		skip = 0;
+	}
+
+	skip += i->bvec->bv_offset;
+	page = i->bvec->bv_page + skip / PAGE_SIZE;
+	offset = skip % PAGE_SIZE;
+	*offset0 = offset;
+
+	maxpages = want_pages_array(pages, maxsize, offset, maxpages);
+	if (!maxpages)
+		return -ENOMEM;
+	p = *pages;
+	for (k = 0; k < maxpages; k++)
+		p[k] = page + k;
+
+	maxsize = min_t(size_t, maxsize, maxpages * PAGE_SIZE - offset);
+	iov_iter_advance(i, maxsize);
+	return maxsize;
+}
+
+/*
+ * Extract a list of virtually contiguous pages from an ITER_KVEC iterator.
+ * This does not get references on the pages, nor does it get a pin on them.
+ */
+static ssize_t iov_iter_extract_kvec_pages(struct iov_iter *i,
+					   struct page ***pages, size_t maxsize,
+					   unsigned int maxpages,
+					   iov_iter_extraction_t extraction_flags,
+					   size_t *offset0)
+{
+	struct page **p, *page;
+	const void *kaddr;
+	size_t skip = i->iov_offset, offset, len;
+	int k;
+
+	for (;;) {
+		if (i->nr_segs == 0)
+			return 0;
+		maxsize = min(maxsize, i->kvec->iov_len - skip);
+		if (maxsize)
+			break;
+		i->iov_offset = 0;
+		i->nr_segs--;
+		i->kvec++;
+		skip = 0;
+	}
+
+	kaddr = i->kvec->iov_base + skip;
+	offset = (unsigned long)kaddr & ~PAGE_MASK;
+	*offset0 = offset;
+
+	maxpages = want_pages_array(pages, maxsize, offset, maxpages);
+	if (!maxpages)
+		return -ENOMEM;
+	p = *pages;
+
+	kaddr -= offset;
+	len = offset + maxsize;
+	for (k = 0; k < maxpages; k++) {
+		size_t seg = min_t(size_t, len, PAGE_SIZE);
+
+		if (is_vmalloc_or_module_addr(kaddr))
+			page = vmalloc_to_page(kaddr);
+		else
+			page = virt_to_page(kaddr);
+
+		p[k] = page;
+		len -= seg;
+		kaddr += PAGE_SIZE;
+	}
+
+	maxsize = min_t(size_t, maxsize, maxpages * PAGE_SIZE - offset);
+	iov_iter_advance(i, maxsize);
+	return maxsize;
+}
+
+/*
+ * Extract a list of contiguous pages from a user iterator and get a pin on
+ * each of them.  This should only be used if the iterator is user-backed
+ * (IOBUF/UBUF).
+ *
+ * It does not get refs on the pages, but the pages must be unpinned by the
+ * caller once the transfer is complete.
+ *
+ * This is safe to be used where background IO/DMA *is* going to be modifying
+ * the buffer; using a pin rather than a ref makes forces fork() to give the
+ * child a copy of the page.
+ */
+static ssize_t iov_iter_extract_user_pages(struct iov_iter *i,
+					   struct page ***pages,
+					   size_t maxsize,
+					   unsigned int maxpages,
+					   iov_iter_extraction_t extraction_flags,
+					   size_t *offset0)
+{
+	unsigned long addr;
+	unsigned int gup_flags = 0;
+	size_t offset;
+	int res;
+
+	if (i->data_source == ITER_DEST)
+		gup_flags |= FOLL_WRITE;
+	if (extraction_flags & ITER_ALLOW_P2PDMA)
+		gup_flags |= FOLL_PCI_P2PDMA;
+	if (i->nofault)
+		gup_flags |= FOLL_NOFAULT;
+
+	addr = first_iovec_segment(i, &maxsize);
+	*offset0 = offset = addr % PAGE_SIZE;
+	addr &= PAGE_MASK;
+	maxpages = want_pages_array(pages, maxsize, offset, maxpages);
+	if (!maxpages)
+		return -ENOMEM;
+	res = pin_user_pages_fast(addr, maxpages, gup_flags, *pages);
+	if (unlikely(res <= 0))
+		return res;
+	maxsize = min_t(size_t, maxsize, res * PAGE_SIZE - offset);
+	iov_iter_advance(i, maxsize);
+	return maxsize;
+}
+
+/**
+ * iov_iter_extract_pages - Extract a list of contiguous pages from an iterator
+ * @i: The iterator to extract from
+ * @pages: Where to return the list of pages
+ * @maxsize: The maximum amount of iterator to extract
+ * @maxpages: The maximum size of the list of pages
+ * @extraction_flags: Flags to qualify request
+ * @offset0: Where to return the starting offset into (*@pages)[0]
+ *
+ * Extract a list of contiguous pages from the current point of the iterator,
+ * advancing the iterator.  The maximum number of pages and the maximum amount
+ * of page contents can be set.
+ *
+ * If *@pages is NULL, a page list will be allocated to the required size and
+ * *@pages will be set to its base.  If *@pages is not NULL, it will be assumed
+ * that the caller allocated a page list at least @maxpages in size and this
+ * will be filled in.
+ *
+ * @extraction_flags can have ITER_ALLOW_P2PDMA set to request peer-to-peer DMA
+ * be allowed on the pages extracted.
+ *
+ * The iov_iter_extract_will_pin() function can be used to query how cleanup
+ * should be performed.
+ *
+ * Extra refs or pins on the pages may be obtained as follows:
+ *
+ *  (*) If the iterator is user-backed (ITER_IOVEC/ITER_UBUF), pins will be
+ *      added to the pages, but refs will not be taken.
+ *      iov_iter_extract_will_pin() will return true.
+ *
+ *  (*) If the iterator is ITER_KVEC, ITER_BVEC or ITER_XARRAY, the pages are
+ *      merely listed; no extra refs or pins are obtained.
+ *      iov_iter_extract_will_pin() will return 0.
+ *
+ * Note also:
+ *
+ *  (*) Use with ITER_DISCARD is not supported as that has no content.
+ *
+ * On success, the function sets *@pages to the new pagelist, if allocated, and
+ * sets *offset0 to the offset into the first page.
+ *
+ * It may also return -ENOMEM and -EFAULT.
+ */
+ssize_t iov_iter_extract_pages(struct iov_iter *i,
+			       struct page ***pages,
+			       size_t maxsize,
+			       unsigned int maxpages,
+			       iov_iter_extraction_t extraction_flags,
+			       size_t *offset0)
+{
+	maxsize = min_t(size_t, min_t(size_t, maxsize, i->count), MAX_RW_COUNT);
+	if (!maxsize)
+		return 0;
+
+	if (likely(user_backed_iter(i)))
+		return iov_iter_extract_user_pages(i, pages, maxsize,
+						   maxpages, extraction_flags,
+						   offset0);
+	if (iov_iter_is_kvec(i))
+		return iov_iter_extract_kvec_pages(i, pages, maxsize,
+						   maxpages, extraction_flags,
+						   offset0);
+	if (iov_iter_is_bvec(i))
+		return iov_iter_extract_bvec_pages(i, pages, maxsize,
+						   maxpages, extraction_flags,
+						   offset0);
+	if (iov_iter_is_xarray(i))
+		return iov_iter_extract_xarray_pages(i, pages, maxsize,
+						     maxpages, extraction_flags,
+						     offset0);
+	return -EFAULT;
+}
+EXPORT_SYMBOL_GPL(iov_iter_extract_pages);


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v13 07/12] iomap: Don't get an reference on ZERO_PAGE for direct I/O block zeroing
  2023-02-09 10:29 [PATCH v13 00/12] iov_iter: Improve page extraction (pin or just list) David Howells
                   ` (5 preceding siblings ...)
  2023-02-09 10:29 ` [PATCH v13 06/12] iov_iter: Add a function to extract a page list from an iterator David Howells
@ 2023-02-09 10:29 ` David Howells
  2023-02-09 10:29 ` [PATCH v13 08/12] block: Fix bio_flagged() so that gcc can better optimise it David Howells
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 35+ messages in thread
From: David Howells @ 2023-02-09 10:29 UTC (permalink / raw)
  To: Jens Axboe, Al Viro, Christoph Hellwig
  Cc: David Howells, Matthew Wilcox, Jan Kara, Jeff Layton,
	David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe, Hillf Danton,
	linux-fsdevel, linux-block, linux-kernel, linux-mm, John Hubbard

ZERO_PAGE can't go away, no need to hold an extra reference.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: David Hildenbrand <david@redhat.com>
cc: linux-fsdevel@vger.kernel.org
---
 fs/iomap/direct-io.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 9804714b1751..47db4ead1e74 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -202,7 +202,7 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
 	bio->bi_private = dio;
 	bio->bi_end_io = iomap_dio_bio_end_io;
 
-	get_page(page);
+	bio_set_flag(bio, BIO_NO_PAGE_REF);
 	__bio_add_page(bio, page, len, 0);
 	iomap_dio_submit_bio(iter, dio, bio, pos);
 }


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v13 08/12] block: Fix bio_flagged() so that gcc can better optimise it
  2023-02-09 10:29 [PATCH v13 00/12] iov_iter: Improve page extraction (pin or just list) David Howells
                   ` (6 preceding siblings ...)
  2023-02-09 10:29 ` [PATCH v13 07/12] iomap: Don't get an reference on ZERO_PAGE for direct I/O block zeroing David Howells
@ 2023-02-09 10:29 ` David Howells
  2023-02-09 10:29 ` [PATCH v13 09/12] block: Replace BIO_NO_PAGE_REF with BIO_PAGE_REFFED with inverted logic David Howells
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 35+ messages in thread
From: David Howells @ 2023-02-09 10:29 UTC (permalink / raw)
  To: Jens Axboe, Al Viro, Christoph Hellwig
  Cc: David Howells, Matthew Wilcox, Jan Kara, Jeff Layton,
	David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe, Hillf Danton,
	linux-fsdevel, linux-block, linux-kernel, linux-mm,
	Christoph Hellwig, John Hubbard

Fix bio_flagged() so that multiple instances of it, such as:

	if (bio_flagged(bio, BIO_PAGE_REFFED) ||
	    bio_flagged(bio, BIO_PAGE_PINNED))

can be combined by the gcc optimiser into a single test in assembly
(arguably, this is a compiler optimisation issue[1]).

The missed optimisation stems from bio_flagged() comparing the result of
the bitwise-AND to zero.  This results in an out-of-line bio_release_page()
being compiled to something like:

   <+0>:     mov    0x14(%rdi),%eax
   <+3>:     test   $0x1,%al
   <+5>:     jne    0xffffffff816dac53 <bio_release_pages+11>
   <+7>:     test   $0x2,%al
   <+9>:     je     0xffffffff816dac5c <bio_release_pages+20>
   <+11>:    movzbl %sil,%esi
   <+15>:    jmp    0xffffffff816daba1 <__bio_release_pages>
   <+20>:    jmp    0xffffffff81d0b800 <__x86_return_thunk>

However, the test is superfluous as the return type is bool.  Removing it
results in:

   <+0>:     testb  $0x3,0x14(%rdi)
   <+4>:     je     0xffffffff816e4af4 <bio_release_pages+15>
   <+6>:     movzbl %sil,%esi
   <+10>:    jmp    0xffffffff816dab7c <__bio_release_pages>
   <+15>:    jmp    0xffffffff81d0b7c0 <__x86_return_thunk>

instead.

Also, the MOVZBL instruction looks unnecessary[2] - I think it's just
're-booling' the mark_dirty parameter.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: linux-block@vger.kernel.org
Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108370 [1]
Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108371 [2]
Link: https://lore.kernel.org/r/167391056756.2311931.356007731815807265.stgit@warthog.procyon.org.uk/ # v6
---
 include/linux/bio.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/bio.h b/include/linux/bio.h
index c1da63f6c808..10366b8bdb13 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -227,7 +227,7 @@ static inline void bio_cnt_set(struct bio *bio, unsigned int count)
 
 static inline bool bio_flagged(struct bio *bio, unsigned int bit)
 {
-	return (bio->bi_flags & (1U << bit)) != 0;
+	return bio->bi_flags & (1U << bit);
 }
 
 static inline void bio_set_flag(struct bio *bio, unsigned int bit)


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v13 09/12] block: Replace BIO_NO_PAGE_REF with BIO_PAGE_REFFED with inverted logic
  2023-02-09 10:29 [PATCH v13 00/12] iov_iter: Improve page extraction (pin or just list) David Howells
                   ` (7 preceding siblings ...)
  2023-02-09 10:29 ` [PATCH v13 08/12] block: Fix bio_flagged() so that gcc can better optimise it David Howells
@ 2023-02-09 10:29 ` David Howells
  2023-02-09 10:29 ` [PATCH v13 10/12] block: Add BIO_PAGE_PINNED and associated infrastructure David Howells
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 35+ messages in thread
From: David Howells @ 2023-02-09 10:29 UTC (permalink / raw)
  To: Jens Axboe, Al Viro, Christoph Hellwig
  Cc: David Howells, Matthew Wilcox, Jan Kara, Jeff Layton,
	David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe, Hillf Danton,
	linux-fsdevel, linux-block, linux-kernel, linux-mm,
	Christoph Hellwig, John Hubbard

From: Christoph Hellwig <hch@lst.de>

Replace BIO_NO_PAGE_REF with a BIO_PAGE_REFFED flag that has the inverted
meaning is only set when a page reference has been acquired that needs to
be released by bio_release_pages().

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Jan Kara <jack@suse.cz>
cc: Matthew Wilcox <willy@infradead.org>
cc: Logan Gunthorpe <logang@deltatee.com>
cc: linux-block@vger.kernel.org
---

Notes:
    ver #8)
     - Split out from another patch [hch].
     - Don't default to BIO_PAGE_REFFED [hch].
    
    ver #5)
     - Split from patch that uses iov_iter_extract_pages().

 block/bio.c               | 2 +-
 block/blk-map.c           | 1 +
 fs/direct-io.c            | 2 ++
 fs/iomap/direct-io.c      | 1 -
 include/linux/bio.h       | 2 +-
 include/linux/blk_types.h | 2 +-
 6 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index b97f3991c904..bf9bf53232be 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1198,7 +1198,6 @@ void bio_iov_bvec_set(struct bio *bio, struct iov_iter *iter)
 	bio->bi_io_vec = (struct bio_vec *)iter->bvec;
 	bio->bi_iter.bi_bvec_done = iter->iov_offset;
 	bio->bi_iter.bi_size = size;
-	bio_set_flag(bio, BIO_NO_PAGE_REF);
 	bio_set_flag(bio, BIO_CLONED);
 }
 
@@ -1343,6 +1342,7 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 		return 0;
 	}
 
+	bio_set_flag(bio, BIO_PAGE_REFFED);
 	do {
 		ret = __bio_iov_iter_get_pages(bio, iter);
 	} while (!ret && iov_iter_count(iter) && !bio_full(bio, 0));
diff --git a/block/blk-map.c b/block/blk-map.c
index 080dd60485be..f1f70b50388d 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -282,6 +282,7 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
 	if (blk_queue_pci_p2pdma(rq->q))
 		extraction_flags |= ITER_ALLOW_P2PDMA;
 
+	bio_set_flag(bio, BIO_PAGE_REFFED);
 	while (iov_iter_count(iter)) {
 		struct page **pages, *stack_pages[UIO_FASTIOV];
 		ssize_t bytes;
diff --git a/fs/direct-io.c b/fs/direct-io.c
index 03d381377ae1..07810465fc9d 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -403,6 +403,8 @@ dio_bio_alloc(struct dio *dio, struct dio_submit *sdio,
 		bio->bi_end_io = dio_bio_end_aio;
 	else
 		bio->bi_end_io = dio_bio_end_io;
+	/* for now require references for all pages */
+	bio_set_flag(bio, BIO_PAGE_REFFED);
 	sdio->bio = bio;
 	sdio->logical_offset_in_bio = sdio->cur_page_fs_offset;
 }
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 47db4ead1e74..c0e75900e754 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -202,7 +202,6 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
 	bio->bi_private = dio;
 	bio->bi_end_io = iomap_dio_bio_end_io;
 
-	bio_set_flag(bio, BIO_NO_PAGE_REF);
 	__bio_add_page(bio, page, len, 0);
 	iomap_dio_submit_bio(iter, dio, bio, pos);
 }
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 10366b8bdb13..805957c99147 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -484,7 +484,7 @@ void zero_fill_bio(struct bio *bio);
 
 static inline void bio_release_pages(struct bio *bio, bool mark_dirty)
 {
-	if (!bio_flagged(bio, BIO_NO_PAGE_REF))
+	if (bio_flagged(bio, BIO_PAGE_REFFED))
 		__bio_release_pages(bio, mark_dirty);
 }
 
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 99be590f952f..7daa261f4f98 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -318,7 +318,7 @@ struct bio {
  * bio flags
  */
 enum {
-	BIO_NO_PAGE_REF,	/* don't put release vec pages */
+	BIO_PAGE_REFFED,	/* put pages in bio_release_pages() */
 	BIO_CLONED,		/* doesn't own data */
 	BIO_BOUNCED,		/* bio is a bounce bio */
 	BIO_QUIET,		/* Make BIO Quiet */


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v13 10/12] block: Add BIO_PAGE_PINNED and associated infrastructure
  2023-02-09 10:29 [PATCH v13 00/12] iov_iter: Improve page extraction (pin or just list) David Howells
                   ` (8 preceding siblings ...)
  2023-02-09 10:29 ` [PATCH v13 09/12] block: Replace BIO_NO_PAGE_REF with BIO_PAGE_REFFED with inverted logic David Howells
@ 2023-02-09 10:29 ` David Howells
  2023-02-09 10:29 ` [PATCH v13 11/12] block: Convert bio_iov_iter_get_pages to use iov_iter_extract_pages David Howells
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 35+ messages in thread
From: David Howells @ 2023-02-09 10:29 UTC (permalink / raw)
  To: Jens Axboe, Al Viro, Christoph Hellwig
  Cc: David Howells, Matthew Wilcox, Jan Kara, Jeff Layton,
	David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe, Hillf Danton,
	linux-fsdevel, linux-block, linux-kernel, linux-mm,
	Christoph Hellwig, John Hubbard

Add BIO_PAGE_PINNED to indicate that the pages in a bio are pinned
(FOLL_PIN) and that the pin will need removing.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Jan Kara <jack@suse.cz>
cc: Matthew Wilcox <willy@infradead.org>
cc: Logan Gunthorpe <logang@deltatee.com>
cc: linux-block@vger.kernel.org
---

Notes:
    ver #10)
     - Drop bio_set_cleanup_mode(), open coding it instead.
    
    ver #9)
     - Only consider pinning in bio_set_cleanup_mode().  Ref'ing pages in
       struct bio is going away.
     - page_put_unpin() is removed; call unpin_user_page() and put_page()
       directly.
     - Use bio_release_page() in __bio_release_pages().
     - BIO_PAGE_PINNED and BIO_PAGE_REFFED can't both be set, so use if-else
       when testing both of them.
    
    ver #8)
     - Move the infrastructure to clean up pinned pages to this patch [hch].
     - Put BIO_PAGE_PINNED before BIO_PAGE_REFFED as the latter should
       probably be removed at some point.  FOLL_PIN can then be renumbered
       first.

 block/bio.c               |  6 +++---
 block/blk.h               | 12 ++++++++++++
 include/linux/bio.h       |  3 ++-
 include/linux/blk_types.h |  1 +
 4 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index bf9bf53232be..547e38883934 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1176,7 +1176,7 @@ void __bio_release_pages(struct bio *bio, bool mark_dirty)
 	bio_for_each_segment_all(bvec, bio, iter_all) {
 		if (mark_dirty && !PageCompound(bvec->bv_page))
 			set_page_dirty_lock(bvec->bv_page);
-		put_page(bvec->bv_page);
+		bio_release_page(bio, bvec->bv_page);
 	}
 }
 EXPORT_SYMBOL_GPL(__bio_release_pages);
@@ -1496,8 +1496,8 @@ void bio_set_pages_dirty(struct bio *bio)
  * the BIO and re-dirty the pages in process context.
  *
  * It is expected that bio_check_pages_dirty() will wholly own the BIO from
- * here on.  It will run one put_page() against each page and will run one
- * bio_put() against the BIO.
+ * here on.  It will unpin each page and will run one bio_put() against the
+ * BIO.
  */
 
 static void bio_dirty_fn(struct work_struct *work);
diff --git a/block/blk.h b/block/blk.h
index 4c3b3325219a..f02381405311 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -425,6 +425,18 @@ int bio_add_hw_page(struct request_queue *q, struct bio *bio,
 		struct page *page, unsigned int len, unsigned int offset,
 		unsigned int max_sectors, bool *same_page);
 
+/*
+ * Clean up a page appropriately, where the page may be pinned, may have a
+ * ref taken on it or neither.
+ */
+static inline void bio_release_page(struct bio *bio, struct page *page)
+{
+	if (bio_flagged(bio, BIO_PAGE_PINNED))
+		unpin_user_page(page);
+	else if (bio_flagged(bio, BIO_PAGE_REFFED))
+		put_page(page);
+}
+
 struct request_queue *blk_alloc_queue(int node_id);
 
 int disk_scan_partitions(struct gendisk *disk, fmode_t mode, void *owner);
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 805957c99147..b2c09997d79c 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -484,7 +484,8 @@ void zero_fill_bio(struct bio *bio);
 
 static inline void bio_release_pages(struct bio *bio, bool mark_dirty)
 {
-	if (bio_flagged(bio, BIO_PAGE_REFFED))
+	if (bio_flagged(bio, BIO_PAGE_REFFED) ||
+	    bio_flagged(bio, BIO_PAGE_PINNED))
 		__bio_release_pages(bio, mark_dirty);
 }
 
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 7daa261f4f98..a0e339ff3d09 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -318,6 +318,7 @@ struct bio {
  * bio flags
  */
 enum {
+	BIO_PAGE_PINNED,	/* Unpin pages in bio_release_pages() */
 	BIO_PAGE_REFFED,	/* put pages in bio_release_pages() */
 	BIO_CLONED,		/* doesn't own data */
 	BIO_BOUNCED,		/* bio is a bounce bio */


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v13 11/12] block: Convert bio_iov_iter_get_pages to use iov_iter_extract_pages
  2023-02-09 10:29 [PATCH v13 00/12] iov_iter: Improve page extraction (pin or just list) David Howells
                   ` (9 preceding siblings ...)
  2023-02-09 10:29 ` [PATCH v13 10/12] block: Add BIO_PAGE_PINNED and associated infrastructure David Howells
@ 2023-02-09 10:29 ` David Howells
  2023-02-09 10:29 ` [PATCH v13 12/12] block: convert bio_map_user_iov " David Howells
  2023-02-10 22:31 ` [PATCH v13 00/12] iov_iter: Improve page extraction (pin or just list) Jens Axboe
  12 siblings, 0 replies; 35+ messages in thread
From: David Howells @ 2023-02-09 10:29 UTC (permalink / raw)
  To: Jens Axboe, Al Viro, Christoph Hellwig
  Cc: David Howells, Matthew Wilcox, Jan Kara, Jeff Layton,
	David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe, Hillf Danton,
	linux-fsdevel, linux-block, linux-kernel, linux-mm,
	Christoph Hellwig, John Hubbard

This will pin pages or leave them unaltered rather than getting a ref on
them as appropriate to the iterator.

The pages need to be pinned for DIO rather than having refs taken on them to
prevent VM copy-on-write from malfunctioning during a concurrent fork() (the
result of the I/O could otherwise end up being affected by/visible to the
child process).

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Jan Kara <jack@suse.cz>
cc: Matthew Wilcox <willy@infradead.org>
cc: Logan Gunthorpe <logang@deltatee.com>
cc: linux-block@vger.kernel.org
---

Notes:
    ver #10)
     - Drop bio_set_cleanup_mode(), open coding it instead.
    
    ver #8)
     - Split the patch up a bit [hch].
     - We should only be using pinned/non-pinned pages and not ref'd pages,
       so adjust the comments appropriately.
    
    ver #7)
     - Don't treat BIO_PAGE_REFFED/PINNED as being the same as FOLL_GET/PIN.
    
    ver #5)
     - Transcribe the FOLL_* flags returned by iov_iter_extract_pages() to
       BIO_* flags and got rid of bi_cleanup_mode.
     - Replaced BIO_NO_PAGE_REF to BIO_PAGE_REFFED in the preceding patch.

 block/bio.c | 23 ++++++++++++-----------
 1 file changed, 12 insertions(+), 11 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 547e38883934..fc57f0aa098e 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1212,7 +1212,7 @@ static int bio_iov_add_page(struct bio *bio, struct page *page,
 	}
 
 	if (same_page)
-		put_page(page);
+		bio_release_page(bio, page);
 	return 0;
 }
 
@@ -1226,7 +1226,7 @@ static int bio_iov_add_zone_append_page(struct bio *bio, struct page *page,
 			queue_max_zone_append_sectors(q), &same_page) != len)
 		return -EINVAL;
 	if (same_page)
-		put_page(page);
+		bio_release_page(bio, page);
 	return 0;
 }
 
@@ -1237,10 +1237,10 @@ static int bio_iov_add_zone_append_page(struct bio *bio, struct page *page,
  * @bio: bio to add pages to
  * @iter: iov iterator describing the region to be mapped
  *
- * Pins pages from *iter and appends them to @bio's bvec array. The
- * pages will have to be released using put_page() when done.
- * For multi-segment *iter, this function only adds pages from the
- * next non-empty segment of the iov iterator.
+ * Extracts pages from *iter and appends them to @bio's bvec array.  The pages
+ * will have to be cleaned up in the way indicated by the BIO_PAGE_PINNED flag.
+ * For a multi-segment *iter, this function only adds pages from the next
+ * non-empty segment of the iov iterator.
  */
 static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 {
@@ -1272,9 +1272,9 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	 * result to ensure the bio's total size is correct. The remainder of
 	 * the iov data will be picked up in the next bio iteration.
 	 */
-	size = iov_iter_get_pages(iter, pages,
-				  UINT_MAX - bio->bi_iter.bi_size,
-				  nr_pages, &offset, extraction_flags);
+	size = iov_iter_extract_pages(iter, &pages,
+				      UINT_MAX - bio->bi_iter.bi_size,
+				      nr_pages, extraction_flags, &offset);
 	if (unlikely(size <= 0))
 		return size ? size : -EFAULT;
 
@@ -1307,7 +1307,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	iov_iter_revert(iter, left);
 out:
 	while (i < nr_pages)
-		put_page(pages[i++]);
+		bio_release_page(bio, pages[i++]);
 
 	return ret;
 }
@@ -1342,7 +1342,8 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 		return 0;
 	}
 
-	bio_set_flag(bio, BIO_PAGE_REFFED);
+	if (iov_iter_extract_will_pin(iter))
+		bio_set_flag(bio, BIO_PAGE_PINNED);
 	do {
 		ret = __bio_iov_iter_get_pages(bio, iter);
 	} while (!ret && iov_iter_count(iter) && !bio_full(bio, 0));


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v13 12/12] block: convert bio_map_user_iov to use iov_iter_extract_pages
  2023-02-09 10:29 [PATCH v13 00/12] iov_iter: Improve page extraction (pin or just list) David Howells
                   ` (10 preceding siblings ...)
  2023-02-09 10:29 ` [PATCH v13 11/12] block: Convert bio_iov_iter_get_pages to use iov_iter_extract_pages David Howells
@ 2023-02-09 10:29 ` David Howells
  2023-02-10 22:31 ` [PATCH v13 00/12] iov_iter: Improve page extraction (pin or just list) Jens Axboe
  12 siblings, 0 replies; 35+ messages in thread
From: David Howells @ 2023-02-09 10:29 UTC (permalink / raw)
  To: Jens Axboe, Al Viro, Christoph Hellwig
  Cc: David Howells, Matthew Wilcox, Jan Kara, Jeff Layton,
	David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe, Hillf Danton,
	linux-fsdevel, linux-block, linux-kernel, linux-mm,
	Christoph Hellwig, John Hubbard

This will pin pages or leave them unaltered rather than getting a ref on
them as appropriate to the iterator.

The pages need to be pinned for DIO rather than having refs taken on them
to prevent VM copy-on-write from malfunctioning during a concurrent fork()
(the result of the I/O could otherwise end up being visible to/affected by
the child process).

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Jan Kara <jack@suse.cz>
cc: Matthew Wilcox <willy@infradead.org>
cc: Logan Gunthorpe <logang@deltatee.com>
cc: linux-block@vger.kernel.org
---

Notes:
    ver #10)
     - Drop bio_set_cleanup_mode(), open coding it instead.
    
    ver #8)
     - Split the patch up a bit [hch].
     - We should only be using pinned/non-pinned pages and not ref'd pages,
       so adjust the comments appropriately.
    
    ver #7)
     - Don't treat BIO_PAGE_REFFED/PINNED as being the same as FOLL_GET/PIN.
    
    ver #5)
     - Transcribe the FOLL_* flags returned by iov_iter_extract_pages() to
       BIO_* flags and got rid of bi_cleanup_mode.
     - Replaced BIO_NO_PAGE_REF to BIO_PAGE_REFFED in the preceding patch.

 block/blk-map.c | 23 +++++++++++------------
 1 file changed, 11 insertions(+), 12 deletions(-)

diff --git a/block/blk-map.c b/block/blk-map.c
index f1f70b50388d..0f1593e144da 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -281,22 +281,21 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
 
 	if (blk_queue_pci_p2pdma(rq->q))
 		extraction_flags |= ITER_ALLOW_P2PDMA;
+	if (iov_iter_extract_will_pin(iter))
+		bio_set_flag(bio, BIO_PAGE_PINNED);
 
-	bio_set_flag(bio, BIO_PAGE_REFFED);
 	while (iov_iter_count(iter)) {
-		struct page **pages, *stack_pages[UIO_FASTIOV];
+		struct page *stack_pages[UIO_FASTIOV];
+		struct page **pages = stack_pages;
 		ssize_t bytes;
 		size_t offs;
 		int npages;
 
-		if (nr_vecs <= ARRAY_SIZE(stack_pages)) {
-			pages = stack_pages;
-			bytes = iov_iter_get_pages(iter, pages, LONG_MAX,
-						   nr_vecs, &offs, extraction_flags);
-		} else {
-			bytes = iov_iter_get_pages_alloc(iter, &pages,
-						LONG_MAX, &offs, extraction_flags);
-		}
+		if (nr_vecs > ARRAY_SIZE(stack_pages))
+			pages = NULL;
+
+		bytes = iov_iter_extract_pages(iter, &pages, LONG_MAX,
+					       nr_vecs, extraction_flags, &offs);
 		if (unlikely(bytes <= 0)) {
 			ret = bytes ? bytes : -EFAULT;
 			goto out_unmap;
@@ -318,7 +317,7 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
 				if (!bio_add_hw_page(rq->q, bio, page, n, offs,
 						     max_sectors, &same_page)) {
 					if (same_page)
-						put_page(page);
+						bio_release_page(bio, page);
 					break;
 				}
 
@@ -330,7 +329,7 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
 		 * release the pages we didn't map into the bio, if any
 		 */
 		while (j < npages)
-			put_page(pages[j++]);
+			bio_release_page(bio, pages[j++]);
 		if (pages != stack_pages)
 			kvfree(pages);
 		/* couldn't stuff something into bio? */


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH v13 01/12] splice: Fix O_DIRECT file read splice to avoid reversion of ITER_PIPE
  2023-02-09 10:29 ` [PATCH v13 01/12] splice: Fix O_DIRECT file read splice to avoid reversion of ITER_PIPE David Howells
@ 2023-02-09 14:53   ` Matthew Wilcox
  2023-02-09 15:06   ` David Howells
  2023-02-09 16:15   ` [PATCH v14 " David Howells
  2 siblings, 0 replies; 35+ messages in thread
From: Matthew Wilcox @ 2023-02-09 14:53 UTC (permalink / raw)
  To: David Howells
  Cc: Jens Axboe, Al Viro, Christoph Hellwig, Jan Kara, Jeff Layton,
	David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe, Hillf Danton,
	linux-fsdevel, linux-block, linux-kernel, linux-mm,
	syzbot+a440341a59e3b7142895, Christoph Hellwig, John Hubbard

On Thu, Feb 09, 2023 at 10:29:43AM +0000, David Howells wrote:
> +	npages = alloc_pages_bulk_list(GFP_USER, npages, &pages);

Please don't use alloc_pages_bulk_list().  If nobody uses it, it can go
away again soon.  Does alloc_pages_bulk_array() work for you?  It's
faster.

> +	/* Free any pages that didn't get touched at all. */
> +	for (; reclaim >= PAGE_SIZE; reclaim -= PAGE_SIZE)
> +		__free_page(bv[--npages].bv_page);

If you have that array, you can then use release_pages() to free
them, which will be faster.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v13 01/12] splice: Fix O_DIRECT file read splice to avoid reversion of ITER_PIPE
  2023-02-09 10:29 ` [PATCH v13 01/12] splice: Fix O_DIRECT file read splice to avoid reversion of ITER_PIPE David Howells
  2023-02-09 14:53   ` Matthew Wilcox
@ 2023-02-09 15:06   ` David Howells
  2023-02-09 16:15   ` [PATCH v14 " David Howells
  2 siblings, 0 replies; 35+ messages in thread
From: David Howells @ 2023-02-09 15:06 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: dhowells, Jens Axboe, Al Viro, Christoph Hellwig, Jan Kara,
	Jeff Layton, David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, linux-fsdevel, linux-block, linux-kernel, linux-mm,
	syzbot+a440341a59e3b7142895, Christoph Hellwig, John Hubbard

Matthew Wilcox <willy@infradead.org> wrote:

> On Thu, Feb 09, 2023 at 10:29:43AM +0000, David Howells wrote:
> > +	npages = alloc_pages_bulk_list(GFP_USER, npages, &pages);
> 
> Please don't use alloc_pages_bulk_list().  If nobody uses it, it can go
> away again soon.  Does alloc_pages_bulk_array() work for you?  It's
> faster.

Sure.

> > +	/* Free any pages that didn't get touched at all. */
> > +	for (; reclaim >= PAGE_SIZE; reclaim -= PAGE_SIZE)
> > +		__free_page(bv[--npages].bv_page);
> 
> If you have that array, you can then use release_pages() to free
> them, which will be faster.

Um.  I would normally overlay the array on end of the bvec[] so that I could
save on an allocation (I have to fill in the bvec[] anyway) - which means I
wouldn't still have the array at release time.  But in this case I can make an
exception, though I would've thought that the expectation would be that all
the requested data would be fetched.

David


^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH v14 01/12] splice: Fix O_DIRECT file read splice to avoid reversion of ITER_PIPE
  2023-02-09 10:29 ` [PATCH v13 01/12] splice: Fix O_DIRECT file read splice to avoid reversion of ITER_PIPE David Howells
  2023-02-09 14:53   ` Matthew Wilcox
  2023-02-09 15:06   ` David Howells
@ 2023-02-09 16:15   ` David Howells
  2023-02-13  8:22     ` Christoph Hellwig
                       ` (3 more replies)
  2 siblings, 4 replies; 35+ messages in thread
From: David Howells @ 2023-02-09 16:15 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: dhowells, Jens Axboe, Al Viro, Christoph Hellwig, Jan Kara,
	Jeff Layton, David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, linux-fsdevel, linux-block, linux-kernel, linux-mm,
	syzbot+a440341a59e3b7142895, Christoph Hellwig, John Hubbard

Matthew Wilcox <willy@infradead.org> wrote:

> Please don't use alloc_pages_bulk_list().
...
> If you have that array, you can then use release_pages() ...

Done.  See attached replacement patch.

David
---
splice: Fix O_DIRECT file read splice to avoid reversion of ITER_PIPE

With the upcoming iov_iter_extract_pages() function, pages extracted from a
non-user-backed iterator such as ITER_PIPE aren't pinned.
__iomap_dio_rw(), however, calls iov_iter_revert() to shorten the iterator
to just the bufferage it is going to use - which has the side-effect of
freeing the excess pipe buffers, even though they're attached to a bio and
may get written to by DMA (thanks to Hillf Danton for spotting this[1]).

This then causes memory corruption that is particularly noticable when the
syzbot test[2] is run.  The test boils down to:

        out = creat(argv[1], 0666);
        ftruncate(out, 0x800);
        lseek(out, 0x200, SEEK_SET);
        in = open(argv[1], O_RDONLY | O_DIRECT | O_NOFOLLOW);
        sendfile(out, in, NULL, 0x1dd00);

run repeatedly in parallel.  What I think is happening is that ftruncate()
occasionally shortens the DIO read that's about to be made by sendfile's
splice core by reducing i_size.

Fix this by splitting the handling of a splice from an O_DIRECT file fd off
from that of non-DIO and in this case, replacing the use of an ITER_PIPE
iterator with an ITER_BVEC iterator for which reversion won't free the
buffers.  The DIO-specific code bulk allocates all the buffers it thinks it
is going to use in advance, does the read synchronously and only then trims
the buffer down.  The pages we did use get pushed into the pipe.

This should be more efficient for DIO read by virtue of doing a bulk page
allocation, but slightly less efficient by ignoring any partial page in the
pipe.

Fixes: 920756a3306a ("block: Convert bio_iov_iter_get_pages to use iov_iter_extract_pages")
Reported-by: syzbot+a440341a59e3b7142895@syzkaller.appspotmail.com
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Christoph Hellwig <hch@lst.de>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: David Hildenbrand <david@redhat.com>
cc: John Hubbard <jhubbard@nvidia.com>
cc: linux-mm@kvack.org
cc: linux-block@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20230207094731.1390-1-hdanton@sina.com/ [1]
Link: https://lore.kernel.org/r/000000000000b0b3c005f3a09383@google.com/ [2]
---
Notes:
    ver #14)
     - Use alloc_pages_bulk_array() rather than alloc_pages_bulk_list().
     - Use release_pages() rather than a loop calling __free_page().
    
    ver #13)
     - Don't completely replace generic_file_splice_read(), but rather only use
       this if we're doing a splicing from an O_DIRECT file fd.

 fs/splice.c |   98 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 98 insertions(+)

diff --git a/fs/splice.c b/fs/splice.c
index 5969b7a1d353..91244270b36e 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -282,6 +282,101 @@ void splice_shrink_spd(struct splice_pipe_desc *spd)
 	kfree(spd->partial);
 }
 
+/*
+ * Splice data from an O_DIRECT file into pages and then add them to the output
+ * pipe.
+ */
+static ssize_t generic_file_direct_splice_read(struct file *in, loff_t *ppos,
+					       struct pipe_inode_info *pipe,
+					       size_t len, unsigned int flags)
+{
+	struct iov_iter to;
+	struct bio_vec *bv;
+	struct kiocb kiocb;
+	struct page **pages;
+	unsigned int head;
+	ssize_t ret;
+	size_t used, npages, chunk, remain, reclaim;
+	int i;
+
+	/* Work out how much data we can actually add into the pipe */
+	used = pipe_occupancy(pipe->head, pipe->tail);
+	npages = max_t(ssize_t, pipe->max_usage - used, 0);
+	len = min_t(size_t, len, npages * PAGE_SIZE);
+	npages = DIV_ROUND_UP(len, PAGE_SIZE);
+
+	bv = kzalloc(array_size(npages, sizeof(bv[0])) +
+		     array_size(npages, sizeof(struct page *)), GFP_KERNEL);
+	if (!bv)
+		return -ENOMEM;
+
+	pages = (void *)(bv + npages);
+	npages = alloc_pages_bulk_array(GFP_USER, npages, pages);
+	if (!npages) {
+		kfree(bv);
+		return -ENOMEM;
+	}
+
+	remain = len = min_t(size_t, len, npages * PAGE_SIZE);
+
+	for (i = 0; i < npages; i++) {
+		chunk = min_t(size_t, PAGE_SIZE, remain);
+		bv[i].bv_page = pages[i];
+		bv[i].bv_offset = 0;
+		bv[i].bv_len = chunk;
+		remain -= chunk;
+	}
+
+	/* Do the I/O */
+	iov_iter_bvec(&to, ITER_DEST, bv, npages, len);
+	init_sync_kiocb(&kiocb, in);
+	kiocb.ki_pos = *ppos;
+	ret = call_read_iter(in, &kiocb, &to);
+
+	reclaim = npages * PAGE_SIZE;
+	remain = 0;
+	if (ret > 0) {
+		reclaim -= ret;
+		remain = ret;
+		*ppos = kiocb.ki_pos;
+		file_accessed(in);
+	} else if (ret < 0) {
+		/*
+		 * callers of ->splice_read() expect -EAGAIN on
+		 * "can't put anything in there", rather than -EFAULT.
+		 */
+		if (ret == -EFAULT)
+			ret = -EAGAIN;
+	}
+
+	/* Free any pages that didn't get touched at all. */
+	reclaim /= PAGE_SIZE;
+	if (reclaim) {
+		npages -= reclaim;
+		release_pages(pages + npages, reclaim);
+	}
+
+	/* Push the remaining pages into the pipe. */
+	head = pipe->head;
+	for (i = 0; i < npages; i++) {
+		struct pipe_buffer *buf = &pipe->bufs[head & (pipe->ring_size - 1)];
+
+		chunk = min_t(size_t, remain, PAGE_SIZE);
+		*buf = (struct pipe_buffer) {
+			.ops	= &default_pipe_buf_ops,
+			.page	= bv[i].bv_page,
+			.offset	= 0,
+			.len	= chunk,
+		};
+		head++;
+		remain -= chunk;
+	}
+	pipe->head = head;
+
+	kfree(bv);
+	return ret;
+}
+
 /**
  * generic_file_splice_read - splice data from file to a pipe
  * @in:		file to splice from
@@ -303,6 +398,9 @@ ssize_t generic_file_splice_read(struct file *in, loff_t *ppos,
 	struct kiocb kiocb;
 	int ret;
 
+	if (in->f_flags & O_DIRECT)
+		return generic_file_direct_splice_read(in, ppos, pipe, len, flags);
+
 	iov_iter_pipe(&to, ITER_DEST, pipe, len);
 	init_sync_kiocb(&kiocb, in);
 	kiocb.ki_pos = *ppos;


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH v13 00/12] iov_iter: Improve page extraction (pin or just list)
  2023-02-09 10:29 [PATCH v13 00/12] iov_iter: Improve page extraction (pin or just list) David Howells
                   ` (11 preceding siblings ...)
  2023-02-09 10:29 ` [PATCH v13 12/12] block: convert bio_map_user_iov " David Howells
@ 2023-02-10 22:31 ` Jens Axboe
  12 siblings, 0 replies; 35+ messages in thread
From: Jens Axboe @ 2023-02-10 22:31 UTC (permalink / raw)
  To: David Howells, Al Viro, Christoph Hellwig
  Cc: Matthew Wilcox, Jan Kara, Jeff Layton, David Hildenbrand,
	Jason Gunthorpe, Logan Gunthorpe, Hillf Danton, linux-fsdevel,
	linux-block, linux-kernel, linux-mm

On 2/9/23 3:29?AM, David Howells wrote:
> Hi Jens, Al, Christoph,
> 
> Here are patches to provide support for extracting pages from an iov_iter
> and to use this in the extraction functions in the block layer bio code.

[snip]

I updated the branch to v13, just as a heads-up. Still in
for-6.3/iov-extract as before.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v14 01/12] splice: Fix O_DIRECT file read splice to avoid reversion of ITER_PIPE
  2023-02-09 16:15   ` [PATCH v14 " David Howells
@ 2023-02-13  8:22     ` Christoph Hellwig
  2023-02-15 13:17     ` David Howells
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 35+ messages in thread
From: Christoph Hellwig @ 2023-02-13  8:22 UTC (permalink / raw)
  To: David Howells
  Cc: Matthew Wilcox, Jens Axboe, Al Viro, Christoph Hellwig, Jan Kara,
	Jeff Layton, David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, linux-fsdevel, linux-block, linux-kernel, linux-mm,
	syzbot+a440341a59e3b7142895, Christoph Hellwig, John Hubbard

> +	if (!bv)
> +		return -ENOMEM;
> +
> +	pages = (void *)(bv + npages);

I think this cast should be to struct page **… not void *.

> +	npages = alloc_pages_bulk_array(GFP_USER, npages, pages);
> +	if (!npages) {
> +		kfree(bv);
> +		return -ENOMEM;
> +	}

> +	reclaim = npages * PAGE_SIZE;
> +	remain = 0;
> +	if (ret > 0) {
> +		reclaim -= ret;
> +		remain = ret;

...

> +	/* Free any pages that didn't get touched at all. */
> +	reclaim /= PAGE_SIZE;

Any reason not to keep reclaim in PAGE_SIZE units to start with?


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v13 02/12] mm: Pass info, not iter, into filemap_get_pages() and unstatic it
  2023-02-09 10:29 ` [PATCH v13 02/12] mm: Pass info, not iter, into filemap_get_pages() and unstatic it David Howells
@ 2023-02-13  8:22   ` Christoph Hellwig
  0 siblings, 0 replies; 35+ messages in thread
From: Christoph Hellwig @ 2023-02-13  8:22 UTC (permalink / raw)
  To: David Howells
  Cc: Jens Axboe, Al Viro, Christoph Hellwig, Matthew Wilcox, Jan Kara,
	Jeff Layton, David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, linux-fsdevel, linux-block, linux-kernel, linux-mm,
	Christoph Hellwig, John Hubbard

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v13 03/12] splice: Do splice read from a buffered file without using ITER_PIPE
  2023-02-09 10:29 ` [PATCH v13 03/12] splice: Do splice read from a buffered file without using ITER_PIPE David Howells
@ 2023-02-13  8:28   ` Christoph Hellwig
  2023-02-13 10:11   ` David Howells
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 35+ messages in thread
From: Christoph Hellwig @ 2023-02-13  8:28 UTC (permalink / raw)
  To: David Howells
  Cc: Jens Axboe, Al Viro, Christoph Hellwig, Matthew Wilcox, Jan Kara,
	Jeff Layton, David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, linux-fsdevel, linux-block, linux-kernel, linux-mm,
	Christoph Hellwig, John Hubbard

> The code is loosely based on filemap_read() and might belong in
> mm/filemap.c with that as it needs to use filemap_get_pages().

Yes, I thunk it should go into filemap.c

> +	while (spliced < size &&
> +	       !pipe_full(pipe->head, pipe->tail, pipe->max_usage)) {
> +		struct pipe_buffer *buf = &pipe->bufs[pipe->head & (pipe->ring_size - 1)];

Can you please facto this calculation, that is also duplicated in patch
one into a helper?

static inline struct pipe_buffer *pipe_head_buf(struct pipe_inode_info *pipe)
{
	return &pipe->bufs[pipe->head & (pipe->ring_size - 1)];
}

> +	struct folio_batch fbatch;
> +	size_t total_spliced = 0, used, npages;
> +	loff_t isize, end_offset;
> +	bool writably_mapped;
> +	int i, error = 0;
> +
> +	struct kiocb iocb = {

Why the empty line before this declaration?

> +		.ki_filp	= in,
> +		.ki_pos		= *ppos,
> +	};

Also why doesn't this use init_sync_kiocb?

>  	if (in->f_flags & O_DIRECT)
>  		return generic_file_direct_splice_read(in, ppos, pipe, len, flags);
> +	return generic_file_buffered_splice_read(in, ppos, pipe, len, flags);

Btw, can we drop the verbose generic_file_ prefix here?

generic_file_buffered_splice_read really should be filemap_splice_read
and be in filemap.c.  generic_file_direct_splice_read I'd just name
direct_splice_read.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v13 04/12] iov_iter: Kill ITER_PIPE
  2023-02-09 10:29 ` [PATCH v13 04/12] iov_iter: Kill ITER_PIPE David Howells
@ 2023-02-13  8:28   ` Christoph Hellwig
  0 siblings, 0 replies; 35+ messages in thread
From: Christoph Hellwig @ 2023-02-13  8:28 UTC (permalink / raw)
  To: David Howells
  Cc: Jens Axboe, Al Viro, Christoph Hellwig, Matthew Wilcox, Jan Kara,
	Jeff Layton, David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, linux-fsdevel, linux-block, linux-kernel, linux-mm,
	Christoph Hellwig, John Hubbard

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v13 03/12] splice: Do splice read from a buffered file without using ITER_PIPE
  2023-02-09 10:29 ` [PATCH v13 03/12] splice: Do splice read from a buffered file without using ITER_PIPE David Howells
  2023-02-13  8:28   ` Christoph Hellwig
@ 2023-02-13 10:11   ` David Howells
  2023-02-13 10:18     ` Christoph Hellwig
  2023-02-13 11:15     ` David Howells
  2023-02-13 18:06   ` Guenter Roeck
  2023-02-13 22:43   ` David Howells
  3 siblings, 2 replies; 35+ messages in thread
From: David Howells @ 2023-02-13 10:11 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: dhowells, Jens Axboe, Al Viro, Matthew Wilcox, Jan Kara,
	Jeff Layton, David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, linux-fsdevel, linux-block, linux-kernel, linux-mm,
	Christoph Hellwig, John Hubbard

Christoph Hellwig <hch@infradead.org> wrote:

> Also why doesn't this use init_sync_kiocb?

I'm not sure I want ki_flags.

> >  	if (in->f_flags & O_DIRECT)
> >  		return generic_file_direct_splice_read(in, ppos, pipe, len, flags);
> > +	return generic_file_buffered_splice_read(in, ppos, pipe, len, flags);
> 
> Btw, can we drop the verbose generic_file_ prefix here?

Probably.  Note that at some point cifs, for example, running in "unbuffered"
mode might want to call [generic_file_]direct_splice_read() directly.

David


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v13 03/12] splice: Do splice read from a buffered file without using ITER_PIPE
  2023-02-13 10:11   ` David Howells
@ 2023-02-13 10:18     ` Christoph Hellwig
  2023-02-13 11:15     ` David Howells
  1 sibling, 0 replies; 35+ messages in thread
From: Christoph Hellwig @ 2023-02-13 10:18 UTC (permalink / raw)
  To: David Howells
  Cc: Christoph Hellwig, Jens Axboe, Al Viro, Matthew Wilcox, Jan Kara,
	Jeff Layton, David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, linux-fsdevel, linux-block, linux-kernel, linux-mm,
	Christoph Hellwig, John Hubbard

On Mon, Feb 13, 2023 at 10:11:01AM +0000, David Howells wrote:
> Christoph Hellwig <hch@infradead.org> wrote:
> 
> > Also why doesn't this use init_sync_kiocb?
> 
> I'm not sure I want ki_flags.

Why?

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v13 03/12] splice: Do splice read from a buffered file without using ITER_PIPE
  2023-02-13 10:11   ` David Howells
  2023-02-13 10:18     ` Christoph Hellwig
@ 2023-02-13 11:15     ` David Howells
  2023-02-13 14:44       ` Christoph Hellwig
  1 sibling, 1 reply; 35+ messages in thread
From: David Howells @ 2023-02-13 11:15 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: dhowells, Jens Axboe, Al Viro, Matthew Wilcox, Jan Kara,
	Jeff Layton, David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, linux-fsdevel, linux-block, linux-kernel, linux-mm,
	Christoph Hellwig, John Hubbard

Christoph Hellwig <hch@infradead.org> wrote:

> > > Also why doesn't this use init_sync_kiocb?
> > 
> > I'm not sure I want ki_flags.
> 
> Why?

I'm not sure I want ki_flags setting from f_iocb_flags I should've said.  I'm
not sure how the IOCB_* flags that I import from there will affect the
operation of the synchronous read splice.  IOCB_NOWAIT, for example, or, for
that matter, IOCB_APPEND.

David


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v13 03/12] splice: Do splice read from a buffered file without using ITER_PIPE
  2023-02-13 11:15     ` David Howells
@ 2023-02-13 14:44       ` Christoph Hellwig
  0 siblings, 0 replies; 35+ messages in thread
From: Christoph Hellwig @ 2023-02-13 14:44 UTC (permalink / raw)
  To: David Howells
  Cc: Christoph Hellwig, Jens Axboe, Al Viro, Matthew Wilcox, Jan Kara,
	Jeff Layton, David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, linux-fsdevel, linux-block, linux-kernel, linux-mm,
	Christoph Hellwig, John Hubbard

On Mon, Feb 13, 2023 at 11:15:37AM +0000, David Howells wrote:
> I'm not sure I want ki_flags setting from f_iocb_flags I should've said.  I'm
> not sure how the IOCB_* flags that I import from there will affect the
> operation of the synchronous read splice.

The same way as they did in the old ITER_PIPE based
generic_file_splice_read that uses init_sync_kiocb?  And if there's any
questions about them we need to do a deep audit.

> IOCB_NOWAIT, for example, or, for

I'd expect a set IOCB_NOWAIT to make the function return -EAGAIN
when it has to block.

> that matter, IOCB_APPEND.

IOCB_APPEND has no effect on reads of any kind.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v13 03/12] splice: Do splice read from a buffered file without using ITER_PIPE
  2023-02-09 10:29 ` [PATCH v13 03/12] splice: Do splice read from a buffered file without using ITER_PIPE David Howells
  2023-02-13  8:28   ` Christoph Hellwig
  2023-02-13 10:11   ` David Howells
@ 2023-02-13 18:06   ` Guenter Roeck
  2023-02-13 22:43   ` David Howells
  3 siblings, 0 replies; 35+ messages in thread
From: Guenter Roeck @ 2023-02-13 18:06 UTC (permalink / raw)
  To: David Howells
  Cc: Jens Axboe, Al Viro, Christoph Hellwig, Matthew Wilcox, Jan Kara,
	Jeff Layton, David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, linux-fsdevel, linux-block, linux-kernel, linux-mm,
	Christoph Hellwig, John Hubbard

Hi,

On Thu, Feb 09, 2023 at 10:29:45AM +0000, David Howells wrote:
> Provide a function to do splice read from a buffered file, pulling the
> folios out of the pagecache directly by calling filemap_get_pages() to do
> any required reading and then pasting the returned folios into the pipe.
> 
> A helper function is provided to do the actual folio pasting and will
> handle multipage folios by splicing as many of the relevant subpages as
> will fit into the pipe.
> 
> The ITER_BVEC-based splicing previously added is then only used for
> splicing from O_DIRECT files.
> 
> The code is loosely based on filemap_read() and might belong in
> mm/filemap.c with that as it needs to use filemap_get_pages().
> 
> With this, ITER_PIPE is no longer used.
> 
> Signed-off-by: David Howells <dhowells@redhat.com>

With this patch in the tree, the "collie" and "mps2" qemu emulations
crash for me. Crash logs are attached. I also attached the bisect log
for "collie".

Unfortunately I can not revert the patch to confirm because the revert
results in compile failures.

Guenter
---
bisect log

# bad: [09e41676e35ab06e4bce8870ea3bf1f191c3cb90] Add linux-next specific files for 20230213
# good: [4ec5183ec48656cec489c49f989c508b68b518e3] Linux 6.2-rc7
git bisect start 'HEAD' 'v6.2-rc7'
# good: [8b065aee8dfbecc978324b204fc897168c9adcd0] Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/cryptodev-2.6.git
git bisect good 8b065aee8dfbecc978324b204fc897168c9adcd0
# bad: [72655d7bf4966cc46ac85ef74b26eb74e251ae4a] Merge branch 'rcu/next' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git
git bisect bad 72655d7bf4966cc46ac85ef74b26eb74e251ae4a
# good: [55461ffd2b7ee0a8fe4a1f98ae6f4a33771e8193] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound.git
git bisect good 55461ffd2b7ee0a8fe4a1f98ae6f4a33771e8193
# bad: [0f1bf464790dad200077e97d35cd8bb9dd7b8341] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/sre/linux-power-supply.git
git bisect bad 0f1bf464790dad200077e97d35cd8bb9dd7b8341
# good: [c72ebd41e0737e1f1d30dc6eb3d167e8d16dcc3a] Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input.git
git bisect good c72ebd41e0737e1f1d30dc6eb3d167e8d16dcc3a
# bad: [501053535caca01f20a9323d3c8dec9ecb7a06b1] Merge branch 'for-6.3/iov-extract' into for-next
git bisect bad 501053535caca01f20a9323d3c8dec9ecb7a06b1
# good: [efde918ac66958c568926120841e7692b1e9bd9d] rxrpc: use bvec_set_page to initialize a bvec
git bisect good efde918ac66958c568926120841e7692b1e9bd9d
# good: [6938b812a638d9f02d3eb4fd07c7aab4fd44076d] Merge branch 'for-6.3/io_uring' into for-next
git bisect good 6938b812a638d9f02d3eb4fd07c7aab4fd44076d
# good: [1972d038a5401781377d3ce2d901bf7763a43589] ublk: pass NULL to blk_mq_alloc_disk() as queuedata
git bisect good 1972d038a5401781377d3ce2d901bf7763a43589
# good: [f37bf75ca73d523ebaa7ceb44c45d8ecd05374fe] block, bfq: cleanup 'bfqg->online'
git bisect good f37bf75ca73d523ebaa7ceb44c45d8ecd05374fe
# bad: [34c5b3634708864d5845cbadad03833c30051e0b] iomap: Don't get an reference on ZERO_PAGE for direct I/O block zeroing
git bisect bad 34c5b3634708864d5845cbadad03833c30051e0b
# bad: [d9722a47571104f7fa1eeb5ec59044d3607c6070] splice: Do splice read from a buffered file without using ITER_PIPE
git bisect bad d9722a47571104f7fa1eeb5ec59044d3607c6070
# good: [cd119d2fa647945d63941d3fd64f4acc9f6eec24] mm: Pass info, not iter, into filemap_get_pages() and unstatic it
git bisect good cd119d2fa647945d63941d3fd64f4acc9f6eec24
# first bad commit: [d9722a47571104f7fa1eeb5ec59044d3607c6070] splice: Do splice read from a buffered file without using ITER_PIPE

---
arm:collie crash

8<--- cut here ---
Unable to handle kernel NULL pointer dereference at virtual address 00000000 when execute
[00000000] *pgd=c14b4831c14b4831, *pte=c14b4000, *ppte=e09b5a14
8<--- cut here ---
Unhandled fault: page domain fault (0x01b) at 0x00000000
[00000000] *pgd=c14b4831, *pte=00000000, *ppte=00000000
Internal error: : 1b [#1] ARM
CPU: 0 PID: 58 Comm: cat Not tainted 6.2.0-rc7-next-20230213 #1
Hardware name: Sharp-Collie
PC is at copy_from_kernel_nofault+0x124/0x23c
LR is at 0xe09b5a84
pc : [<c009d894>]    lr : [<e09b5a84>]    psr: 20000193
sp : e09b5a4c  ip : e09b5a84  fp : e09b5a80
r10: 00000214  r9 : 60000113  r8 : 00000004
r7 : c08b91fc  r6 : e09b5a84  r5 : 00000004  r4 : 00000000
r3 : 00000001  r2 : c14a6ca0  r1 : 00000000  r0 : 00000001
Flags: nzCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment none
Control: 0000717f  Table: c14d4000  DAC: 00000051
Register r0 information: non-paged memory
Register r1 information: NULL pointer
Register r2 information: slab task_struct start c14a6ca0 pointer offset 0 size 3232
Register r3 information: non-paged memory
Register r4 information: NULL pointer
Register r5 information: non-paged memory
Register r6 information: 2-page vmalloc region starting at 0xe09b4000 allocated at kernel_clone+0x78/0x474
Register r7 information: non-slab/vmalloc memory
Register r8 information: non-paged memory
Register r9 information: non-paged memory
Register r10 information: non-paged memory
Register r11 information: 2-page vmalloc region starting at 0xe09b4000 allocated at kernel_clone+0x78/0x474
Register r12 information: 2-page vmalloc region starting at 0xe09b4000 allocated at kernel_clone+0x78/0x474
Process cat (pid: 58, stack limit = 0xfabdb807)
Stack: (0xe09b5a4c to 0xe09b6000)
5a40:                            e09b5a70 e09b5a5c c005a4e4 c087b82c e09b5c14
5a60: e09b5c14 00000000 c08b91fc 80000005 60000113 e09b5a98 e09b5a84 c000cfa8
5a80: c009d77c e09b5ab0 c087b82c e09b5ac0 e09b5a9c c06b8020 c000cf84 c149a840
5aa0: c07df1e0 e09b5c14 c07df1c8 c087f3e4 c08b91fc e09b5b40 e09b5ac4 c000cc28
5ac0: c06b8010 e09b5af0 e09b5ad4 c005e7a4 c005d0b8 e09b5af8 e09b5ae4 c005d0d4
5ae0: c14b4000 e09b5b08 e09b5af4 c06de218 c005e72c e09b5b10 c087b82c e09b5b40
5b00: e09b5b1c c06dcba8 c06de1f4 c07ded54 c087b82c 00000000 80000005 00000000
5b20: c149a840 c07df1e0 c149a8b8 00000004 00000214 e09b5b58 e09b5b44 c000dcec
5b40: c000cb98 80000005 e09b5c14 e09b5b80 e09b5b5c c000dd78 c000dc84 e09b5c14
5b60: e09b5b6c e09b5c14 80000005 00000000 c149a840 e09b5bc0 e09b5b84 c000df6c
5b80: c000dd2c 00000000 c087ba8c c14a6ca0 00010000 c08bacc0 00000005 e09b5c14
5ba0: c087f688 c000e184 00000000 c14a6ca0 c149f158 e09b5bd8 e09b5bc4 c000e22c
5bc0: c000ddc0 00000005 e09b5c14 e09b5c10 e09b5bdc c000e33c c000e190 c14a6ca0
5be0: c00526a4 c14a6ca0 c088607c 60000013 00000000 20000013 ffffffff e09b5c48
5c00: dfb10900 e09b5cb4 e09b5c14 c0008e10 c000e304 c14b3a20 dfb10900 dfb10900
5c20: 60000093 dfb10900 00000000 c14b3a20 60000013 dfb10900 00000000 c149f158
5c40: e09b5cb4 e09b5cb8 e09b5c60 c0093a7c 00000000 20000013 ffffffff 00000051
5c60: c00a5a88 00000cc0 dfb10900 00000000 00000cc0 c149f128 e09b5cb4 e09b5c88
5c80: c009507c c087b82c e09b5c90 e09b5d94 e09b5d68 00000000 c149f128 dfb10900
5ca0: 00000000 c149f158 e09b5d3c e09b5cb8 c0099714 c0093a20 dfff1d14 c14a7258
5cc0: c00e28bc 00000002 e09b5d1c e09b5cd8 00010000 00000001 c14b3af0 c14b3a20
5ce0: c14a6ca0 00000010 00000001 c14b3a20 c149f128 c14b3af0 00000000 00000000
5d00: 00000000 00000000 00000000 c087b82c 80000113 00000000 00000000 e09b5f18
5d20: 00010000 c14b5600 00000000 00010000 e09b5e04 e09b5d40 c013019c c0099324
5d40: 60000113 00000000 c14a6ca0 c14a6ca0 e09b5d84 c14b3a20 c087ba8c c14a6ca0
5d60: 00000000 00000000 c14b3a20 00000000 00000000 00000000 00000000 00000000
5d80: 00000000 00000000 00000000 00000000 fffffffe ffff0000 000001a9 00000000
5da0: c832b34f c088607c 60000013 00000000 e09b5f18 c0a90ec8 01000000 c14b5600
5dc0: e09b5e04 e09b5dd0 c0055788 c0054dec 00000000 c087b82c c0102d90 c14b5600
5de0: 00000000 e09b5f18 e09b5f18 c14b3a20 00000000 00010000 e09b5e94 e09b5e08
5e00: c0130ec0 c01300d0 00000001 00000000 c0102d90 c14b5600 e09b5e9c e09b5e28
5e20: c06ef2a8 c005582c 00000001 00000000 c0102d90 e09b5e40 c0055f60 c0052e4c
5e40: e09b5e5c e09b5e50 c14b5634 00000001 00000001 00000002 c000f22c c149a894
5e60: 00000000 c087b82c c14d1e00 00010000 c14b3a20 c14b5600 e09b5f18 c0130e48
5e80: 01000000 00000001 e09b5ec4 e09b5e98 c012fee0 c0130e54 00000000 c06ef210
5ea0: c0102d90 00000000 c14b5600 c14b3a20 e09b5f18 00000000 e09b5ef4 e09b5ec8
5ec0: c0131d94 c012fe50 00000000 c0052e4c c14b3a20 00000000 00000000 01000000
5ee0: c14b3a20 c0c2aa20 e09b5f5c e09b5ef8 c00f72d8 c0131d28 00000000 c149a8b8
5f00: 00000002 00000255 c14b5600 00000000 c0c2aa20 c00f8f4c 00000000 00000000
5f20: 00000000 00000000 e09b5f74 c087b82c c000df18 00000000 00000000 00000003
5f40: 000000ef c0008420 c14a6ca0 01000000 e09b5fa4 e09b5f60 c00f8f4c c00f7170
5f60: 7fffffff 00000000 e09b5fac e09b5f78 c000e278 c087b82c befeee88 01000000
5f80: 00000000 01000000 000000ef c0008420 c14a6ca0 00000000 00000000 e09b5fa8
5fa0: c0008260 c00f8e38 01000000 00000000 00000001 00000003 00000000 01000000
5fc0: 01000000 00000000 01000000 000000ef 00000001 00000001 00000000 00000000
5fe0: b6e485d0 befedc74 00019764 b6e485dc 60000010 00000001 00000000 00000000
Backtrace:
 copy_from_kernel_nofault from is_valid_bugaddr+0x30/0x7c
 r9:60000113 r8:80000005 r7:c08b91fc r6:00000000 r5:e09b5c14 r4:e09b5c14
 is_valid_bugaddr from report_bug+0x1c/0x114
 report_bug from die+0x9c/0x398
 r7:c08b91fc r6:c087f3e4 r5:c07df1c8 r4:e09b5c14
 die from die_kernel_fault+0x74/0xa8
 r10:00000214 r9:00000004 r8:c149a8b8 r7:c07df1e0 r6:c149a840 r5:00000000
 r4:80000005
 die_kernel_fault from __do_kernel_fault.part.0+0x58/0x94
 r7:e09b5c14 r4:80000005
 __do_kernel_fault.part.0 from do_page_fault+0x1b8/0x338
 r7:c149a840 r6:00000000 r5:80000005 r4:e09b5c14
 do_page_fault from do_translation_fault+0xa8/0xb0
 r10:c149f158 r9:c14a6ca0 r8:00000000 r7:c000e184 r6:c087f688 r5:e09b5c14
 r4:00000005
 do_translation_fault from do_PrefetchAbort+0x44/0x98
 r5:e09b5c14 r4:00000005
 do_PrefetchAbort from __pabt_svc+0x50/0x80
Exception stack(0xe09b5c14 to 0xe09b5c5c)
5c00:                                              c14b3a20 dfb10900 dfb10900
5c20: 60000093 dfb10900 00000000 c14b3a20 60000013 dfb10900 00000000 c149f158
5c40: e09b5cb4 e09b5cb8 e09b5c60 c0093a7c 00000000 20000013 ffffffff
 r8:dfb10900 r7:e09b5c48 r6:ffffffff r5:20000013 r4:00000000
 filemap_read_folio from filemap_get_pages+0x3fc/0x7a4
 r10:c149f158 r9:00000000 r8:dfb10900 r7:c149f128 r6:00000000 r5:e09b5d68
 r4:e09b5d94
 filemap_get_pages from generic_file_buffered_splice_read.constprop.0+0xd8/0x400
 r10:00010000 r9:00000000 r8:c14b5600 r7:00010000 r6:e09b5f18 r5:00000000
 r4:00000000
 generic_file_buffered_splice_read.constprop.0 from generic_file_splice_read+0x78/0x310
 r10:00010000 r9:00000000 r8:c14b3a20 r7:e09b5f18 r6:e09b5f18 r5:00000000
 r4:c14b5600
 generic_file_splice_read from do_splice_to+0x9c/0xbc
 r10:00000001 r9:01000000 r8:c0130e48 r7:e09b5f18 r6:c14b5600 r5:c14b3a20
 r4:00010000
 do_splice_to from splice_file_to_pipe+0x78/0x80
 r8:00000000 r7:e09b5f18 r6:c14b3a20 r5:c14b5600 r4:00000000
 splice_file_to_pipe from do_sendfile+0x174/0x59c
 r9:c0c2aa20 r8:c14b3a20 r7:01000000 r6:00000000 r5:00000000 r4:c14b3a20
 do_sendfile from sys_sendfile64+0x120/0x148
 r10:01000000 r9:c14a6ca0 r8:c0008420 r7:000000ef r6:00000003 r5:00000000
 r4:00000000
 sys_sendfile64 from ret_fast_syscall+0x0/0x44
Exception stack(0xe09b5fa8 to 0xe09b5ff0)
5fa0:                   01000000 00000000 00000001 00000003 00000000 01000000
5fc0: 01000000 00000000 01000000 000000ef 00000001 00000001 00000000 00000000
5fe0: b6e485d0 befedc74 00019764 b6e485dc
 r10:00000000 r9:c14a6ca0 r8:c0008420 r7:000000ef r6:01000000 r5:00000000
 r4:01000000
Code: e21e1003 1a000011 e3550003 9a00000f (e4943000)
---[ end trace 0000000000000000 ]---

---
arm:mps2

[    4.659693]
[    4.659693] Unhandled exception: IPSR = 00000006 LR = fffffff1
[    4.659888] CPU: 0 PID: 155 Comm: cat Tainted: G                 N 6.2.0-rc7-next-20230213 #1
[    4.660030] Hardware name: Generic DT based system
[    4.660118] PC is at 0x0
[    4.660248] LR is at filemap_read_folio+0x17/0x4e
[    4.660468] pc : [<00000000>]    lr : [<21044c97>]    psr: 0000000b
[    4.660534] sp : 2185bd10  ip : 2185bcd0  fp : 00080001
[    4.660591] r10: 21757b40  r9 : 2175eea4  r8 : 2185bdf4
[    4.660649] r7 : 2175ee88  r6 : 21757b40  r5 : 21fecb40  r4 : 00000000
[    4.660718] r3 : 00000001  r2 : 00000001  r1 : 21fecb40  r0 : 21757b40
[    4.660785] xPSR: 0000000b
[    4.661126]  filemap_read_folio from filemap_get_pages+0x127/0x36e
[    4.661247]  filemap_get_pages from generic_file_buffered_splice_read.constprop.5+0x85/0x244
[    4.661342]  generic_file_buffered_splice_read.constprop.5 from generic_file_splice_read+0x1c3/0x1e2
[    4.661436]  generic_file_splice_read from splice_file_to_pipe+0x2f/0x48
[    4.661509]  splice_file_to_pipe from do_sendfile+0x193/0x1b8
[    4.661573]  do_sendfile from sys_sendfile64+0x63/0x70
[    4.661653]  sys_sendfile64 from ret_fast_syscall+0x1/0x4c
[    4.661740] Exception stack(0x2185bfa8 to 0x2185bff0)
[    4.661854] bfa0:                   000000ef 00000000 00000001 00000003 00000000 01000000
[    4.661944] bfc0: 000000ef 00000000 21b56e48 000000ef 00000001 00000001 00000000 21b51770
[    4.662023] bfe0: 21b4e791 21b56e48 21b174c9 21b0265e

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v13 03/12] splice: Do splice read from a buffered file without using ITER_PIPE
  2023-02-09 10:29 ` [PATCH v13 03/12] splice: Do splice read from a buffered file without using ITER_PIPE David Howells
                     ` (2 preceding siblings ...)
  2023-02-13 18:06   ` Guenter Roeck
@ 2023-02-13 22:43   ` David Howells
  2023-02-13 22:51     ` Guenter Roeck
  2023-02-13 23:12     ` David Howells
  3 siblings, 2 replies; 35+ messages in thread
From: David Howells @ 2023-02-13 22:43 UTC (permalink / raw)
  To: Guenter Roeck
  Cc: dhowells, Jens Axboe, Al Viro, Christoph Hellwig, Matthew Wilcox,
	Jan Kara, Jeff Layton, David Hildenbrand, Jason Gunthorpe,
	Logan Gunthorpe, Hillf Danton, linux-fsdevel, linux-block,
	linux-kernel, linux-mm, Christoph Hellwig, John Hubbard

Guenter Roeck <linux@roeck-us.net> wrote:

> [    4.660118] PC is at 0x0
> [    4.660248] LR is at filemap_read_folio+0x17/0x4e

Do you know what the filesystem is that's being read from?

I think the problem is that there are a few filesystems/drivers that call
generic_file_splice_read() but don't have a ->read_folio().  Now most of these
can be made to call direct_splice_read() instead, leaving just coda, overlayfs
and shmem.

Coda and overlayfs can be made to pass the request down a layer.  I'm about to
look into shmem.

David


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v13 03/12] splice: Do splice read from a buffered file without using ITER_PIPE
  2023-02-13 22:43   ` David Howells
@ 2023-02-13 22:51     ` Guenter Roeck
  2023-02-13 23:12     ` David Howells
  1 sibling, 0 replies; 35+ messages in thread
From: Guenter Roeck @ 2023-02-13 22:51 UTC (permalink / raw)
  To: David Howells
  Cc: Jens Axboe, Al Viro, Christoph Hellwig, Matthew Wilcox, Jan Kara,
	Jeff Layton, David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, linux-fsdevel, linux-block, linux-kernel, linux-mm,
	Christoph Hellwig, John Hubbard

On 2/13/23 14:43, David Howells wrote:
> Guenter Roeck <linux@roeck-us.net> wrote:
> 
>> [    4.660118] PC is at 0x0
>> [    4.660248] LR is at filemap_read_folio+0x17/0x4e
> 
> Do you know what the filesystem is that's being read from?
> 
> I think the problem is that there are a few filesystems/drivers that call
> generic_file_splice_read() but don't have a ->read_folio().  Now most of these
> can be made to call direct_splice_read() instead, leaving just coda, overlayfs
> and shmem.
> 
> Coda and overlayfs can be made to pass the request down a layer.  I'm about to
> look into shmem.
> 

Both are initrd.

Guenter


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v13 03/12] splice: Do splice read from a buffered file without using ITER_PIPE
  2023-02-13 22:43   ` David Howells
  2023-02-13 22:51     ` Guenter Roeck
@ 2023-02-13 23:12     ` David Howells
  2023-02-13 23:25       ` Guenter Roeck
  1 sibling, 1 reply; 35+ messages in thread
From: David Howells @ 2023-02-13 23:12 UTC (permalink / raw)
  To: Guenter Roeck
  Cc: dhowells, Jens Axboe, Al Viro, Christoph Hellwig, Matthew Wilcox,
	Jan Kara, Jeff Layton, David Hildenbrand, Jason Gunthorpe,
	Logan Gunthorpe, Hillf Danton, linux-fsdevel, linux-block,
	linux-kernel, linux-mm, Christoph Hellwig, John Hubbard

Guenter Roeck <linux@roeck-us.net> wrote:

> Both are initrd.

Do you mean rootfs?  And, if so, is that tmpfs-based or ramfs-based?

David


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v13 03/12] splice: Do splice read from a buffered file without using ITER_PIPE
  2023-02-13 23:12     ` David Howells
@ 2023-02-13 23:25       ` Guenter Roeck
  0 siblings, 0 replies; 35+ messages in thread
From: Guenter Roeck @ 2023-02-13 23:25 UTC (permalink / raw)
  To: David Howells
  Cc: Jens Axboe, Al Viro, Christoph Hellwig, Matthew Wilcox, Jan Kara,
	Jeff Layton, David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, linux-fsdevel, linux-block, linux-kernel, linux-mm,
	Christoph Hellwig, John Hubbard

On 2/13/23 15:12, David Howells wrote:
> Guenter Roeck <linux@roeck-us.net> wrote:
> 
>> Both are initrd.
> 
> Do you mean rootfs?  And, if so, is that tmpfs-based or ramfs-based?
> 

Both are provided to the kernel using the -initrd qemu option,
which usually means that the address/location is passed to the kernel
through either a register or a data structure. I have not really paid
much attention to what the kernel is doing with that information.
It is in cpio format, so it must be decompressed, but I don't know how
it is actually handled (nor why this doesn't fail on other boots
from initrd).

Guenter


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v14 01/12] splice: Fix O_DIRECT file read splice to avoid reversion of ITER_PIPE
  2023-02-09 16:15   ` [PATCH v14 " David Howells
  2023-02-13  8:22     ` Christoph Hellwig
@ 2023-02-15 13:17     ` David Howells
  2023-02-15 14:24       ` Christoph Hellwig
  2023-02-15 15:56       ` David Howells
  2023-02-15 13:42     ` [PATCH] splice: Clean up direct_splice_read() a bit David Howells
  2023-02-15 13:47     ` David Howells
  3 siblings, 2 replies; 35+ messages in thread
From: David Howells @ 2023-02-15 13:17 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: dhowells, Matthew Wilcox, Jens Axboe, Al Viro, Jan Kara,
	Jeff Layton, David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, linux-fsdevel, linux-block, linux-kernel, linux-mm,
	syzbot+a440341a59e3b7142895, Christoph Hellwig, John Hubbard

Christoph Hellwig <hch@infradead.org> wrote:

> > +	pages = (void *)(bv + npages);
> 
> I think this cast should be to struct page **… not void *.

Yeah.  Doesn't change anything functionally, though, I think.

> > +	/* Free any pages that didn't get touched at all. */
> > +	reclaim /= PAGE_SIZE;
> 
> Any reason not to keep reclaim in PAGE_SIZE units to start with?

Probably not, but I don't want to fiddle with that right now.  I can send a
follow up patch for it.

David


^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH] splice: Clean up direct_splice_read() a bit
  2023-02-09 16:15   ` [PATCH v14 " David Howells
  2023-02-13  8:22     ` Christoph Hellwig
  2023-02-15 13:17     ` David Howells
@ 2023-02-15 13:42     ` David Howells
  2023-02-15 13:47     ` David Howells
  3 siblings, 0 replies; 35+ messages in thread
From: David Howells @ 2023-02-15 13:42 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: dhowells, Matthew Wilcox, Jens Axboe, Al Viro, Jan Kara,
	Jeff Layton, David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, linux-fsdevel, linux-block, linux-kernel, linux-mm,
	syzbot+a440341a59e3b7142895, Christoph Hellwig, John Hubbard

How about the attached?  I won't fold it down for the moment, but it could be
pushed along later.

David
---
splice: Clean up direct_splice_read() a bit

Do a couple of cleanups to direct_splice_read():

 (1) Cast to struct page **, not void *.

 (2) Simplify the calculation of the number of pages to keep/reclaim in
     direct_splice_read().

Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Christoph Hellwig <hch@lst.de>
cc: Jens Axboe <axboe@kernel.dk>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: David Hildenbrand <david@redhat.com>
cc: John Hubbard <jhubbard@nvidia.com>
cc: linux-mm@kvack.org
cc: linux-block@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org

diff --git a/fs/splice.c b/fs/splice.c
index 9e798c901087..572d3e2a669a 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -295,7 +295,7 @@ ssize_t direct_splice_read(struct file *in, loff_t *ppos,
 	struct kiocb kiocb;
 	struct page **pages;
 	ssize_t ret;
-	size_t used, npages, chunk, remain, reclaim;
+	size_t used, npages, chunk, remain, keep = 0;
 	int i;
 
 	/* Work out how much data we can actually add into the pipe */
@@ -332,11 +332,8 @@ ssize_t direct_splice_read(struct file *in, loff_t *ppos,
 	kiocb.ki_pos = *ppos;
 	ret = call_read_iter(in, &kiocb, &to);
 
-	reclaim = npages * PAGE_SIZE;
-	remain = 0;
 	if (ret > 0) {
-		reclaim -= ret;
-		remain = ret;
+		keep = DIV_ROUND_UP(ret, PAGE_SIZE);
 		*ppos = kiocb.ki_pos;
 		file_accessed(in);
 	} else if (ret < 0) {
@@ -349,14 +346,12 @@ ssize_t direct_splice_read(struct file *in, loff_t *ppos,
 	}
 
 	/* Free any pages that didn't get touched at all. */
-	reclaim /= PAGE_SIZE;
-	if (reclaim) {
-		npages -= reclaim;
-		release_pages(pages + npages, reclaim);
-	}
+	if (keep < npages)
+		release_pages(pages + keep, npages - keep);
 
 	/* Push the remaining pages into the pipe. */
-	for (i = 0; i < npages; i++) {
+	remain = ret;
+	for (i = 0; i < keep; i++) {
 		struct pipe_buffer *buf = pipe_head_buf(pipe);
 
 		chunk = min_t(size_t, remain, PAGE_SIZE);


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH] splice: Clean up direct_splice_read() a bit
  2023-02-09 16:15   ` [PATCH v14 " David Howells
                       ` (2 preceding siblings ...)
  2023-02-15 13:42     ` [PATCH] splice: Clean up direct_splice_read() a bit David Howells
@ 2023-02-15 13:47     ` David Howells
  3 siblings, 0 replies; 35+ messages in thread
From: David Howells @ 2023-02-15 13:47 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: dhowells, Matthew Wilcox, Jens Axboe, Al Viro, Jan Kara,
	Jeff Layton, David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, linux-fsdevel, linux-block, linux-kernel, linux-mm,
	syzbot+a440341a59e3b7142895, Christoph Hellwig, John Hubbard

I forgot to commit the cast change too.  Try the attached instead.

David
---
splice: Clean up direct_splice_read() a bit

Do a couple of cleanups to direct_splice_read():

 (1) Cast to struct page **, not void *.

 (2) Simplify the calculation of the number of pages to keep/reclaim in
     direct_splice_read().

Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Christoph Hellwig <hch@lst.de>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: David Hildenbrand <david@redhat.com>
cc: John Hubbard <jhubbard@nvidia.com>
cc: linux-mm@kvack.org
cc: linux-block@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
---

diff --git a/fs/splice.c b/fs/splice.c
index 9e798c901087..e97f9aa30717 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -295,7 +295,7 @@ ssize_t direct_splice_read(struct file *in, loff_t *ppos,
 	struct kiocb kiocb;
 	struct page **pages;
 	ssize_t ret;
-	size_t used, npages, chunk, remain, reclaim;
+	size_t used, npages, chunk, remain, keep = 0;
 	int i;
 
 	/* Work out how much data we can actually add into the pipe */
@@ -309,7 +309,7 @@ ssize_t direct_splice_read(struct file *in, loff_t *ppos,
 	if (!bv)
 		return -ENOMEM;
 
-	pages = (void *)(bv + npages);
+	pages = (struct page **)(bv + npages);
 	npages = alloc_pages_bulk_array(GFP_USER, npages, pages);
 	if (!npages) {
 		kfree(bv);
@@ -332,11 +332,8 @@ ssize_t direct_splice_read(struct file *in, loff_t *ppos,
 	kiocb.ki_pos = *ppos;
 	ret = call_read_iter(in, &kiocb, &to);
 
-	reclaim = npages * PAGE_SIZE;
-	remain = 0;
 	if (ret > 0) {
-		reclaim -= ret;
-		remain = ret;
+		keep = DIV_ROUND_UP(ret, PAGE_SIZE);
 		*ppos = kiocb.ki_pos;
 		file_accessed(in);
 	} else if (ret < 0) {
@@ -349,14 +346,12 @@ ssize_t direct_splice_read(struct file *in, loff_t *ppos,
 	}
 
 	/* Free any pages that didn't get touched at all. */
-	reclaim /= PAGE_SIZE;
-	if (reclaim) {
-		npages -= reclaim;
-		release_pages(pages + npages, reclaim);
-	}
+	if (keep < npages)
+		release_pages(pages + keep, npages - keep);
 
 	/* Push the remaining pages into the pipe. */
-	for (i = 0; i < npages; i++) {
+	remain = ret;
+	for (i = 0; i < keep; i++) {
 		struct pipe_buffer *buf = pipe_head_buf(pipe);
 
 		chunk = min_t(size_t, remain, PAGE_SIZE);


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH v14 01/12] splice: Fix O_DIRECT file read splice to avoid reversion of ITER_PIPE
  2023-02-15 13:17     ` David Howells
@ 2023-02-15 14:24       ` Christoph Hellwig
  2023-02-15 15:56       ` David Howells
  1 sibling, 0 replies; 35+ messages in thread
From: Christoph Hellwig @ 2023-02-15 14:24 UTC (permalink / raw)
  To: David Howells
  Cc: Christoph Hellwig, Matthew Wilcox, Jens Axboe, Al Viro, Jan Kara,
	Jeff Layton, David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, linux-fsdevel, linux-block, linux-kernel, linux-mm,
	syzbot+a440341a59e3b7142895, Christoph Hellwig, John Hubbard

On Wed, Feb 15, 2023 at 01:17:56PM +0000, David Howells wrote:
> Probably not, but I don't want to fiddle with that right now.  I can send a
> follow up patch for it.

Honestly, I think this rush for 6.3 inclusion is a really bad idea.

This series fundamentally changes how splice reads work, and has only
been out for about a week.  It hasn't even been Cc'ed to Al and Linus
which generally have a good knowledge of the splice code and an opinion
on it.

I think it is a good change, but I'd feel much more comfortable with
it for the next merge window rather than rushing it.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v14 01/12] splice: Fix O_DIRECT file read splice to avoid reversion of ITER_PIPE
  2023-02-15 13:17     ` David Howells
  2023-02-15 14:24       ` Christoph Hellwig
@ 2023-02-15 15:56       ` David Howells
  1 sibling, 0 replies; 35+ messages in thread
From: David Howells @ 2023-02-15 15:56 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe, smfrench
  Cc: dhowells, Matthew Wilcox, Al Viro, Jan Kara, Jeff Layton,
	David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe, Hillf Danton,
	linux-fsdevel, linux-block, linux-kernel, linux-mm,
	syzbot+a440341a59e3b7142895, Christoph Hellwig, John Hubbard

Christoph Hellwig <hch@infradead.org> wrote:

> On Wed, Feb 15, 2023 at 01:17:56PM +0000, David Howells wrote:
> > Probably not, but I don't want to fiddle with that right now.  I can send a
> > follow up patch for it.
> 
> Honestly, I think this rush for 6.3 inclusion is a really bad idea.
>
> This series fundamentally changes how splice reads work, and has only
> been out for about a week.  It hasn't even been Cc'ed to Al

Sorry, what?!  Al has been To'd or cc'd on every patch.

> and Linus

I don't know that it's necessary to cc Linus on everything.  Jens is the
splice maintainer, I thought.

> which generally have a good knowledge of the splice code and an opinion
> on it.
> 
> I think it is a good change, but I'd feel much more comfortable with
> it for the next merge window rather than rushing it.

The lack of iov_iter_extract_pages() is blocking other things I want to work
on - and will push those out another 3 months further beyond this.

I'm fine with dropping the block layer changes and most of the splice changes,
but I do want to try to get patches 1-3, 10 and 11:

 mm: Pass info, not iter, into filemap_get_pages()
 splice: Add a func to do a splice from a buffered file without ITER_PIPE
 splice: Add a func to do a splice from an O_DIRECT file without ITER_PIPE
 iov_iter: Add a function to extract a page list from an iterator
 iov_iter: Define flags to qualify page extraction.

upstream through the cifs tree if you, Jens and Steve French have no
objection, with my cifs iteratorisation patches on top.  It shouldn't affect
anything other than cifs in this merge window, barring the change to the flags
to iov_iter_get_pages*().

David


^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2023-02-15 15:57 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-02-09 10:29 [PATCH v13 00/12] iov_iter: Improve page extraction (pin or just list) David Howells
2023-02-09 10:29 ` [PATCH v13 01/12] splice: Fix O_DIRECT file read splice to avoid reversion of ITER_PIPE David Howells
2023-02-09 14:53   ` Matthew Wilcox
2023-02-09 15:06   ` David Howells
2023-02-09 16:15   ` [PATCH v14 " David Howells
2023-02-13  8:22     ` Christoph Hellwig
2023-02-15 13:17     ` David Howells
2023-02-15 14:24       ` Christoph Hellwig
2023-02-15 15:56       ` David Howells
2023-02-15 13:42     ` [PATCH] splice: Clean up direct_splice_read() a bit David Howells
2023-02-15 13:47     ` David Howells
2023-02-09 10:29 ` [PATCH v13 02/12] mm: Pass info, not iter, into filemap_get_pages() and unstatic it David Howells
2023-02-13  8:22   ` Christoph Hellwig
2023-02-09 10:29 ` [PATCH v13 03/12] splice: Do splice read from a buffered file without using ITER_PIPE David Howells
2023-02-13  8:28   ` Christoph Hellwig
2023-02-13 10:11   ` David Howells
2023-02-13 10:18     ` Christoph Hellwig
2023-02-13 11:15     ` David Howells
2023-02-13 14:44       ` Christoph Hellwig
2023-02-13 18:06   ` Guenter Roeck
2023-02-13 22:43   ` David Howells
2023-02-13 22:51     ` Guenter Roeck
2023-02-13 23:12     ` David Howells
2023-02-13 23:25       ` Guenter Roeck
2023-02-09 10:29 ` [PATCH v13 04/12] iov_iter: Kill ITER_PIPE David Howells
2023-02-13  8:28   ` Christoph Hellwig
2023-02-09 10:29 ` [PATCH v13 05/12] iov_iter: Define flags to qualify page extraction David Howells
2023-02-09 10:29 ` [PATCH v13 06/12] iov_iter: Add a function to extract a page list from an iterator David Howells
2023-02-09 10:29 ` [PATCH v13 07/12] iomap: Don't get an reference on ZERO_PAGE for direct I/O block zeroing David Howells
2023-02-09 10:29 ` [PATCH v13 08/12] block: Fix bio_flagged() so that gcc can better optimise it David Howells
2023-02-09 10:29 ` [PATCH v13 09/12] block: Replace BIO_NO_PAGE_REF with BIO_PAGE_REFFED with inverted logic David Howells
2023-02-09 10:29 ` [PATCH v13 10/12] block: Add BIO_PAGE_PINNED and associated infrastructure David Howells
2023-02-09 10:29 ` [PATCH v13 11/12] block: Convert bio_iov_iter_get_pages to use iov_iter_extract_pages David Howells
2023-02-09 10:29 ` [PATCH v13 12/12] block: convert bio_map_user_iov " David Howells
2023-02-10 22:31 ` [PATCH v13 00/12] iov_iter: Improve page extraction (pin or just list) Jens Axboe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).