Linux block layer

Linux block layer
 help / color / mirror / Atom feed

* Re: [PATCH block/for-6.2-fixes] block: Drop spurious might_sleep() from blk_put_queue()
From: Dan Carpenter @ 2023-01-07  8:36 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Tejun Heo, Jens Axboe, Christoph Hellwig, linux-block,
	linux-kernel
In-Reply-To: <Y7iIWA6h88cYjhcO@bombadil.infradead.org>

On Fri, Jan 06, 2023 at 12:45:12PM -0800, Luis Chamberlain wrote:
> > Signed-off-by: Tejun Heo <tj@kernel.org>
> > Reported-by: Dan Carpenter <error27@gmail.com>
> > Link: https://lkml.kernel.org/r/Y7g3L6fntnTtOm63@kili
> > Cc: Christoph Hellwig <hch@lst.de>
> > Cc: Luis Chamberlain <mcgrof@kernel.org>
> > Fixes: e8c7d14ac6c3 ("block: revert back to synchronous request_queue removal") # v5.9+
> 
> *tons* has changed since e8c7d14ac6c3 and so the bots might think that
> *if* this patch is applied upstream it is justified for older kernels
> and I don't think that's yet been verified and doubt it.

If you enable the correct debug option then the might_sleep() causes a
stack trace.  Eventually syzbot will find it.

I would backport it.

regards,
dan carpenter

^ permalink raw reply

* Re: [External] Re: [PATCH v3] blk-throtl: Introduce sync and async queues for blk-throtl
From: hanjinke @ 2023-01-07  4:44 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jan Kara, Michal Koutný, josef, axboe, cgroups, linux-block,
	linux-kernel, yinxin.x
In-Reply-To: <Y7hlX4T1UOmQHiGf@slm.duckdns.org>

在 2023/1/7 上午2:15, Tejun Heo 写道:
> Hello,
> 
> On Sat, Jan 07, 2023 at 02:07:38AM +0800, hanjinke wrote:
>> In our internal scenario, iocost has been deployed as the main io isolation
>> method and is gradually spreading。
> 
> Ah, glad to hear. If you don't mind sharing, how are you configuring iocost
> currently? How do you derive the parameters?
> 

For cost.model setting, We first use the tools iocost provided to test 
the benchmark model parameters of different types of disks online, and 
then save these benchmark parameters to a parametric Model Table. During 
the deployment process, pull and set the corresponding model parameters 
according to the type of disk.

The setting of cost.qos should be considered slightly more，we need to 
make some compromises between overall disk throughput and io latency.
The average disk utilization of the entire disk on a specific business 
and the RLA（if it is io sensitive） of key businesses will be taken as 
important input considerations. The cost.qos will be dynamically 
fine-tuned according to the health status monitoring of key businesses.

For cost.weight setting, high-priority services  will gain greater 
advantages through weight settings to deal with a large number of io 
requests in a short period of time. It works fine as work-conservation
of iocost works well according to our observation.

These practices can be done better and I look forward to your better 
suggestions.

> blk-throttle has a lot of issues which may be difficult to address. Even the
> way it's configured is pretty difficult to scale across different hardware /
> application combinations and we've neglected its control performance and
> behavior (like handling of shared IOs) for quite a while.
> 
> While iocost's work-conserving control does address a lot of the use cases
> we see today, it's likely that we'll need hard limits more in the future
> too. I've been thinking about implementing io.max on top of iocost. There
> are some challenges around dynamic vrate adj semantics but it's kinda
> attractive because iocost already has the concept of total device capacity.

Indeed in our multi-tenancy scenario, the hard limits are necessary.

Jinke
Thanks.

^ permalink raw reply

* Re: [External] [LSF/MM/BPF BoF] Session for Zoned Storage 2023
From: Viacheslav A.Dubeyko @ 2023-01-07  1:56 UTC (permalink / raw)
  To: Viacheslav Dubeyko
  Cc: Luis Chamberlain, Linux FS Devel, linux-block,
	Javier González, Matias Bjørling, Bart Van Assche,
	Damien Le Moal, Adam Manzanares, Hans Holmberg, lsf-pc
In-Reply-To: <4CC4F55E-17B3-47E2-A8C5-9098CCEB65D6@dubeyko.com>

> On Jan 6, 2023, at 11:30 AM, Viacheslav Dubeyko <slava@dubeyko.com> wrote:
> 
> 
> 
>> On Jan 6, 2023, at 11:18 AM, Luis Chamberlain <mcgrof@kernel.org> wrote:
>> 
>> On Fri, Jan 06, 2023 at 11:17:19AM -0800, Viacheslav Dubeyko wrote:
>>> Hello,
>>> 
>>> As far as I can see, I have two topics for discussion.
>> 
>> What's that?
> 
> I am going to share these topics in separate emails. :)
> 
> (1) I am going to share SSDFS patchset soon. And topic is:
> SSDFS + ZNS SSD: deterministic architecture decreasing TCO cost of data infrastructure.
> 
> (2) Second topic is:
> How to achieve better lifetime and performance of caching layer with ZNS SSD?
> 

I think we can consider such discussions:
(1) I assume that we still need to discuss PO2 zone sizes?
(2) Status of ZNS SSD support in F2FS, btrfs (maybe, bcachefs and other file systems)
(3) Any news from ZoneFS (+ ZenFS maybe)?
(4) New ZNS standard features that we need to support on block layer + FS levels?
(5) ZNS drive emulation + additional testing features?
(6) ZNS + computational drive? What new features would we like to see from ZNS SSD?
(7) ZNS + CXL: does it make sense?

Thanks,
Slava

^ permalink raw reply

* [PATCH v4 7/7] iov_iter, block: Make bio structs pin pages rather than ref'ing if appropriate
From: David Howells @ 2023-01-07  0:34 UTC (permalink / raw)
  To: Al Viro
  Cc: Jens Axboe, Christoph Hellwig, Matthew Wilcox, Logan Gunthorpe,
	dhowells, Christoph Hellwig, Matthew Wilcox, Jens Axboe,
	Jeff Layton, Logan Gunthorpe, linux-fsdevel, linux-block,
	linux-kernel
In-Reply-To: <167305160937.1521586.133299343565358971.stgit@warthog.procyon.org.uk>

Convert the block layer's bio code to use iov_iter_extract_pages() instead
of iov_iter_get_pages().  This will pin pages or leave them unaltered
rather than getting a ref on them as appropriate to the source iterator.

A field, bi_cleanup_mode, is added to the bio struct that gets set by
iov_iter_extract_pages() with FOLL_* flags indicating what cleanup is
necessary.  FOLL_GET -> put_page(), FOLL_PIN -> unpin_user_page().  Other
flags could also be used in future.

Newly allocated bio structs have bi_cleanup_mode set to FOLL_GET to
indicate that attached pages are ref'd by default.  Cloning sets it to 0.
__bio_iov_iter_get_pages() overrides it to what iov_iter_extract_pages()
indicates.

[!] Note that this is tested a bit with ext4, but nothing else.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Christoph Hellwig <hch@lst.de>
cc: Matthew Wilcox <willy@infradead.org>
cc: Logan Gunthorpe <logang@deltatee.com>
---

 block/bio.c               |   47 +++++++++++++++++++++++++++++++++------------
 include/linux/blk_types.h |    1 +
 2 files changed, 35 insertions(+), 13 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 5f96fcae3f75..eafcbeba0bab 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -243,6 +243,11 @@ static void bio_free(struct bio *bio)
  * Users of this function have their own bio allocation. Subsequently,
  * they must remember to pair any call to bio_init() with bio_uninit()
  * when IO has completed, or when the bio is released.
+ *
+ * We set the initial assumption that pages attached to the bio will be
+ * released with put_page() by setting bi_cleanup_mode to FOLL_GET, but this
+ * should be set to FOLL_PIN if the page should be unpinned instead; if the
+ * pages should not be put or unpinned, this should be set to 0
  */
 void bio_init(struct bio *bio, struct block_device *bdev, struct bio_vec *table,
 	      unsigned short max_vecs, blk_opf_t opf)
@@ -274,6 +279,7 @@ void bio_init(struct bio *bio, struct block_device *bdev, struct bio_vec *table,
 #ifdef CONFIG_BLK_DEV_INTEGRITY
 	bio->bi_integrity = NULL;
 #endif
+	bio->bi_cleanup_mode = FOLL_GET;
 	bio->bi_vcnt = 0;
 
 	atomic_set(&bio->__bi_remaining, 1);
@@ -302,6 +308,7 @@ void bio_reset(struct bio *bio, struct block_device *bdev, blk_opf_t opf)
 {
 	bio_uninit(bio);
 	memset(bio, 0, BIO_RESET_BYTES);
+	bio->bi_cleanup_mode = FOLL_GET;
 	atomic_set(&bio->__bi_remaining, 1);
 	bio->bi_bdev = bdev;
 	if (bio->bi_bdev)
@@ -814,6 +821,7 @@ static int __bio_clone(struct bio *bio, struct bio *bio_src, gfp_t gfp)
 	bio_set_flag(bio, BIO_CLONED);
 	bio->bi_ioprio = bio_src->bi_ioprio;
 	bio->bi_iter = bio_src->bi_iter;
+	bio->bi_cleanup_mode = 0;
 
 	if (bio->bi_bdev) {
 		if (bio->bi_bdev == bio_src->bi_bdev &&
@@ -1168,6 +1176,18 @@ bool bio_add_folio(struct bio *bio, struct folio *folio, size_t len,
 	return bio_add_page(bio, &folio->page, len, off) > 0;
 }
 
+/*
+ * Clean up a page according to the mode indicated by iov_iter_extract_pages(),
+ * where the page is may be pinned or may have a ref taken on it.
+ */
+static void bio_release_page(struct bio *bio, struct page *page)
+{
+	if (bio->bi_cleanup_mode & FOLL_PIN)
+		unpin_user_page(page);
+	if (bio->bi_cleanup_mode & FOLL_GET)
+		put_page(page);
+}
+
 void __bio_release_pages(struct bio *bio, bool mark_dirty)
 {
 	struct bvec_iter_all iter_all;
@@ -1176,7 +1196,7 @@ void __bio_release_pages(struct bio *bio, bool mark_dirty)
 	bio_for_each_segment_all(bvec, bio, iter_all) {
 		if (mark_dirty && !PageCompound(bvec->bv_page))
 			set_page_dirty_lock(bvec->bv_page);
-		put_page(bvec->bv_page);
+		bio_release_page(bio, bvec->bv_page);
 	}
 }
 EXPORT_SYMBOL_GPL(__bio_release_pages);
@@ -1213,7 +1233,7 @@ static int bio_iov_add_page(struct bio *bio, struct page *page,
 	}
 
 	if (same_page)
-		put_page(page);
+		bio_release_page(bio, page);
 	return 0;
 }
 
@@ -1227,7 +1247,7 @@ static int bio_iov_add_zone_append_page(struct bio *bio, struct page *page,
 			queue_max_zone_append_sectors(q), &same_page) != len)
 		return -EINVAL;
 	if (same_page)
-		put_page(page);
+		bio_release_page(bio, page);
 	return 0;
 }
 
@@ -1238,10 +1258,10 @@ static int bio_iov_add_zone_append_page(struct bio *bio, struct page *page,
  * @bio: bio to add pages to
  * @iter: iov iterator describing the region to be mapped
  *
- * Pins pages from *iter and appends them to @bio's bvec array. The
- * pages will have to be released using put_page() when done.
- * For multi-segment *iter, this function only adds pages from the
- * next non-empty segment of the iov iterator.
+ * Pins pages from *iter and appends them to @bio's bvec array.  The pages will
+ * have to be released using put_page() or unpin_user_page() when done as
+ * according to bi_cleanup_mode.  For multi-segment *iter, this function only
+ * adds pages from the next non-empty segment of the iov iterator.
  */
 static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 {
@@ -1273,9 +1293,10 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	 * result to ensure the bio's total size is correct. The remainder of
 	 * the iov data will be picked up in the next bio iteration.
 	 */
-	size = iov_iter_get_pages(iter, pages,
-				  UINT_MAX - bio->bi_iter.bi_size,
-				  nr_pages, &offset, gup_flags);
+	size = iov_iter_extract_pages(iter, &pages,
+				      UINT_MAX - bio->bi_iter.bi_size,
+				      nr_pages, gup_flags,
+				      &offset, &bio->bi_cleanup_mode);
 	if (unlikely(size <= 0))
 		return size ? size : -EFAULT;
 
@@ -1308,7 +1329,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	iov_iter_revert(iter, left);
 out:
 	while (i < nr_pages)
-		put_page(pages[i++]);
+		bio_release_page(bio, pages[i++]);
 
 	return ret;
 }
@@ -1489,8 +1510,8 @@ void bio_set_pages_dirty(struct bio *bio)
  * the BIO and re-dirty the pages in process context.
  *
  * It is expected that bio_check_pages_dirty() will wholly own the BIO from
- * here on.  It will run one put_page() against each page and will run one
- * bio_put() against the BIO.
+ * here on.  It will run one put_page() or unpin_user_page() against each page
+ * and will run one bio_put() against the BIO.
  */
 
 static void bio_dirty_fn(struct work_struct *work);
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 99be590f952f..883f873a01ef 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -289,6 +289,7 @@ struct bio {
 #endif
 	};
 
+	unsigned int		bi_cleanup_mode; /* How to clean up pages */
 	unsigned short		bi_vcnt;	/* how many bio_vec's */
 
 	/*



^ permalink raw reply related

* [PATCH v4 5/7] netfs: Add a function to extract a UBUF or IOVEC into a BVEC iterator
From: David Howells @ 2023-01-07  0:34 UTC (permalink / raw)
  To: Al Viro
  Cc: Jeff Layton, Steve French, Shyam Prasad N, Rohith Surabattula,
	linux-cachefs, linux-cifs, linux-fsdevel, dhowells,
	Christoph Hellwig, Matthew Wilcox, Jens Axboe, Jeff Layton,
	Logan Gunthorpe, linux-fsdevel, linux-block, linux-kernel
In-Reply-To: <167305160937.1521586.133299343565358971.stgit@warthog.procyon.org.uk>

Add a function to extract the pages from a user-space supplied iterator
(UBUF- or IOVEC-type) into a BVEC-type iterator, retaining the pages by
getting a ref on them (ITER_SOURCE, ie. WRITE) or pinning them (ITER_DEST,
ie. READ) as we go.

This is useful in three situations:

 (1) A userspace thread may have a sibling that unmaps or remaps the
     process's VM during the operation, changing the assignment of the
     pages and potentially causing an error.  Retaining the pages keeps
     some pages around, even if this occurs; futher, we find out at the
     point of extraction if EFAULT is going to be incurred.

 (2) Pages might get swapped out/discarded if not retained, so we want to
     retain them to avoid the reload causing a deadlock due to a DIO
     from/to an mmapped region on the same file.

 (3) The iterator may get passed to sendmsg() by the filesystem.  If a
     fault occurs, we may get a short write to a TCP stream that's then
     tricky to recover from.

We don't deal with other types of iterator here, leaving it to other
mechanisms to retain the pages (eg. PG_locked, PG_writeback and the pipe
lock).

Changes:
========
ver #3)
 - Switch to using EXPORT_SYMBOL_GPL to prevent indirect 3rd-party access
   to get/pin_user_pages_fast()[1].

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: Steve French <sfrench@samba.org>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: linux-cachefs@redhat.com
cc: linux-cifs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org

Link: https://lore.kernel.org/r/Y3zFzdWnWlEJ8X8/@infradead.org/ [1]
Link: https://lore.kernel.org/r/166697255265.61150.6289490555867717077.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166732026503.3186319.12020462741051772825.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166869690376.3723671.8813331570219190705.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166920904810.1461876.11603559311247187100.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/166997422579.9475.12101700945635692496.stgit@warthog.procyon.org.uk/ # v3
---

 fs/netfs/Makefile     |    1 
 fs/netfs/iterator.c   |   99 +++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/netfs.h |    3 +
 3 files changed, 103 insertions(+)
 create mode 100644 fs/netfs/iterator.c

diff --git a/fs/netfs/Makefile b/fs/netfs/Makefile
index f684c0cd1ec5..386d6fb92793 100644
--- a/fs/netfs/Makefile
+++ b/fs/netfs/Makefile
@@ -3,6 +3,7 @@
 netfs-y := \
 	buffered_read.o \
 	io.o \
+	iterator.o \
 	main.o \
 	objects.o
 
diff --git a/fs/netfs/iterator.c b/fs/netfs/iterator.c
new file mode 100644
index 000000000000..7d802d21b9c5
--- /dev/null
+++ b/fs/netfs/iterator.c
@@ -0,0 +1,99 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/* Iterator helpers.
+ *
+ * Copyright (C) 2022 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ */
+
+#include <linux/export.h>
+#include <linux/slab.h>
+#include <linux/uio.h>
+#include <linux/netfs.h>
+#include "internal.h"
+
+/**
+ * netfs_extract_user_iter - Extract the pages from a user iterator into a bvec
+ * @orig: The original iterator
+ * @orig_len: The amount of iterator to copy
+ * @new: The iterator to be set up
+ * @cleanup_mode: Where to indicate the cleanup mode
+ *
+ * Extract the page fragments from the given amount of the source iterator and
+ * build up a second iterator that refers to all of those bits.  This allows
+ * the original iterator to disposed of.
+ *
+ * On success, the number of elements in the bvec is returned, the original
+ * iterator will have been advanced by the amount extracted and @*cleanup_mode
+ * will have been set to FOLL_GET, FOLL_PIN or 0.
+ */
+ssize_t netfs_extract_user_iter(struct iov_iter *orig, size_t orig_len,
+				struct iov_iter *new, unsigned int *cleanup_mode)
+{
+	struct bio_vec *bv = NULL;
+	struct page **pages;
+	unsigned int cur_npages;
+	unsigned int max_pages;
+	unsigned int npages = 0;
+	unsigned int i;
+	ssize_t ret;
+	size_t count = orig_len, offset, len;
+	size_t bv_size, pg_size;
+
+	if (WARN_ON_ONCE(!iter_is_ubuf(orig) && !iter_is_iovec(orig)))
+		return -EIO;
+
+	max_pages = iov_iter_npages(orig, INT_MAX);
+	bv_size = array_size(max_pages, sizeof(*bv));
+	bv = kvmalloc(bv_size, GFP_KERNEL);
+	if (!bv)
+		return -ENOMEM;
+
+	*cleanup_mode = 0;
+
+	/* Put the page list at the end of the bvec list storage.  bvec
+	 * elements are larger than page pointers, so as long as we work
+	 * 0->last, we should be fine.
+	 */
+	pg_size = array_size(max_pages, sizeof(*pages));
+	pages = (void *)bv + bv_size - pg_size;
+
+	while (count && npages < max_pages) {
+		ret = iov_iter_extract_pages(orig, &pages, count,
+					     max_pages - npages, 0,
+					     &offset, cleanup_mode);
+		if (ret < 0) {
+			pr_err("Couldn't get user pages (rc=%zd)\n", ret);
+			break;
+		}
+
+		if (ret > count) {
+			pr_err("get_pages rc=%zd more than %zu\n", ret, count);
+			break;
+		}
+
+		count -= ret;
+		ret += offset;
+		cur_npages = DIV_ROUND_UP(ret, PAGE_SIZE);
+
+		if (npages + cur_npages > max_pages) {
+			pr_err("Out of bvec array capacity (%u vs %u)\n",
+			       npages + cur_npages, max_pages);
+			break;
+		}
+
+		for (i = 0; i < cur_npages; i++) {
+			len = ret > PAGE_SIZE ? PAGE_SIZE : ret;
+			bv[npages + i].bv_page	 = *pages++;
+			bv[npages + i].bv_offset = offset;
+			bv[npages + i].bv_len	 = len - offset;
+			ret -= len;
+			offset = 0;
+		}
+
+		npages += cur_npages;
+	}
+
+	iov_iter_bvec(new, iov_iter_rw(orig), bv, npages, orig_len - count);
+	return npages;
+}
+EXPORT_SYMBOL_GPL(netfs_extract_user_iter);
diff --git a/include/linux/netfs.h b/include/linux/netfs.h
index 4c76ddfb6a67..26fe3e6bafa1 100644
--- a/include/linux/netfs.h
+++ b/include/linux/netfs.h
@@ -296,6 +296,9 @@ void netfs_get_subrequest(struct netfs_io_subrequest *subreq,
 void netfs_put_subrequest(struct netfs_io_subrequest *subreq,
 			  bool was_async, enum netfs_sreq_ref_trace what);
 void netfs_stats_show(struct seq_file *);
+ssize_t netfs_extract_user_iter(struct iov_iter *orig, size_t orig_len,
+				struct iov_iter *new,
+				unsigned int *cleanup_mode);
 
 /**
  * netfs_inode - Get the netfs inode context from the inode



^ permalink raw reply related

* [PATCH v4 4/7] iov_iter: Add a function to extract a page list from an iterator
From: David Howells @ 2023-01-07  0:33 UTC (permalink / raw)
  To: Al Viro
  Cc: Christoph Hellwig, John Hubbard, Matthew Wilcox, linux-fsdevel,
	linux-mm, dhowells, Christoph Hellwig, Matthew Wilcox, Jens Axboe,
	Jeff Layton, Logan Gunthorpe, linux-fsdevel, linux-block,
	linux-kernel
In-Reply-To: <167305160937.1521586.133299343565358971.stgit@warthog.procyon.org.uk>

Add a function, iov_iter_extract_pages(), to extract a list of pages from
an iterator.  The pages may be returned with a reference added or a pin
added or neither, depending on the type of iterator and the direction of
transfer.

The function also indicates the mode of retention that was employed for an
iterator - and therefore how the caller should dispose of the pages later.

There are three cases:

 (1) Transfer *into* an ITER_IOVEC or ITER_UBUF iterator.

     Extracted pages will have pins obtained on them (but not references)
     so that fork() doesn't CoW the pages incorrectly whilst the I/O is in
     progress.

     The indicated mode of retention will be FOLL_PIN for this case.  The
     caller should use something like unpin_user_page() to dispose of the
     page.

 (2) Transfer is *out of* an ITER_IOVEC or ITER_UBUF iterator.

     Extracted pages will have references obtained on them, but not pins.

     The indicated mode of retention will be FOLL_GET.  The caller should
     use something like put_page() for page disposal.

 (3) Any other sort of iterator.

     No refs or pins are obtained on the page, the assumption is made that
     the caller will manage page retention.

     The indicated mode of retention will be 0.  The pages don't need
     additional disposal.

Changes:
========
vet #4)
 - Use ITER_SOURCE/DEST instead of WRITE/READ.
 - Allow additional FOLL_* flags, such as FOLL_PCI_P2PDMA to be passed in.

ver #3)
 - Switch to using EXPORT_SYMBOL_GPL to prevent indirect 3rd-party access
   to get/pin_user_pages_fast()[1].

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Christoph Hellwig <hch@lst.de>
cc: John Hubbard <jhubbard@nvidia.com>
cc: Matthew Wilcox <willy@infradead.org>
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org

Link: https://lore.kernel.org/r/Y3zFzdWnWlEJ8X8/@infradead.org/ [1]
Link: https://lore.kernel.org/r/166722777971.2555743.12953624861046741424.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166732025748.3186319.8314014902727092626.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166869689451.3723671.18242195992447653092.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166920903885.1461876.692029808682876184.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/166997421646.9475.14837976344157464997.stgit@warthog.procyon.org.uk/ # v3
---

 include/linux/uio.h |    5 +
 lib/iov_iter.c      |  361 +++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 366 insertions(+)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index acb1ae3324ed..9a36b4cddb28 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -382,4 +382,9 @@ static inline void iov_iter_ubuf(struct iov_iter *i, enum iter_dir direction,
 	};
 }
 
+ssize_t iov_iter_extract_pages(struct iov_iter *i, struct page ***pages,
+			       size_t maxsize, unsigned int maxpages,
+			       unsigned int gup_flags,
+			       size_t *offset0, unsigned int *cleanup_mode);
+
 #endif
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index fec1c5513197..dc6db5ad108b 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1914,3 +1914,364 @@ void iov_iter_restore(struct iov_iter *i, struct iov_iter_state *state)
 		i->iov -= state->nr_segs - i->nr_segs;
 	i->nr_segs = state->nr_segs;
 }
+
+/*
+ * Extract a list of contiguous pages from an ITER_PIPE iterator.  This does
+ * not get references of its own on the pages, nor does it get a pin on them.
+ * If there's a partial page, it adds that first and will then allocate and add
+ * pages into the pipe to make up the buffer space to the amount required.
+ *
+ * The caller must hold the pipe locked and only transferring into a pipe is
+ * supported.
+ */
+static ssize_t iov_iter_extract_pipe_pages(struct iov_iter *i,
+					   struct page ***pages, size_t maxsize,
+					   unsigned int maxpages,
+					   unsigned int gup_flags,
+					   size_t *offset0,
+					   unsigned int *cleanup_mode)
+{
+	unsigned int nr, offset, chunk, j;
+	struct page **p;
+	size_t left;
+
+	if (!sanity(i))
+		return -EFAULT;
+
+	offset = pipe_npages(i, &nr);
+	if (!nr)
+		return -EFAULT;
+	*offset0 = offset;
+
+	maxpages = min_t(size_t, nr, maxpages);
+	maxpages = want_pages_array(pages, maxsize, offset, maxpages);
+	if (!maxpages)
+		return -ENOMEM;
+	p = *pages;
+
+	left = maxsize;
+	for (j = 0; j < maxpages; j++) {
+		struct page *page = append_pipe(i, left, &offset);
+		if (!page)
+			break;
+		chunk = min_t(size_t, left, PAGE_SIZE - offset);
+		left -= chunk;
+		*p++ = page;
+	}
+	if (!j)
+		return -EFAULT;
+	*cleanup_mode = 0;
+	return maxsize - left;
+}
+
+/*
+ * Extract a list of contiguous pages from an ITER_XARRAY iterator.  This does not
+ * get references on the pages, nor does it get a pin on them.
+ */
+static ssize_t iov_iter_extract_xarray_pages(struct iov_iter *i,
+					     struct page ***pages, size_t maxsize,
+					     unsigned int maxpages,
+					     unsigned int gup_flags,
+					     size_t *offset0,
+					     unsigned int *cleanup_mode)
+{
+	struct page *page, **p;
+	unsigned int nr = 0, offset;
+	loff_t pos = i->xarray_start + i->iov_offset;
+	pgoff_t index = pos >> PAGE_SHIFT;
+	XA_STATE(xas, i->xarray, index);
+
+	offset = pos & ~PAGE_MASK;
+	*offset0 = offset;
+
+	maxpages = want_pages_array(pages, maxsize, offset, maxpages);
+	if (!maxpages)
+		return -ENOMEM;
+	p = *pages;
+
+	rcu_read_lock();
+	for (page = xas_load(&xas); page; page = xas_next(&xas)) {
+		if (xas_retry(&xas, page))
+			continue;
+
+		/* Has the page moved or been split? */
+		if (unlikely(page != xas_reload(&xas))) {
+			xas_reset(&xas);
+			continue;
+		}
+
+		p[nr++] = find_subpage(page, xas.xa_index);
+		if (nr == maxpages)
+			break;
+	}
+	rcu_read_unlock();
+
+	maxsize = min_t(size_t, nr * PAGE_SIZE - offset, maxsize);
+	i->iov_offset += maxsize;
+	i->count -= maxsize;
+	*cleanup_mode = 0;
+	return maxsize;
+}
+
+/*
+ * Extract a list of contiguous pages from an ITER_BVEC iterator.  This does
+ * not get references on the pages, nor does it get a pin on them.
+ */
+static ssize_t iov_iter_extract_bvec_pages(struct iov_iter *i,
+					   struct page ***pages, size_t maxsize,
+					   unsigned int maxpages,
+					   unsigned int gup_flags,
+					   size_t *offset0,
+					   unsigned int *cleanup_mode)
+{
+	struct page **p, *page;
+	size_t skip = i->iov_offset, offset;
+	int k;
+
+	maxsize = min(maxsize, i->bvec->bv_len - skip);
+	skip += i->bvec->bv_offset;
+	page = i->bvec->bv_page + skip / PAGE_SIZE;
+	offset = skip % PAGE_SIZE;
+	*offset0 = offset;
+
+	maxpages = want_pages_array(pages, maxsize, offset, maxpages);
+	if (!maxpages)
+		return -ENOMEM;
+	p = *pages;
+	for (k = 0; k < maxpages; k++)
+		p[k] = page + k;
+
+	maxsize = min_t(size_t, maxsize, maxpages * PAGE_SIZE - offset);
+	i->count -= maxsize;
+	i->iov_offset += maxsize;
+	if (i->iov_offset == i->bvec->bv_len) {
+		i->iov_offset = 0;
+		i->bvec++;
+		i->nr_segs--;
+	}
+	*cleanup_mode = 0;
+	return maxsize;
+}
+
+/*
+ * Get the first segment from an ITER_UBUF or ITER_IOVEC iterator.  The
+ * iterator must not be empty.
+ */
+static unsigned long iov_iter_extract_first_user_segment(const struct iov_iter *i,
+							 size_t *size)
+{
+	size_t skip;
+	long k;
+
+	if (iter_is_ubuf(i))
+		return (unsigned long)i->ubuf + i->iov_offset;
+
+	for (k = 0, skip = i->iov_offset; k < i->nr_segs; k++, skip = 0) {
+		size_t len = i->iov[k].iov_len - skip;
+
+		if (unlikely(!len))
+			continue;
+		if (*size > len)
+			*size = len;
+		return (unsigned long)i->iov[k].iov_base + skip;
+	}
+	BUG(); // if it had been empty, we wouldn't get called
+}
+
+/*
+ * Extract a list of contiguous pages from a user iterator and get references
+ * on them.  This should only be used iff the iterator is user-backed
+ * (IOBUF/UBUF) and data is being transferred out of the buffer described by
+ * the iterator (ie. this is the source).
+ *
+ * The pages are returned with incremented refcounts that the caller must undo
+ * once the transfer is complete, but no additional pins are obtained.
+ *
+ * This is only safe to be used where background IO/DMA is not going to be
+ * modifying the buffer, and so won't cause a problem with CoW on fork.
+ */
+static ssize_t iov_iter_extract_user_pages_and_get(struct iov_iter *i,
+						   struct page ***pages,
+						   size_t maxsize,
+						   unsigned int maxpages,
+						   unsigned int gup_flags,
+						   size_t *offset0,
+						   unsigned int *cleanup_mode)
+{
+	unsigned long addr;
+	size_t offset;
+	int res;
+
+	if (WARN_ON_ONCE(!iov_iter_is_source(i)))
+		return -EFAULT;
+
+	gup_flags |= FOLL_GET;
+	if (i->nofault)
+		gup_flags |= FOLL_NOFAULT;
+
+	addr = iov_iter_extract_first_user_segment(i, &maxsize);
+	*offset0 = offset = addr % PAGE_SIZE;
+	addr &= PAGE_MASK;
+	maxpages = want_pages_array(pages, maxsize, offset, maxpages);
+	if (!maxpages)
+		return -ENOMEM;
+	res = get_user_pages_fast(addr, maxpages, gup_flags, *pages);
+	if (unlikely(res <= 0))
+		return res;
+	maxsize = min_t(size_t, maxsize, res * PAGE_SIZE - offset);
+	iov_iter_advance(i, maxsize);
+	*cleanup_mode = FOLL_GET;
+	return maxsize;
+}
+
+/*
+ * Extract a list of contiguous pages from a user iterator and get a pin on
+ * each of them.  This should only be used iff the iterator is user-backed
+ * (IOBUF/UBUF) and data is being transferred into the buffer described by the
+ * iterator (ie. this is the destination).
+ *
+ * It does not get refs on the pages, but the pages must be unpinned by the
+ * caller once the transfer is complete.
+ *
+ * This is safe to be used where background IO/DMA *is* going to be modifying
+ * the buffer; using a pin rather than a ref makes sure that CoW happens
+ * correctly in the parent during fork.
+ */
+static ssize_t iov_iter_extract_user_pages_and_pin(struct iov_iter *i,
+						   struct page ***pages,
+						   size_t maxsize,
+						   unsigned int maxpages,
+						   unsigned int gup_flags,
+						   size_t *offset0,
+						   unsigned int *cleanup_mode)
+{
+	unsigned long addr;
+	size_t offset;
+	int res;
+
+	if (WARN_ON_ONCE(!iov_iter_is_dest(i)))
+		return -EFAULT;
+
+	gup_flags |= FOLL_PIN | FOLL_WRITE;
+	if (i->nofault)
+		gup_flags |= FOLL_NOFAULT;
+
+	addr = first_iovec_segment(i, &maxsize);
+	*offset0 = offset = addr % PAGE_SIZE;
+	addr &= PAGE_MASK;
+	maxpages = want_pages_array(pages, maxsize, offset, maxpages);
+	if (!maxpages)
+		return -ENOMEM;
+	res = pin_user_pages_fast(addr, maxpages, gup_flags, *pages);
+	if (unlikely(res <= 0))
+		return res;
+	maxsize = min_t(size_t, maxsize, res * PAGE_SIZE - offset);
+	iov_iter_advance(i, maxsize);
+	*cleanup_mode = FOLL_PIN;
+	return maxsize;
+}
+
+static ssize_t iov_iter_extract_user_pages(struct iov_iter *i,
+					   struct page ***pages, size_t maxsize,
+					   unsigned int maxpages,
+					   unsigned int gup_flags,
+					   size_t *offset0,
+					   unsigned int *cleanup_mode)
+{
+	if (i->data_source)
+		return iov_iter_extract_user_pages_and_get(i, pages, maxsize,
+							   maxpages, gup_flags,
+							   offset0, cleanup_mode);
+	else
+		return iov_iter_extract_user_pages_and_pin(i, pages, maxsize,
+							   maxpages, gup_flags,
+							   offset0, cleanup_mode);
+}
+
+/**
+ * iov_iter_extract_pages - Extract a list of contiguous pages from an iterator
+ * @i: The iterator to extract from
+ * @pages: Where to return the list of pages
+ * @maxsize: The maximum amount of iterator to extract
+ * @maxpages: The maximum size of the list of pages
+ * @gup_flags: Addition flags when getting pages from a user-backed iterator
+ * @offset0: Where to return the starting offset into (*@pages)[0]
+ * @cleanup_mode: Where to return the cleanup mode
+ *
+ * Extract a list of contiguous pages from the current point of the iterator,
+ * advancing the iterator.  The maximum number of pages and the maximum amount
+ * of page contents can be set.
+ *
+ * If *@pages is NULL, a page list will be allocated to the required size and
+ * *@pages will be set to its base.  If *@pages is not NULL, it will be assumed
+ * that the caller allocated a page list at least @maxpages in size and this
+ * will be filled in.
+ *
+ * Extra refs or pins on the pages may be obtained as follows:
+ *
+ *  (*) If the iterator is user-backed (ITER_IOVEC/ITER_UBUF) and data is to be
+ *      transferred /OUT OF/ the described buffer, refs will be taken on the
+ *      pages, but pins will not be added.  This can be used for DMA from a
+ *      page; it cannot be used for DMA to a page, as it may cause page-COW
+ *      problems in fork.  *@cleanup_mode will be set to FOLL_GET.
+ *
+ *  (*) If the iterator is user-backed (ITER_IOVEC/ITER_UBUF) and data is to be
+ *      transferred /INTO/ the described buffer, pins will be added to the
+ *      pages, but refs will not be taken.  This must be used for DMA to a
+ *      page.  *@cleanup_mode will be set to FOLL_PIN.
+ *
+ *  (*) If the iterator is ITER_PIPE, this must describe a destination for the
+ *      data.  Additional pages may be allocated and added to the pipe (which
+ *      will hold the refs), but neither refs nor pins will be obtained for the
+ *      caller.  The caller must hold the pipe lock.  *@cleanup_mode will be
+ *      set to 0.
+ *
+ *  (*) If the iterator is ITER_BVEC or ITER_XARRAY, the pages are merely
+ *      listed; no extra refs or pins are obtained.  *@cleanup_mode will be set
+ *      to 0.
+ *
+ * Note also:
+ *
+ *  (*) Use with ITER_KVEC is not supported as that may refer to memory that
+ *      doesn't have associated page structs.
+ *
+ *  (*) Use with ITER_DISCARD is not supported as that has no content.
+ *
+ * On success, the function sets *@pages to the new pagelist, if allocated, and
+ * sets *offset0 to the offset into the first page, *cleanup_mode to the
+ * cleanup required and returns the amount of buffer space added represented by
+ * the page list.
+ *
+ * It may also return -ENOMEM and -EFAULT.
+ */
+ssize_t iov_iter_extract_pages(struct iov_iter *i,
+			       struct page ***pages,
+			       size_t maxsize,
+			       unsigned int maxpages,
+			       unsigned int gup_flags,
+			       size_t *offset0,
+			       unsigned int *cleanup_mode)
+{
+	maxsize = min_t(size_t, min_t(size_t, maxsize, i->count), MAX_RW_COUNT);
+	if (!maxsize)
+		return 0;
+
+	if (likely(user_backed_iter(i)))
+		return iov_iter_extract_user_pages(i, pages, maxsize,
+						   maxpages, gup_flags,
+						   offset0, cleanup_mode);
+	if (iov_iter_is_bvec(i))
+		return iov_iter_extract_bvec_pages(i, pages, maxsize,
+						   maxpages, gup_flags,
+						   offset0, cleanup_mode);
+	if (iov_iter_is_pipe(i))
+		return iov_iter_extract_pipe_pages(i, pages, maxsize,
+						   maxpages, gup_flags,
+						   offset0, cleanup_mode);
+	if (iov_iter_is_xarray(i))
+		return iov_iter_extract_xarray_pages(i, pages, maxsize,
+						     maxpages, gup_flags,
+						     offset0, cleanup_mode);
+	return -EFAULT;
+}
+EXPORT_SYMBOL_GPL(iov_iter_extract_pages);



^ permalink raw reply related

* [PATCH v4 2/7] iov_iter: Use the direction in the iterator functions
From: David Howells @ 2023-01-07  0:33 UTC (permalink / raw)
  To: Al Viro
  Cc: dhowells, Christoph Hellwig, Matthew Wilcox, Jens Axboe,
	Jeff Layton, Logan Gunthorpe, linux-fsdevel, linux-block,
	linux-kernel
In-Reply-To: <167305160937.1521586.133299343565358971.stgit@warthog.procyon.org.uk>

Use the direction in the iterator functions rather than READ/WRITE.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Al Viro <viro@zeniv.linux.org.uk>
---

 include/linux/uio.h |   22 +++++++++++-----------
 lib/iov_iter.c      |   46 +++++++++++++++++++++++-----------------------
 2 files changed, 34 insertions(+), 34 deletions(-)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 4b0f4a773d90..acb1ae3324ed 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -261,16 +261,16 @@ bool iov_iter_is_aligned(const struct iov_iter *i, unsigned addr_mask,
 			unsigned len_mask);
 unsigned long iov_iter_alignment(const struct iov_iter *i);
 unsigned long iov_iter_gap_alignment(const struct iov_iter *i);
-void iov_iter_init(struct iov_iter *i, unsigned int direction, const struct iovec *iov,
+void iov_iter_init(struct iov_iter *i, enum iter_dir direction, const struct iovec *iov,
 			unsigned long nr_segs, size_t count);
-void iov_iter_kvec(struct iov_iter *i, unsigned int direction, const struct kvec *kvec,
+void iov_iter_kvec(struct iov_iter *i, enum iter_dir direction, const struct kvec *kvec,
 			unsigned long nr_segs, size_t count);
-void iov_iter_bvec(struct iov_iter *i, unsigned int direction, const struct bio_vec *bvec,
+void iov_iter_bvec(struct iov_iter *i, enum iter_dir direction, const struct bio_vec *bvec,
 			unsigned long nr_segs, size_t count);
-void iov_iter_pipe(struct iov_iter *i, unsigned int direction, struct pipe_inode_info *pipe,
+void iov_iter_pipe(struct iov_iter *i, enum iter_dir direction, struct pipe_inode_info *pipe,
 			size_t count);
-void iov_iter_discard(struct iov_iter *i, unsigned int direction, size_t count);
-void iov_iter_xarray(struct iov_iter *i, unsigned int direction, struct xarray *xarray,
+void iov_iter_discard(struct iov_iter *i, enum iter_dir direction, size_t count);
+void iov_iter_xarray(struct iov_iter *i, enum iter_dir direction, struct xarray *xarray,
 		     loff_t start, size_t count);
 ssize_t iov_iter_get_pages(struct iov_iter *i, struct page **pages,
 		size_t maxsize, unsigned maxpages, size_t *start,
@@ -360,19 +360,19 @@ size_t hash_and_copy_to_iter(const void *addr, size_t bytes, void *hashp,
 struct iovec *iovec_from_user(const struct iovec __user *uvector,
 		unsigned long nr_segs, unsigned long fast_segs,
 		struct iovec *fast_iov, bool compat);
-ssize_t import_iovec(int type, const struct iovec __user *uvec,
+ssize_t import_iovec(enum iter_dir direction, const struct iovec __user *uvec,
 		 unsigned nr_segs, unsigned fast_segs, struct iovec **iovp,
 		 struct iov_iter *i);
-ssize_t __import_iovec(int type, const struct iovec __user *uvec,
+ssize_t __import_iovec(enum iter_dir direction, const struct iovec __user *uvec,
 		 unsigned nr_segs, unsigned fast_segs, struct iovec **iovp,
 		 struct iov_iter *i, bool compat);
-int import_single_range(int type, void __user *buf, size_t len,
+int import_single_range(enum iter_dir direction, void __user *buf, size_t len,
 		 struct iovec *iov, struct iov_iter *i);
 
-static inline void iov_iter_ubuf(struct iov_iter *i, unsigned int direction,
+static inline void iov_iter_ubuf(struct iov_iter *i, enum iter_dir direction,
 			void __user *buf, size_t count)
 {
-	WARN_ON(direction & ~(READ | WRITE));
+	WARN_ON(!iov_iter_dir_valid(direction));
 	*i = (struct iov_iter) {
 		.iter_type = ITER_UBUF,
 		.user_backed = true,
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index f9a3ff37ecd1..fec1c5513197 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -421,11 +421,11 @@ size_t fault_in_iov_iter_writeable(const struct iov_iter *i, size_t size)
 }
 EXPORT_SYMBOL(fault_in_iov_iter_writeable);
 
-void iov_iter_init(struct iov_iter *i, unsigned int direction,
+void iov_iter_init(struct iov_iter *i, enum iter_dir direction,
 			const struct iovec *iov, unsigned long nr_segs,
 			size_t count)
 {
-	WARN_ON(direction & ~(READ | WRITE));
+	WARN_ON(!iov_iter_dir_valid(direction));
 	*i = (struct iov_iter) {
 		.iter_type = ITER_IOVEC,
 		.nofault = false,
@@ -994,11 +994,11 @@ size_t iov_iter_single_seg_count(const struct iov_iter *i)
 }
 EXPORT_SYMBOL(iov_iter_single_seg_count);
 
-void iov_iter_kvec(struct iov_iter *i, unsigned int direction,
+void iov_iter_kvec(struct iov_iter *i, enum iter_dir direction,
 			const struct kvec *kvec, unsigned long nr_segs,
 			size_t count)
 {
-	WARN_ON(direction & ~(READ | WRITE));
+	WARN_ON(!iov_iter_dir_valid(direction));
 	*i = (struct iov_iter){
 		.iter_type = ITER_KVEC,
 		.data_source = direction,
@@ -1010,11 +1010,11 @@ void iov_iter_kvec(struct iov_iter *i, unsigned int direction,
 }
 EXPORT_SYMBOL(iov_iter_kvec);
 
-void iov_iter_bvec(struct iov_iter *i, unsigned int direction,
+void iov_iter_bvec(struct iov_iter *i, enum iter_dir direction,
 			const struct bio_vec *bvec, unsigned long nr_segs,
 			size_t count)
 {
-	WARN_ON(direction & ~(READ | WRITE));
+	WARN_ON(!iov_iter_dir_valid(direction));
 	*i = (struct iov_iter){
 		.iter_type = ITER_BVEC,
 		.data_source = direction,
@@ -1026,15 +1026,15 @@ void iov_iter_bvec(struct iov_iter *i, unsigned int direction,
 }
 EXPORT_SYMBOL(iov_iter_bvec);
 
-void iov_iter_pipe(struct iov_iter *i, unsigned int direction,
+void iov_iter_pipe(struct iov_iter *i, enum iter_dir direction,
 			struct pipe_inode_info *pipe,
 			size_t count)
 {
-	BUG_ON(direction != READ);
+	BUG_ON(direction != ITER_DEST);
 	WARN_ON(pipe_full(pipe->head, pipe->tail, pipe->ring_size));
 	*i = (struct iov_iter){
 		.iter_type = ITER_PIPE,
-		.data_source = false,
+		.data_source = ITER_DEST,
 		.pipe = pipe,
 		.head = pipe->head,
 		.start_head = pipe->head,
@@ -1057,10 +1057,10 @@ EXPORT_SYMBOL(iov_iter_pipe);
  * from evaporation, either by taking a ref on them or locking them by the
  * caller.
  */
-void iov_iter_xarray(struct iov_iter *i, unsigned int direction,
+void iov_iter_xarray(struct iov_iter *i, enum iter_dir direction,
 		     struct xarray *xarray, loff_t start, size_t count)
 {
-	BUG_ON(direction & ~1);
+	WARN_ON(!iov_iter_dir_valid(direction));
 	*i = (struct iov_iter) {
 		.iter_type = ITER_XARRAY,
 		.data_source = direction,
@@ -1079,14 +1079,14 @@ EXPORT_SYMBOL(iov_iter_xarray);
  * @count: The size of the I/O buffer in bytes.
  *
  * Set up an I/O iterator that just discards everything that's written to it.
- * It's only available as a READ iterator.
+ * It's only available as a destination iterator.
  */
-void iov_iter_discard(struct iov_iter *i, unsigned int direction, size_t count)
+void iov_iter_discard(struct iov_iter *i, enum iter_dir direction, size_t count)
 {
-	BUG_ON(direction != READ);
+	BUG_ON(direction != ITER_DEST);
 	*i = (struct iov_iter){
 		.iter_type = ITER_DISCARD,
-		.data_source = false,
+		.data_source = ITER_DEST,
 		.count = count,
 		.iov_offset = 0
 	};
@@ -1447,7 +1447,7 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
 		unsigned long addr;
 		int res;
 
-		if (iov_iter_rw(i) != WRITE)
+		if (iov_iter_is_dest(i))
 			gup_flags |= FOLL_WRITE;
 		if (i->nofault)
 			gup_flags |= FOLL_NOFAULT;
@@ -1784,7 +1784,7 @@ struct iovec *iovec_from_user(const struct iovec __user *uvec,
 	return iov;
 }
 
-ssize_t __import_iovec(int type, const struct iovec __user *uvec,
+ssize_t __import_iovec(enum iter_dir direction, const struct iovec __user *uvec,
 		 unsigned nr_segs, unsigned fast_segs, struct iovec **iovp,
 		 struct iov_iter *i, bool compat)
 {
@@ -1823,7 +1823,7 @@ ssize_t __import_iovec(int type, const struct iovec __user *uvec,
 		total_len += len;
 	}
 
-	iov_iter_init(i, type, iov, nr_segs, total_len);
+	iov_iter_init(i, direction, iov, nr_segs, total_len);
 	if (iov == *iovp)
 		*iovp = NULL;
 	else
@@ -1836,7 +1836,7 @@ ssize_t __import_iovec(int type, const struct iovec __user *uvec,
  *     into the kernel, check that it is valid, and initialize a new
  *     &struct iov_iter iterator to access it.
  *
- * @type: One of %READ or %WRITE.
+ * @direction: One of %ITER_SOURCE or %ITER_DEST.
  * @uvec: Pointer to the userspace array.
  * @nr_segs: Number of elements in userspace array.
  * @fast_segs: Number of elements in @iov.
@@ -1853,16 +1853,16 @@ ssize_t __import_iovec(int type, const struct iovec __user *uvec,
  *
  * Return: Negative error code on error, bytes imported on success
  */
-ssize_t import_iovec(int type, const struct iovec __user *uvec,
+ssize_t import_iovec(enum iter_dir direction, const struct iovec __user *uvec,
 		 unsigned nr_segs, unsigned fast_segs,
 		 struct iovec **iovp, struct iov_iter *i)
 {
-	return __import_iovec(type, uvec, nr_segs, fast_segs, iovp, i,
+	return __import_iovec(direction, uvec, nr_segs, fast_segs, iovp, i,
 			      in_compat_syscall());
 }
 EXPORT_SYMBOL(import_iovec);
 
-int import_single_range(int rw, void __user *buf, size_t len,
+int import_single_range(enum iter_dir direction, void __user *buf, size_t len,
 		 struct iovec *iov, struct iov_iter *i)
 {
 	if (len > MAX_RW_COUNT)
@@ -1872,7 +1872,7 @@ int import_single_range(int rw, void __user *buf, size_t len,
 
 	iov->iov_base = buf;
 	iov->iov_len = len;
-	iov_iter_init(i, rw, iov, 1, len);
+	iov_iter_init(i, direction, iov, 1, len);
 	return 0;
 }
 EXPORT_SYMBOL(import_single_range);



^ permalink raw reply related

* [PATCH v4 3/7] iov_iter: Use IOCB/IOMAP_WRITE if available rather than iterator direction
From: David Howells @ 2023-01-07  0:33 UTC (permalink / raw)
  To: Al Viro
  Cc: dhowells, Christoph Hellwig, Matthew Wilcox, Jens Axboe,
	Jeff Layton, Logan Gunthorpe, linux-fsdevel, linux-block,
	linux-kernel
In-Reply-To: <167305160937.1521586.133299343565358971.stgit@warthog.procyon.org.uk>

If a kiocb or iomap_iter is available, then use the IOCB_WRITE flag or the
IOMAP_WRITE flag to determine whether we're writing rather than the
iterator direction flag.

This allows all but three of the users of iov_iter_rw() to be got rid of: a
consistency check and a warning statement in cifs and one user in the block
layer that has neither available.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Al Viro <viro@zeniv.linux.org.uk>
---

 block/fops.c         |    8 ++++----
 fs/9p/vfs_addr.c     |    2 +-
 fs/affs/file.c       |    4 ++--
 fs/ceph/file.c       |    2 +-
 fs/dax.c             |    6 +++---
 fs/direct-io.c       |   22 +++++++++++-----------
 fs/exfat/inode.c     |    6 +++---
 fs/ext2/inode.c      |    2 +-
 fs/f2fs/file.c       |   10 +++++-----
 fs/fat/inode.c       |    4 ++--
 fs/fuse/dax.c        |    2 +-
 fs/fuse/file.c       |    8 ++++----
 fs/hfs/inode.c       |    2 +-
 fs/hfsplus/inode.c   |    2 +-
 fs/iomap/direct-io.c |    6 +++---
 fs/jfs/inode.c       |    2 +-
 fs/nfs/direct.c      |    2 +-
 fs/nilfs2/inode.c    |    2 +-
 fs/ntfs3/inode.c     |    2 +-
 fs/ocfs2/aops.c      |    2 +-
 fs/orangefs/inode.c  |    2 +-
 fs/reiserfs/inode.c  |    2 +-
 fs/udf/inode.c       |    2 +-
 23 files changed, 51 insertions(+), 51 deletions(-)

diff --git a/block/fops.c b/block/fops.c
index 50d245e8c913..29c6de67c39e 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -73,7 +73,7 @@ static ssize_t __blkdev_direct_IO_simple(struct kiocb *iocb,
 			return -ENOMEM;
 	}
 
-	if (iov_iter_rw(iter) == READ) {
+	if (!(iocb->ki_flags & IOCB_WRITE)) {
 		bio_init(&bio, bdev, vecs, nr_pages, REQ_OP_READ);
 		if (user_backed_iter(iter))
 			should_dirty = true;
@@ -88,7 +88,7 @@ static ssize_t __blkdev_direct_IO_simple(struct kiocb *iocb,
 		goto out;
 	ret = bio.bi_iter.bi_size;
 
-	if (iov_iter_rw(iter) == WRITE)
+	if (iocb->ki_flags & IOCB_WRITE)
 		task_io_account_write(ret);
 
 	if (iocb->ki_flags & IOCB_NOWAIT)
@@ -174,7 +174,7 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
 	struct blk_plug plug;
 	struct blkdev_dio *dio;
 	struct bio *bio;
-	bool is_read = (iov_iter_rw(iter) == READ), is_sync;
+	bool is_read = !(iocb->ki_flags & IOCB_WRITE), is_sync;
 	blk_opf_t opf = is_read ? REQ_OP_READ : dio_bio_write_op(iocb);
 	loff_t pos = iocb->ki_pos;
 	int ret = 0;
@@ -296,7 +296,7 @@ static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb,
 					unsigned int nr_pages)
 {
 	struct block_device *bdev = iocb->ki_filp->private_data;
-	bool is_read = iov_iter_rw(iter) == READ;
+	bool is_read = !(iocb->ki_flags & IOCB_WRITE);
 	blk_opf_t opf = is_read ? REQ_OP_READ : dio_bio_write_op(iocb);
 	struct blkdev_dio *dio;
 	struct bio *bio;
diff --git a/fs/9p/vfs_addr.c b/fs/9p/vfs_addr.c
index 97599edbc300..383d62fc3e18 100644
--- a/fs/9p/vfs_addr.c
+++ b/fs/9p/vfs_addr.c
@@ -254,7 +254,7 @@ v9fs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 	ssize_t n;
 	int err = 0;
 
-	if (iov_iter_rw(iter) == WRITE) {
+	if (iocb->ki_flags & IOCB_WRITE) {
 		n = p9_client_write(file->private_data, pos, iter, &err);
 		if (n) {
 			struct inode *inode = file_inode(file);
diff --git a/fs/affs/file.c b/fs/affs/file.c
index cefa222f7881..1c0e80a8aab9 100644
--- a/fs/affs/file.c
+++ b/fs/affs/file.c
@@ -400,7 +400,7 @@ affs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 	loff_t offset = iocb->ki_pos;
 	ssize_t ret;
 
-	if (iov_iter_rw(iter) == WRITE) {
+	if (iocb->ki_flags & IOCB_WRITE) {
 		loff_t size = offset + count;
 
 		if (AFFS_I(inode)->mmu_private < size)
@@ -408,7 +408,7 @@ affs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 	}
 
 	ret = blockdev_direct_IO(iocb, inode, iter, affs_get_block);
-	if (ret < 0 && iov_iter_rw(iter) == WRITE)
+	if (ret < 0 && (iocb->ki_flags & IOCB_WRITE))
 		affs_write_failed(mapping, offset + count);
 	return ret;
 }
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 764598e1efd9..8bdc5b52c271 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -1284,7 +1284,7 @@ ceph_direct_read_write(struct kiocb *iocb, struct iov_iter *iter,
 	struct timespec64 mtime = current_time(inode);
 	size_t count = iov_iter_count(iter);
 	loff_t pos = iocb->ki_pos;
-	bool write = iov_iter_rw(iter) == WRITE;
+	bool write = iocb->ki_flags & IOCB_WRITE;
 	bool should_dirty = !write && user_backed_iter(iter);
 
 	if (write && ceph_snap(file_inode(file)) != CEPH_NOSNAP)
diff --git a/fs/dax.c b/fs/dax.c
index c48a3a93ab29..7f4c3789907b 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1405,7 +1405,7 @@ static loff_t dax_iomap_iter(const struct iomap_iter *iomi,
 	loff_t pos = iomi->pos;
 	struct dax_device *dax_dev = iomap->dax_dev;
 	loff_t end = pos + length, done = 0;
-	bool write = iov_iter_rw(iter) == WRITE;
+	bool write = iomi->flags & IOMAP_WRITE;
 	bool cow = write && iomap->flags & IOMAP_F_SHARED;
 	ssize_t ret = 0;
 	size_t xfer;
@@ -1455,7 +1455,7 @@ static loff_t dax_iomap_iter(const struct iomap_iter *iomi,
 
 		map_len = dax_direct_access(dax_dev, pgoff, PHYS_PFN(size),
 				DAX_ACCESS, &kaddr, NULL);
-		if (map_len == -EIO && iov_iter_rw(iter) == WRITE) {
+		if (map_len == -EIO && write) {
 			map_len = dax_direct_access(dax_dev, pgoff,
 					PHYS_PFN(size), DAX_RECOVERY_WRITE,
 					&kaddr, NULL);
@@ -1530,7 +1530,7 @@ dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
 	if (!iomi.len)
 		return 0;
 
-	if (iov_iter_rw(iter) == WRITE) {
+	if (iocb->ki_flags & IOCB_WRITE) {
 		lockdep_assert_held_write(&iomi.inode->i_rwsem);
 		iomi.flags |= IOMAP_WRITE;
 	} else {
diff --git a/fs/direct-io.c b/fs/direct-io.c
index 03d381377ae1..e2d5c757a27a 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -1143,7 +1143,7 @@ ssize_t __blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
 	 */
 
 	/* watch out for a 0 len io from a tricksy fs */
-	if (iov_iter_rw(iter) == READ && !count)
+	if (!(iocb->ki_flags & IOCB_WRITE) && !count)
 		return 0;
 
 	dio = kmem_cache_alloc(dio_cache, GFP_KERNEL);
@@ -1157,14 +1157,14 @@ ssize_t __blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
 	memset(dio, 0, offsetof(struct dio, pages));
 
 	dio->flags = flags;
-	if (dio->flags & DIO_LOCKING && iov_iter_rw(iter) == READ) {
+	if (dio->flags & DIO_LOCKING && !(iocb->ki_flags & IOCB_WRITE)) {
 		/* will be released by direct_io_worker */
 		inode_lock(inode);
 	}
 
 	/* Once we sampled i_size check for reads beyond EOF */
 	dio->i_size = i_size_read(inode);
-	if (iov_iter_rw(iter) == READ && offset >= dio->i_size) {
+	if (!(iocb->ki_flags & IOCB_WRITE) && offset >= dio->i_size) {
 		retval = 0;
 		goto fail_dio;
 	}
@@ -1177,7 +1177,7 @@ ssize_t __blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
 			goto fail_dio;
 	}
 
-	if (dio->flags & DIO_LOCKING && iov_iter_rw(iter) == READ) {
+	if (dio->flags & DIO_LOCKING && !(iocb->ki_flags & IOCB_WRITE)) {
 		struct address_space *mapping = iocb->ki_filp->f_mapping;
 
 		retval = filemap_write_and_wait_range(mapping, offset, end - 1);
@@ -1193,13 +1193,13 @@ ssize_t __blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
 	 */
 	if (is_sync_kiocb(iocb))
 		dio->is_async = false;
-	else if (iov_iter_rw(iter) == WRITE && end > i_size_read(inode))
+	else if ((iocb->ki_flags & IOCB_WRITE) && end > i_size_read(inode))
 		dio->is_async = false;
 	else
 		dio->is_async = true;
 
 	dio->inode = inode;
-	if (iov_iter_rw(iter) == WRITE) {
+	if (iocb->ki_flags & IOCB_WRITE) {
 		dio->opf = REQ_OP_WRITE | REQ_SYNC | REQ_IDLE;
 		if (iocb->ki_flags & IOCB_NOWAIT)
 			dio->opf |= REQ_NOWAIT;
@@ -1211,7 +1211,7 @@ ssize_t __blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
 	 * For AIO O_(D)SYNC writes we need to defer completions to a workqueue
 	 * so that we can call ->fsync.
 	 */
-	if (dio->is_async && iov_iter_rw(iter) == WRITE) {
+	if (dio->is_async && (iocb->ki_flags & IOCB_WRITE)) {
 		retval = 0;
 		if (iocb_is_dsync(iocb))
 			retval = dio_set_defer_completion(dio);
@@ -1248,7 +1248,7 @@ ssize_t __blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
 	spin_lock_init(&dio->bio_lock);
 	dio->refcount = 1;
 
-	dio->should_dirty = user_backed_iter(iter) && iov_iter_rw(iter) == READ;
+	dio->should_dirty = user_backed_iter(iter) && !(iocb->ki_flags & IOCB_WRITE);
 	sdio.iter = iter;
 	sdio.final_block_in_request = end >> blkbits;
 
@@ -1305,7 +1305,7 @@ ssize_t __blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
 	 * we can let i_mutex go now that its achieved its purpose
 	 * of protecting us from looking up uninitialized blocks.
 	 */
-	if (iov_iter_rw(iter) == READ && (dio->flags & DIO_LOCKING))
+	if (!(iocb->ki_flags & IOCB_WRITE) && (dio->flags & DIO_LOCKING))
 		inode_unlock(dio->inode);
 
 	/*
@@ -1317,7 +1317,7 @@ ssize_t __blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
 	 */
 	BUG_ON(retval == -EIOCBQUEUED);
 	if (dio->is_async && retval == 0 && dio->result &&
-	    (iov_iter_rw(iter) == READ || dio->result == count))
+	    (!(iocb->ki_flags & IOCB_WRITE) || dio->result == count))
 		retval = -EIOCBQUEUED;
 	else
 		dio_await_completion(dio);
@@ -1330,7 +1330,7 @@ ssize_t __blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
 	return retval;
 
 fail_dio:
-	if (dio->flags & DIO_LOCKING && iov_iter_rw(iter) == READ)
+	if (dio->flags & DIO_LOCKING && !(iocb->ki_flags & IOCB_WRITE))
 		inode_unlock(inode);
 
 	kmem_cache_free(dio_cache, dio);
diff --git a/fs/exfat/inode.c b/fs/exfat/inode.c
index 5b644cb057fa..26c2cff71878 100644
--- a/fs/exfat/inode.c
+++ b/fs/exfat/inode.c
@@ -412,10 +412,10 @@ static ssize_t exfat_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 	struct address_space *mapping = iocb->ki_filp->f_mapping;
 	struct inode *inode = mapping->host;
 	loff_t size = iocb->ki_pos + iov_iter_count(iter);
-	int rw = iov_iter_rw(iter);
+	bool writing = iocb->ki_flags & IOCB_WRITE;
 	ssize_t ret;
 
-	if (rw == WRITE) {
+	if (writing) {
 		/*
 		 * FIXME: blockdev_direct_IO() doesn't use ->write_begin(),
 		 * so we need to update the ->i_size_aligned to block boundary.
@@ -434,7 +434,7 @@ static ssize_t exfat_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 	 * condition of exfat_get_block() and ->truncate().
 	 */
 	ret = blockdev_direct_IO(iocb, inode, iter, exfat_get_block);
-	if (ret < 0 && (rw & WRITE))
+	if (ret < 0 && writing)
 		exfat_write_failed(mapping, size);
 	return ret;
 }
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 69aed9e2359e..9ed588d70722 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -919,7 +919,7 @@ ext2_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 	ssize_t ret;
 
 	ret = blockdev_direct_IO(iocb, inode, iter, ext2_get_block);
-	if (ret < 0 && iov_iter_rw(iter) == WRITE)
+	if (ret < 0 && (iocb->ki_flags & IOCB_WRITE))
 		ext2_write_failed(mapping, offset + count);
 	return ret;
 }
diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
index ecbc8c135b49..7a7cfa39b327 100644
--- a/fs/f2fs/file.c
+++ b/fs/f2fs/file.c
@@ -809,7 +809,7 @@ int f2fs_truncate(struct inode *inode)
 	return 0;
 }
 
-static bool f2fs_force_buffered_io(struct inode *inode, int rw)
+static bool f2fs_force_buffered_io(struct inode *inode, bool writing)
 {
 	struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
 
@@ -827,9 +827,9 @@ static bool f2fs_force_buffered_io(struct inode *inode, int rw)
 	 * for blkzoned device, fallback direct IO to buffered IO, so
 	 * all IOs can be serialized by log-structured write.
 	 */
-	if (f2fs_sb_has_blkzoned(sbi) && (rw == WRITE))
+	if (f2fs_sb_has_blkzoned(sbi) && writing)
 		return true;
-	if (f2fs_lfs_mode(sbi) && rw == WRITE && F2FS_IO_ALIGNED(sbi))
+	if (f2fs_lfs_mode(sbi) && writing && F2FS_IO_ALIGNED(sbi))
 		return true;
 	if (is_sbi_flag_set(sbi, SBI_CP_DISABLED))
 		return true;
@@ -865,7 +865,7 @@ int f2fs_getattr(struct user_namespace *mnt_userns, const struct path *path,
 		unsigned int bsize = i_blocksize(inode);
 
 		stat->result_mask |= STATX_DIOALIGN;
-		if (!f2fs_force_buffered_io(inode, WRITE)) {
+		if (!f2fs_force_buffered_io(inode, true)) {
 			stat->dio_mem_align = bsize;
 			stat->dio_offset_align = bsize;
 		}
@@ -4254,7 +4254,7 @@ static bool f2fs_should_use_dio(struct inode *inode, struct kiocb *iocb,
 	if (!(iocb->ki_flags & IOCB_DIRECT))
 		return false;
 
-	if (f2fs_force_buffered_io(inode, iov_iter_rw(iter)))
+	if (f2fs_force_buffered_io(inode, iocb->ki_flags & IOCB_WRITE))
 		return false;
 
 	/*
diff --git a/fs/fat/inode.c b/fs/fat/inode.c
index d99b8549ec8f..d7ffc30ce0e5 100644
--- a/fs/fat/inode.c
+++ b/fs/fat/inode.c
@@ -261,7 +261,7 @@ static ssize_t fat_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 	loff_t offset = iocb->ki_pos;
 	ssize_t ret;
 
-	if (iov_iter_rw(iter) == WRITE) {
+	if (iocb->ki_flags & IOCB_WRITE) {
 		/*
 		 * FIXME: blockdev_direct_IO() doesn't use ->write_begin(),
 		 * so we need to update the ->mmu_private to block boundary.
@@ -281,7 +281,7 @@ static ssize_t fat_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 	 * condition of fat_get_block() and ->truncate().
 	 */
 	ret = blockdev_direct_IO(iocb, inode, iter, fat_get_block);
-	if (ret < 0 && iov_iter_rw(iter) == WRITE)
+	if (ret < 0 && (iocb->ki_flags & IOCB_WRITE))
 		fat_write_failed(mapping, offset + count);
 
 	return ret;
diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c
index e23e802a8013..fddd0b8de27e 100644
--- a/fs/fuse/dax.c
+++ b/fs/fuse/dax.c
@@ -720,7 +720,7 @@ static bool file_extending_write(struct kiocb *iocb, struct iov_iter *from)
 {
 	struct inode *inode = file_inode(iocb->ki_filp);
 
-	return (iov_iter_rw(from) == WRITE &&
+	return ((iocb->ki_flags & IOCB_WRITE) &&
 		((iocb->ki_pos) >= i_size_read(inode) ||
 		  (iocb->ki_pos + iov_iter_count(from) > i_size_read(inode))));
 }
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 875314ee6f59..9575c4ca0667 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -2897,7 +2897,7 @@ fuse_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 	inode = file->f_mapping->host;
 	i_size = i_size_read(inode);
 
-	if ((iov_iter_rw(iter) == READ) && (offset >= i_size))
+	if (!(iocb->ki_flags & IOCB_WRITE) && (offset >= i_size))
 		return 0;
 
 	io = kmalloc(sizeof(struct fuse_io_priv), GFP_KERNEL);
@@ -2909,7 +2909,7 @@ fuse_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 	io->bytes = -1;
 	io->size = 0;
 	io->offset = offset;
-	io->write = (iov_iter_rw(iter) == WRITE);
+	io->write = (iocb->ki_flags & IOCB_WRITE);
 	io->err = 0;
 	/*
 	 * By default, we want to optimize all I/Os with async request
@@ -2942,7 +2942,7 @@ fuse_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 		io->done = &wait;
 	}
 
-	if (iov_iter_rw(iter) == WRITE) {
+	if ((iocb->ki_flags & IOCB_WRITE)) {
 		ret = fuse_direct_io(io, iter, &pos, FUSE_DIO_WRITE);
 		fuse_invalidate_attr_mask(inode, FUSE_STATX_MODSIZE);
 	} else {
@@ -2965,7 +2965,7 @@ fuse_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 
 	kref_put(&io->refcnt, fuse_io_release);
 
-	if (iov_iter_rw(iter) == WRITE) {
+	if ((iocb->ki_flags & IOCB_WRITE)) {
 		fuse_write_update_attr(inode, pos, ret);
 		/* For extending writes we already hold exclusive lock */
 		if (ret < 0 && offset + count > i_size)
diff --git a/fs/hfs/inode.c b/fs/hfs/inode.c
index 9c329a365e75..638c87afd96f 100644
--- a/fs/hfs/inode.c
+++ b/fs/hfs/inode.c
@@ -141,7 +141,7 @@ static ssize_t hfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 	 * In case of error extending write may have instantiated a few
 	 * blocks outside i_size. Trim these off again.
 	 */
-	if (unlikely(iov_iter_rw(iter) == WRITE && ret < 0)) {
+	if (unlikely((iocb->ki_flags & IOCB_WRITE) && ret < 0)) {
 		loff_t isize = i_size_read(inode);
 		loff_t end = iocb->ki_pos + count;
 
diff --git a/fs/hfsplus/inode.c b/fs/hfsplus/inode.c
index 840577a0c1e7..843e6f1ced25 100644
--- a/fs/hfsplus/inode.c
+++ b/fs/hfsplus/inode.c
@@ -138,7 +138,7 @@ static ssize_t hfsplus_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 	 * In case of error extending write may have instantiated a few
 	 * blocks outside i_size. Trim these off again.
 	 */
-	if (unlikely(iov_iter_rw(iter) == WRITE && ret < 0)) {
+	if (unlikely((iocb->ki_flags & IOCB_WRITE) && ret < 0)) {
 		loff_t isize = i_size_read(inode);
 		loff_t end = iocb->ki_pos + count;
 
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 9804714b1751..d045b0c54d04 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -519,7 +519,7 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 	dio->submit.waiter = current;
 	dio->submit.poll_bio = NULL;
 
-	if (iov_iter_rw(iter) == READ) {
+	if (!(iocb->ki_flags & IOCB_WRITE)) {
 		if (iomi.pos >= dio->i_size)
 			goto out_free_dio;
 
@@ -573,7 +573,7 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 	if (ret)
 		goto out_free_dio;
 
-	if (iov_iter_rw(iter) == WRITE) {
+	if (iomi.flags & IOMAP_WRITE) {
 		/*
 		 * Try to invalidate cache pages for the range we are writing.
 		 * If this invalidation fails, let the caller fall back to
@@ -613,7 +613,7 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 	 * Revert iter to a state corresponding to that as some callers (such
 	 * as the splice code) rely on it.
 	 */
-	if (iov_iter_rw(iter) == READ && iomi.pos >= dio->i_size)
+	if (!(iomi.flags & IOMAP_WRITE) && iomi.pos >= dio->i_size)
 		iov_iter_revert(iter, iomi.pos - dio->i_size);
 
 	if (ret == -EFAULT && dio->size && (dio_flags & IOMAP_DIO_PARTIAL)) {
diff --git a/fs/jfs/inode.c b/fs/jfs/inode.c
index 8ac10e396050..f403d2f2bfe6 100644
--- a/fs/jfs/inode.c
+++ b/fs/jfs/inode.c
@@ -334,7 +334,7 @@ static ssize_t jfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 	 * In case of error extending write may have instantiated a few
 	 * blocks outside i_size. Trim these off again.
 	 */
-	if (unlikely(iov_iter_rw(iter) == WRITE && ret < 0)) {
+	if (unlikely((iocb->ki_flags & IOCB_WRITE) && ret < 0)) {
 		loff_t isize = i_size_read(inode);
 		loff_t end = iocb->ki_pos + count;
 
diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index 1707f46b1335..e94ce42f93a8 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -133,7 +133,7 @@ int nfs_swap_rw(struct kiocb *iocb, struct iov_iter *iter)
 
 	VM_BUG_ON(iov_iter_count(iter) != PAGE_SIZE);
 
-	if (iov_iter_rw(iter) == READ)
+	if (!(iocb->ki_flags & IOCB_WRITE))
 		ret = nfs_file_direct_read(iocb, iter, true);
 	else
 		ret = nfs_file_direct_write(iocb, iter, true);
diff --git a/fs/nilfs2/inode.c b/fs/nilfs2/inode.c
index 232dd7b6cca1..59df5707d30f 100644
--- a/fs/nilfs2/inode.c
+++ b/fs/nilfs2/inode.c
@@ -289,7 +289,7 @@ nilfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 {
 	struct inode *inode = file_inode(iocb->ki_filp);
 
-	if (iov_iter_rw(iter) == WRITE)
+	if (iocb->ki_flags & IOCB_WRITE)
 		return 0;
 
 	/* Needs synchronization with the cleaner */
diff --git a/fs/ntfs3/inode.c b/fs/ntfs3/inode.c
index 20b953871574..f0881d0522f8 100644
--- a/fs/ntfs3/inode.c
+++ b/fs/ntfs3/inode.c
@@ -761,7 +761,7 @@ static ssize_t ntfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 	struct ntfs_inode *ni = ntfs_i(inode);
 	loff_t vbo = iocb->ki_pos;
 	loff_t end;
-	int wr = iov_iter_rw(iter) & WRITE;
+	bool wr = iocb->ki_flags & IOCB_WRITE;
 	size_t iter_count = iov_iter_count(iter);
 	loff_t valid;
 	ssize_t ret;
diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
index 1d65f6ef00ca..3f41c6b403c2 100644
--- a/fs/ocfs2/aops.c
+++ b/fs/ocfs2/aops.c
@@ -2441,7 +2441,7 @@ static ssize_t ocfs2_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 	    !ocfs2_supports_append_dio(osb))
 		return 0;
 
-	if (iov_iter_rw(iter) == READ)
+	if (!(iocb->ki_flags & IOCB_WRITE))
 		get_block = ocfs2_lock_get_block;
 	else
 		get_block = ocfs2_dio_wr_get_block;
diff --git a/fs/orangefs/inode.c b/fs/orangefs/inode.c
index 4df560894386..fbecca379e91 100644
--- a/fs/orangefs/inode.c
+++ b/fs/orangefs/inode.c
@@ -521,7 +521,7 @@ static ssize_t orangefs_direct_IO(struct kiocb *iocb,
 	 */
 	struct file *file = iocb->ki_filp;
 	loff_t pos = iocb->ki_pos;
-	enum ORANGEFS_io_type type = iov_iter_rw(iter) == WRITE ?
+	enum ORANGEFS_io_type type = (iocb->ki_flags & IOCB_WRITE) ?
             ORANGEFS_IO_WRITE : ORANGEFS_IO_READ;
 	loff_t *offset = &pos;
 	struct inode *inode = file->f_mapping->host;
diff --git a/fs/reiserfs/inode.c b/fs/reiserfs/inode.c
index c7d1fa526dea..1fc94fd5c371 100644
--- a/fs/reiserfs/inode.c
+++ b/fs/reiserfs/inode.c
@@ -3249,7 +3249,7 @@ static ssize_t reiserfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 	 * In case of error extending write may have instantiated a few
 	 * blocks outside i_size. Trim these off again.
 	 */
-	if (unlikely(iov_iter_rw(iter) == WRITE && ret < 0)) {
+	if (unlikely((iocb->ki_flags & IOCB_WRITE) && ret < 0)) {
 		loff_t isize = i_size_read(inode);
 		loff_t end = iocb->ki_pos + count;
 
diff --git a/fs/udf/inode.c b/fs/udf/inode.c
index 1d7c2a812fc1..6d2ce0e512f4 100644
--- a/fs/udf/inode.c
+++ b/fs/udf/inode.c
@@ -219,7 +219,7 @@ static ssize_t udf_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 	ssize_t ret;
 
 	ret = blockdev_direct_IO(iocb, inode, iter, udf_get_block);
-	if (unlikely(ret < 0 && iov_iter_rw(iter) == WRITE))
+	if (unlikely(ret < 0 && (iocb->ki_flags & IOCB_WRITE)))
 		udf_write_failed(mapping, iocb->ki_pos + count);
 	return ret;
 }



^ permalink raw reply related

* [PATCH v4 6/7] netfs: Add a function to extract an iterator into a scatterlist
From: David Howells @ 2023-01-07  0:34 UTC (permalink / raw)
  To: Al Viro
  Cc: Jeff Layton, Steve French, Shyam Prasad N, Rohith Surabattula,
	linux-cachefs, linux-cifs, linux-fsdevel, dhowells,
	Christoph Hellwig, Matthew Wilcox, Jens Axboe, Jeff Layton,
	Logan Gunthorpe, linux-fsdevel, linux-block, linux-kernel
In-Reply-To: <167305160937.1521586.133299343565358971.stgit@warthog.procyon.org.uk>

Provide a function for filling in a scatterlist from the list of pages
contained in an iterator.

If the iterator is UBUF- or IOBUF-type, the pages have a ref (ITER_SOURCE,
ie. WRITE) or a pin (ITER_DEST, ie. READ) taken on them.

If the iterator is BVEC-, KVEC- or XARRAY-type, no ref is taken on the
pages and it is left to the caller to manage their lifetime.  It cannot be
assumed that a ref can be validly taken, particularly in the case of a KVEC
iterator.

Changes:
========
ver #3)
 - Switch to using EXPORT_SYMBOL_GPL to prevent indirect 3rd-party access
   to get/pin_user_pages_fast()[1].

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: Steve French <sfrench@samba.org>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: linux-cachefs@redhat.com
cc: linux-cifs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org

Link: https://lore.kernel.org/r/Y3zFzdWnWlEJ8X8/@infradead.org/ [1]
Link: https://lore.kernel.org/r/166697255985.61150.16489950598033809487.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166732027275.3186319.5186488812166611598.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166869691313.3723671.10714823767342163891.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166920905749.1461876.12079195122363691498.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/166997423514.9475.11145024341505464337.stgit@warthog.procyon.org.uk/ # v3
---

 fs/netfs/iterator.c   |  268 +++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/netfs.h |    4 +
 mm/vmalloc.c          |    1 
 3 files changed, 273 insertions(+)

diff --git a/fs/netfs/iterator.c b/fs/netfs/iterator.c
index 7d802d21b9c5..82c46e3caf1d 100644
--- a/fs/netfs/iterator.c
+++ b/fs/netfs/iterator.c
@@ -7,7 +7,9 @@
 
 #include <linux/export.h>
 #include <linux/slab.h>
+#include <linux/mm.h>
 #include <linux/uio.h>
+#include <linux/scatterlist.h>
 #include <linux/netfs.h>
 #include "internal.h"
 
@@ -97,3 +99,269 @@ ssize_t netfs_extract_user_iter(struct iov_iter *orig, size_t orig_len,
 	return npages;
 }
 EXPORT_SYMBOL_GPL(netfs_extract_user_iter);
+
+/*
+ * Extract as list of up to sg_max pages from UBUF- or IOVEC-class iterators,
+ * pin or get refs on them appropriate and add them to the scatterlist.
+ */
+static ssize_t netfs_extract_user_to_sg(struct iov_iter *iter,
+					ssize_t maxsize,
+					struct sg_table *sgtable,
+					unsigned int sg_max,
+					unsigned int *cleanup_mode)
+{
+	struct scatterlist *sg = sgtable->sgl + sgtable->nents;
+	struct page **pages;
+	unsigned int npages;
+	ssize_t ret = 0, res;
+	size_t len, off;
+
+	*cleanup_mode = 0;
+
+	/* We decant the page list into the tail of the scatterlist */
+	pages = (void *)sgtable->sgl + array_size(sg_max, sizeof(struct scatterlist));
+	pages -= sg_max;
+
+	do {
+		res = iov_iter_extract_pages(iter, &pages, maxsize, sg_max, 0,
+					     &off, cleanup_mode);
+		if (res < 0)
+			goto failed;
+
+		len = res;
+		maxsize -= len;
+		ret += len;
+		npages = DIV_ROUND_UP(off + len, PAGE_SIZE);
+		sg_max -= npages;
+
+		for (; npages < 0; npages--) {
+			struct page *page = *pages;
+			size_t seg = min_t(size_t, PAGE_SIZE - off, len);
+
+			*pages++ = NULL;
+			sg_set_page(sg, page, len, off);
+			sgtable->nents++;
+			sg++;
+			len -= seg;
+			off = 0;
+		}
+	} while (maxsize > 0 && sg_max > 0);
+
+	return ret;
+
+failed:
+	while (sgtable->nents > sgtable->orig_nents)
+		put_page(sg_page(&sgtable->sgl[--sgtable->nents]));
+	return res;
+}
+
+/*
+ * Extract up to sg_max pages from a BVEC-type iterator and add them to the
+ * scatterlist.  The pages are not pinned.
+ */
+static ssize_t netfs_extract_bvec_to_sg(struct iov_iter *iter,
+					ssize_t maxsize,
+					struct sg_table *sgtable,
+					unsigned int sg_max,
+					unsigned int *cleanup_mode)
+{
+	const struct bio_vec *bv = iter->bvec;
+	struct scatterlist *sg = sgtable->sgl + sgtable->nents;
+	unsigned long start = iter->iov_offset;
+	unsigned int i;
+	ssize_t ret = 0;
+
+	for (i = 0; i < iter->nr_segs; i++) {
+		size_t off, len;
+
+		len = bv[i].bv_len;
+		if (start >= len) {
+			start -= len;
+			continue;
+		}
+
+		len = min_t(size_t, maxsize, len - start);
+		off = bv[i].bv_offset + start;
+
+		sg_set_page(sg, bv[i].bv_page, len, off);
+		sgtable->nents++;
+		sg++;
+		sg_max--;
+
+		ret += len;
+		maxsize -= len;
+		if (maxsize <= 0 || sg_max == 0)
+			break;
+		start = 0;
+	}
+
+	if (ret > 0)
+		iov_iter_advance(iter, ret);
+	*cleanup_mode = 0;
+	return ret;
+}
+
+/*
+ * Extract up to sg_max pages from a KVEC-type iterator and add them to the
+ * scatterlist.  This can deal with vmalloc'd buffers as well as kmalloc'd or
+ * static buffers.  The pages are not pinned.
+ */
+static ssize_t netfs_extract_kvec_to_sg(struct iov_iter *iter,
+					ssize_t maxsize,
+					struct sg_table *sgtable,
+					unsigned int sg_max,
+					unsigned int *cleanup_mode)
+{
+	const struct kvec *kv = iter->kvec;
+	struct scatterlist *sg = sgtable->sgl + sgtable->nents;
+	unsigned long start = iter->iov_offset;
+	unsigned int i;
+	ssize_t ret = 0;
+
+	for (i = 0; i < iter->nr_segs; i++) {
+		struct page *page;
+		unsigned long kaddr;
+		size_t off, len, seg;
+
+		len = kv[i].iov_len;
+		if (start >= len) {
+			start -= len;
+			continue;
+		}
+
+		kaddr = (unsigned long)kv[i].iov_base + start;
+		off = kaddr & ~PAGE_MASK;
+		len = min_t(size_t, maxsize, len - start);
+		kaddr &= PAGE_MASK;
+
+		maxsize -= len;
+		ret += len;
+		do {
+			seg = min_t(size_t, len, PAGE_SIZE - off);
+			if (is_vmalloc_or_module_addr((void *)kaddr))
+				page = vmalloc_to_page((void *)kaddr);
+			else
+				page = virt_to_page(kaddr);
+
+			sg_set_page(sg, page, len, off);
+			sgtable->nents++;
+			sg++;
+			sg_max--;
+
+			len -= seg;
+			kaddr += PAGE_SIZE;
+			off = 0;
+		} while (len > 0 && sg_max > 0);
+
+		if (maxsize <= 0 || sg_max == 0)
+			break;
+		start = 0;
+	}
+
+	if (ret > 0)
+		iov_iter_advance(iter, ret);
+	*cleanup_mode = 0;
+	return ret;
+}
+
+/*
+ * Extract up to sg_max folios from an XARRAY-type iterator and add them to
+ * the scatterlist.  The pages are not pinned.
+ */
+static ssize_t netfs_extract_xarray_to_sg(struct iov_iter *iter,
+					  ssize_t maxsize,
+					  struct sg_table *sgtable,
+					  unsigned int sg_max,
+					  unsigned int *cleanup_mode)
+{
+	struct scatterlist *sg = sgtable->sgl + sgtable->nents;
+	struct xarray *xa = iter->xarray;
+	struct folio *folio;
+	loff_t start = iter->xarray_start + iter->iov_offset;
+	pgoff_t index = start / PAGE_SIZE;
+	ssize_t ret = 0;
+	size_t offset, len;
+	XA_STATE(xas, xa, index);
+
+	rcu_read_lock();
+
+	xas_for_each(&xas, folio, ULONG_MAX) {
+		if (xas_retry(&xas, folio))
+			continue;
+		if (WARN_ON(xa_is_value(folio)))
+			break;
+		if (WARN_ON(folio_test_hugetlb(folio)))
+			break;
+
+		offset = offset_in_folio(folio, start);
+		len = min_t(size_t, maxsize, folio_size(folio) - offset);
+
+		sg_set_page(sg, folio_page(folio, 0), len, offset);
+		sgtable->nents++;
+		sg++;
+		sg_max--;
+
+		maxsize -= len;
+		ret += len;
+		if (maxsize <= 0 || sg_max == 0)
+			break;
+	}
+
+	rcu_read_unlock();
+	if (ret > 0)
+		iov_iter_advance(iter, ret);
+	*cleanup_mode = 0;
+	return ret;
+}
+
+/**
+ * netfs_extract_iter_to_sg - Extract pages from an iterator and add ot an sglist
+ * @iter: The iterator to extract from
+ * @maxsize: The amount of iterator to copy
+ * @sgtable: The scatterlist table to fill in
+ * @sg_max: Maximum number of elements in @sgtable that may be filled
+ * @cleanup_mode: Where to return the cleanup mode
+ *
+ * Extract the page fragments from the given amount of the source iterator and
+ * add them to a scatterlist that refers to all of those bits, to a maximum
+ * addition of @sg_max elements.
+ *
+ * The pages referred to by UBUF- and IOVEC-type iterators are extracted and
+ * pinned; BVEC-, KVEC- and XARRAY-type are extracted but aren't pinned; PIPE-
+ * and DISCARD-type are not supported.
+ *
+ * No end mark is placed on the scatterlist; that's left to the caller.
+ *
+ * If successul, @sgtable->nents is updated to include the number of elements
+ * added and the number of bytes added is returned.  @sgtable->orig_nents is
+ * left unaltered.
+ */
+ssize_t netfs_extract_iter_to_sg(struct iov_iter *iter, size_t maxsize,
+				 struct sg_table *sgtable, unsigned int sg_max,
+				 unsigned int *cleanup_mode)
+{
+	if (maxsize == 0)
+		return 0;
+
+	switch (iov_iter_type(iter)) {
+	case ITER_UBUF:
+	case ITER_IOVEC:
+		return netfs_extract_user_to_sg(iter, maxsize, sgtable, sg_max,
+						cleanup_mode);
+	case ITER_BVEC:
+		return netfs_extract_bvec_to_sg(iter, maxsize, sgtable, sg_max,
+						cleanup_mode);
+	case ITER_KVEC:
+		return netfs_extract_kvec_to_sg(iter, maxsize, sgtable, sg_max,
+						cleanup_mode);
+	case ITER_XARRAY:
+		return netfs_extract_xarray_to_sg(iter, maxsize, sgtable, sg_max,
+						  cleanup_mode);
+	default:
+		pr_err("netfs_extract_iter_to_sg(%u) unsupported\n",
+		       iov_iter_type(iter));
+		WARN_ON_ONCE(1);
+		return -EIO;
+	}
+}
+EXPORT_SYMBOL_GPL(netfs_extract_iter_to_sg);
diff --git a/include/linux/netfs.h b/include/linux/netfs.h
index 26fe3e6bafa1..059e49233e29 100644
--- a/include/linux/netfs.h
+++ b/include/linux/netfs.h
@@ -299,6 +299,10 @@ void netfs_stats_show(struct seq_file *);
 ssize_t netfs_extract_user_iter(struct iov_iter *orig, size_t orig_len,
 				struct iov_iter *new,
 				unsigned int *cleanup_mode);
+struct sg_table;
+ssize_t netfs_extract_iter_to_sg(struct iov_iter *iter, size_t len,
+				 struct sg_table *sgtable, unsigned int sg_max,
+				 unsigned int *cleanup_mode);
 
 /**
  * netfs_inode - Get the netfs inode context from the inode
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index ca71de7c9d77..61f5bec0f2b6 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -656,6 +656,7 @@ int is_vmalloc_or_module_addr(const void *x)
 #endif
 	return is_vmalloc_addr(x);
 }
+EXPORT_SYMBOL_GPL(is_vmalloc_or_module_addr);
 
 /*
  * Walk a vmap address to the struct page it maps. Huge vmap mappings will



^ permalink raw reply related

* [PATCH v4 1/7] iov_iter: Change the direction macros into an enum
From: David Howells @ 2023-01-07  0:33 UTC (permalink / raw)
  To: Al Viro
  Cc: dhowells, Christoph Hellwig, Matthew Wilcox, Jens Axboe,
	Jeff Layton, Logan Gunthorpe, linux-fsdevel, linux-block,
	linux-kernel
In-Reply-To: <167305160937.1521586.133299343565358971.stgit@warthog.procyon.org.uk>

Change the ITER_SOURCE and ITER_DEST direction macros into an enum and
provide three new helper functions:

 iov_iter_dir() - returns the iterator direction
 iov_iter_is_dest() - returns true if it's an ITER_DEST iterator
 iov_iter_is_source() - returns true if it's an ITER_SOURCE iterator

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Al Viro <viro@zeniv.linux.org.uk>
---

 include/linux/uio.h |   32 +++++++++++++++++++++++++++-----
 1 file changed, 27 insertions(+), 5 deletions(-)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 9f158238edba..4b0f4a773d90 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -29,8 +29,10 @@ enum iter_type {
 	ITER_UBUF,
 };
 
-#define ITER_SOURCE	1	// == WRITE
-#define ITER_DEST	0	// == READ
+enum iter_dir {
+	ITER_DEST	= 0,	// == READ
+	ITER_SOURCE	= 1,	// == WRITE
+} __mode(byte);
 
 struct iov_iter_state {
 	size_t iov_offset;
@@ -39,9 +41,9 @@ struct iov_iter_state {
 };
 
 struct iov_iter {
-	u8 iter_type;
+	enum iter_type iter_type __mode(byte);
 	bool nofault;
-	bool data_source;
+	enum iter_dir data_source;
 	bool user_backed;
 	union {
 		size_t iov_offset;
@@ -114,9 +116,29 @@ static inline bool iov_iter_is_xarray(const struct iov_iter *i)
 	return iov_iter_type(i) == ITER_XARRAY;
 }
 
+static inline enum iter_dir iov_iter_dir(const struct iov_iter *i)
+{
+	return i->data_source;
+}
+
+static inline bool iov_iter_is_source(const struct iov_iter *i)
+{
+	return iov_iter_dir(i) == ITER_SOURCE; /* ie. WRITE */
+}
+
+static inline bool iov_iter_is_dest(const struct iov_iter *i)
+{
+	return iov_iter_dir(i) == ITER_DEST; /* ie. READ */
+}
+
+static inline bool iov_iter_dir_valid(enum iter_dir direction)
+{
+	return direction == ITER_DEST || direction == ITER_SOURCE;
+}
+
 static inline unsigned char iov_iter_rw(const struct iov_iter *i)
 {
-	return i->data_source ? WRITE : READ;
+	return iov_iter_dir(i) == ITER_SOURCE ? WRITE : READ;
 }
 
 static inline bool user_backed_iter(const struct iov_iter *i)



^ permalink raw reply related

* [PATCH v4 0/7] iov_iter: Add extraction helpers
From: David Howells @ 2023-01-07  0:33 UTC (permalink / raw)
  To: Al Viro
  Cc: linux-fsdevel, linux-cifs, Christoph Hellwig, Steve French,
	linux-mm, Jens Axboe, Logan Gunthorpe, Matthew Wilcox,
	John Hubbard, linux-cachefs, Rohith Surabattula, Shyam Prasad N,
	Jeff Layton, dhowells, Christoph Hellwig, Matthew Wilcox,
	Jens Axboe, Jeff Layton, Logan Gunthorpe, linux-fsdevel,
	linux-block, linux-kernel


Hi Al,

Here are patches clean up some use of READ/WRITE and ITER_SOURCE/DEST,
patches to provide support for extracting pages from an iov_iter and a
patch to use the primary extraction function in the block layer bio code if
you could take a look?

[!] NOTE that I've switched the functions to be exported GPL-only at
    Christoph's request[1].  They are, however, intended as a replacement
    for iov_iter_get_pages*(), which is not marked _GPL - so that
    functionality will probably become unavailable to non-GPL 3rd party
    modules in future.

The first three patches deal with ITER_SOURCE/DEST:

 (1) Switch ITER_SOURCE/DEST to an enum and add a couple of helper
     functions to query if an iterator represents a source or a destination
     buffer.  Using an enum may allow extra consistency warnings from the
     compiler.

 (2) Use the ITER_SOURCE/DEST values in the iov_iter core functions rather
     than READ/WRITE.

 (3) Get rid of most of the callers of iov_iter_rw(), using the IOCB_WRITE
     and IOMAP_WRITE instead where available.  This leaves only two places
     looking at the this value: a consistency check in cifs and a single
     place in the block layer.

The next patch adds a replacement for iov_iter_get_pages*(), including
Logan's new version:

 (4) Add a function to list-only, get or pin pages from an iterator as a
     future replacement for iov_iter_get_pages*().  Pointers to the pages
     are placed into an array (which will get allocated if not provided)
     and, depending on the iterator type and direction, the pages will have
     a ref or a pin get on them or be left untouched (on the assumption
     that the caller manages their lifetime).

     The determination is:

	UBUF/IOVEC + DEST	-> pin
	UBUF/IOVEC + SOURCE	-> get
	PIPE + DEST		-> list-only
	BVEC/XARRAY		-> list-only
	Anything else		-> EFAULT

     The function also returns an indication of which of "list only, get or
     pin" the extraction function did to aid in cleaning up (returning 0,
     FOLL_GET or FOLL_PIN as appropriate).

Then there are a couple of patches that add stuff to netfslib that I want
to use there as well as in cifs:

 (5) Add a netfslib function to use (4) to extract pages from an ITER_IOBUF
     or ITER_UBUF iterator into an ITER_BVEC iterator.  This will get or
     pin the pages as appropriate.

 (6) Add a netfslib function to extract pages from an iterator that's of
     type ITER_UBUF/IOVEC/BVEC/KVEC/XARRAY and add them to a scatterlist.

     The function in (4) is used for a UBUF and IOVEC iterators, so those
     need cleaning up afterwards.  BVEC and XARRAY iterators are ungot and
     unpinned and may be rendered into elements that span multiple pages,
     for example if large folios are present.

Finally, there's an example test patch for the extract pages:

 (7) Make the block layer's BIO code pin pages or leave pages unaltered
     rather than getting a ref on the pages when the circumstances warrant,
     and noting in the bio struct what cleanups should be performed so that
     the bio cleanup code then does the right thing.

Changes:
========
ver #4)
 - Drop the patch to move the FOLL_* flags to linux/mm_types.h as they're
   no longer referenced by linux/uio.h.
 - Add patches 1-3 and 7.
 - Make patches 4-6 use ITER_SOURCE/DEST.
 - Allow additional gup_flags to be passed into iov_iter_extract_pages().

ver #3)
 - Switch to using EXPORT_SYMBOL_GPL to prevent indirect 3rd-party access
   to get/pin_user_pages_fast()[1].

ver #2)
 - Rolled the extraction cleanup mode query function into the extraction
   function, returning the indication through the argument list.
 - Fixed patch 4 (extract to scatterlist) to actually use the new
   extraction API.

I've pushed the patches here also:

	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=iov-extract

David

Link: https://lore.kernel.org/r/Y3zFzdWnWlEJ8X8/@infradead.org/ [1]
Link: https://lore.kernel.org/r/166697254399.61150.1256557652599252121.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166722777223.2555743.162508599131141451.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166732024173.3186319.18204305072070871546.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166869687556.3723671.10061142538708346995.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166920902005.1461876.2786264600108839814.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/166997419665.9475.15014699817597102032.stgit@warthog.procyon.org.uk/ # v3

---
David Howells (7):
      iov_iter: Change the direction macros into an enum
      iov_iter: Use the direction in the iterator functions
      iov_iter: Use IOCB/IOMAP_WRITE if available rather than iterator direction
      iov_iter: Add a function to extract a page list from an iterator
      netfs: Add a function to extract a UBUF or IOVEC into a BVEC iterator
      netfs: Add a function to extract an iterator into a scatterlist
      iov_iter, block: Make bio structs pin pages rather than ref'ing if appropriate


 block/bio.c               |  47 +++--
 block/fops.c              |   8 +-
 fs/9p/vfs_addr.c          |   2 +-
 fs/affs/file.c            |   4 +-
 fs/ceph/file.c            |   2 +-
 fs/dax.c                  |   6 +-
 fs/direct-io.c            |  22 +--
 fs/exfat/inode.c          |   6 +-
 fs/ext2/inode.c           |   2 +-
 fs/f2fs/file.c            |  10 +-
 fs/fat/inode.c            |   4 +-
 fs/fuse/dax.c             |   2 +-
 fs/fuse/file.c            |   8 +-
 fs/hfs/inode.c            |   2 +-
 fs/hfsplus/inode.c        |   2 +-
 fs/iomap/direct-io.c      |   6 +-
 fs/jfs/inode.c            |   2 +-
 fs/netfs/Makefile         |   1 +
 fs/netfs/iterator.c       | 367 ++++++++++++++++++++++++++++++++++
 fs/nfs/direct.c           |   2 +-
 fs/nilfs2/inode.c         |   2 +-
 fs/ntfs3/inode.c          |   2 +-
 fs/ocfs2/aops.c           |   2 +-
 fs/orangefs/inode.c       |   2 +-
 fs/reiserfs/inode.c       |   2 +-
 fs/udf/inode.c            |   2 +-
 include/linux/blk_types.h |   1 +
 include/linux/netfs.h     |   7 +
 include/linux/uio.h       |  59 ++++--
 lib/iov_iter.c            | 407 +++++++++++++++++++++++++++++++++++---
 mm/vmalloc.c              |   1 +
 31 files changed, 889 insertions(+), 103 deletions(-)
 create mode 100644 fs/netfs/iterator.c



^ permalink raw reply

* Re: [PATCH 3/7] nvmet: introduce bdev_zone_no helper
From: Bart Van Assche @ 2023-01-06 22:39 UTC (permalink / raw)
  To: Pankaj Raghav, axboe
  Cc: kernel, linux-kernel, hare, snitzer, dm-devel, damien.lemoal,
	linux-nvme, hch, linux-block, gost.dev
In-Reply-To: <20230106083317.93938-4-p.raghav@samsung.com>

On 1/6/23 00:33, Pankaj Raghav wrote:
> A generic bdev_zone_no() helper is added to calculate zone number for a
> given sector in a block device. This helper internally uses disk_zone_no()
> to find the zone number.
> 
> Use the helper bdev_zone_no() to calculate nr of zones. This let's us
> make modifications to the math if needed in one place.

Please use the imperative mood in patch descriptions (... is added -> 
add ...).

let's -> lets

If these two issues are addressed, please add:

Reviewed-by: Bart Van Assche <bvanassche@acm.org>

^ permalink raw reply

* Re: [PATCH 2/7] block: add a new helper bdev_{is_zone_start, offset_from_zone_start}
From: Bart Van Assche @ 2023-01-06 22:36 UTC (permalink / raw)
  To: Pankaj Raghav, axboe
  Cc: kernel, linux-kernel, hare, snitzer, dm-devel, damien.lemoal,
	linux-nvme, hch, linux-block, gost.dev
In-Reply-To: <20230106083317.93938-3-p.raghav@samsung.com>

On 1/6/23 00:33, Pankaj Raghav wrote:
> +static inline sector_t bdev_offset_from_zone_start(struct block_device *bdev,
> +						   sector_t sec)
> +{
> +	if (!bdev_is_zoned(bdev))
> +		return 0;
> +
> +	return sec & (bdev_zone_sectors(bdev) - 1);
> +}
> +
> +static inline bool bdev_is_zone_start(struct block_device *bdev, sector_t sec)
> +{
> +	if (!bdev_is_zoned(bdev))
> +		return false;
> +
> +	return bdev_offset_from_zone_start(bdev, sec) == 0;
> +}

A nit: 'sector_t sector' is much more common in the block layer than 
'sector_t sec'. Please consider changing 'sec' into 'sector'.

Thanks,

Bart.


^ permalink raw reply

* Re: [PATCH 1/7] block: remove superfluous check for request queue in bdev_is_zoned
From: Bart Van Assche @ 2023-01-06 22:34 UTC (permalink / raw)
  To: Pankaj Raghav, axboe
  Cc: kernel, linux-kernel, hare, snitzer, dm-devel, damien.lemoal,
	linux-nvme, hch, linux-block, gost.dev
In-Reply-To: <20230106083317.93938-2-p.raghav@samsung.com>

On 1/6/23 00:33, Pankaj Raghav wrote:
> Remove the superfluous request queue check in bdev_is_zoned() as the
> bdev_get_queue can never return NULL.

the bdev_get_queue -> bdev_get_queue()

Anyway:

Reviewed-by: Bart Van Assche <bvanassche@acm.org>

^ permalink raw reply

* Re: [External] [LSF/MM/BPF BoF] Session for Zoned Storage 2023
From: Viacheslav A.Dubeyko @ 2023-01-06 22:18 UTC (permalink / raw)
  To: lsf-pc
  Cc: Linux FS Devel, linux-block, Luis Chamberlain,
	Javier González, Matias Bjørling, Bart Van Assche,
	Damien Le Moal, Adam Manzanares, Hans Holmberg,
	Viacheslav Dubeyko
In-Reply-To: <F6BF25E2-FF26-48F2-8378-3CB36E362313@dubeyko.com>

CC: LSF/MM/BPF mailing list. Sorry, missed the list.

> On Jan 6, 2023, at 11:17 AM, Viacheslav Dubeyko <slava@dubeyko.com> wrote:
> 
> Hello,
> 
> I would like to suggest to have ZNS related session again. I think we have a lot of to discuss.
> As far as I can see, I have two topics for discussion. And I believe there are multiple other
> topics too.
> 
> How everybody feels to have a room for ZNS related discussion?
> 
> Thanks,
> Slava.



^ permalink raw reply

* Re: [GIT PULL] Block fixes for 6.2-rc3
From: pr-tracker-bot @ 2023-01-06 21:19 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Linus Torvalds, linux-block@vger.kernel.org
In-Reply-To: <e9134737-84e2-143c-258b-6945d492a789@kernel.dk>

The pull request you sent on Fri, 6 Jan 2023 09:27:53 -0700:

> git://git.kernel.dk/linux.git tags/block-2023-01-06

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/a689b938df39ab513026c53fb7011fd7cd594943

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html

^ permalink raw reply

* Re: [PATCH block/for-6.2-fixes] block: Drop spurious might_sleep() from blk_put_queue()
From: Jens Axboe @ 2023-01-06 20:52 UTC (permalink / raw)
  To: Luis Chamberlain, Tejun Heo
  Cc: Dan Carpenter, Christoph Hellwig, linux-block, linux-kernel
In-Reply-To: <Y7iIWA6h88cYjhcO@bombadil.infradead.org>

On 1/6/23 1:45 PM, Luis Chamberlain wrote:
> On Fri, Jan 06, 2023 at 10:34:10AM -1000, Tejun Heo wrote:
>> Dan reports the following smatch detected the following:
>>
>>   block/blk-cgroup.c:1863 blkcg_schedule_throttle() warn: sleeping in atomic context
>>
>> caused by blkcg_schedule_throttle() calling blk_put_queue() in an
>> non-sleepable context.
>>
>> blk_put_queue() acquired might_sleep() in 63f93fd6fa57 ("block: mark
>> blk_put_queue as potentially blocking") which transferred the might_sleep()
>> from blk_free_queue().
>>
>> blk_free_queue() acquired might_sleep() in e8c7d14ac6c3 ("block: revert back
>> to synchronous request_queue removal") while turning request_queue removal
>> synchronous. However, this isn't necessary as nothing in the free path
>> actually requires sleeping.
>>
>> It's pretty unusual to require a sleeping context in a put operation and
>> it's not needed in the first place. Let's drop it.
>>
>> Signed-off-by: Tejun Heo <tj@kernel.org>
>> Reported-by: Dan Carpenter <error27@gmail.com>
>> Link: https://lkml.kernel.org/r/Y7g3L6fntnTtOm63@kili
>> Cc: Christoph Hellwig <hch@lst.de>
>> Cc: Luis Chamberlain <mcgrof@kernel.org>
>> Fixes: e8c7d14ac6c3 ("block: revert back to synchronous request_queue removal") # v5.9+
> 
> *tons* has changed since e8c7d14ac6c3 and so the bots might think that
> *if* this patch is applied upstream it is justified for older kernels
> and I don't think that's yet been verified and doubt it.
> 
> And so I think adding a "Fixes" tag is not appropriate here.
> 
> First I'd like to hear from Christoph if he agrees with this patch
> upstream. For stable, someone would have to do the homework.

Outside of the easily audited paths, the kobj release paths are the
only ones of concern. And I didn't spot anything that sleeps. Looks
fine to me.

-- 
Jens Axboe



^ permalink raw reply

* Re: [PATCH block/for-6.2-fixes] block: Drop spurious might_sleep() from blk_put_queue()
From: Tejun Heo @ 2023-01-06 20:47 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Jens Axboe, Dan Carpenter, Christoph Hellwig, linux-block,
	linux-kernel
In-Reply-To: <Y7iIWA6h88cYjhcO@bombadil.infradead.org>

On Fri, Jan 06, 2023 at 12:45:12PM -0800, Luis Chamberlain wrote:
> *tons* has changed since e8c7d14ac6c3 and so the bots might think that
> *if* this patch is applied upstream it is justified for older kernels
> and I don't think that's yet been verified and doubt it.

Didn't look like anything relevant changed to me. Besides, all that the
patch does is removing a might_sleep() which can't hurt anything.

-- 
tejun

^ permalink raw reply

* Re: [PATCH block/for-6.2-fixes] block: Drop spurious might_sleep() from blk_put_queue()
From: Luis Chamberlain @ 2023-01-06 20:45 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jens Axboe, Dan Carpenter, Christoph Hellwig, linux-block,
	linux-kernel
In-Reply-To: <Y7iFwjN+XzWvLv3y@slm.duckdns.org>

On Fri, Jan 06, 2023 at 10:34:10AM -1000, Tejun Heo wrote:
> Dan reports the following smatch detected the following:
> 
>   block/blk-cgroup.c:1863 blkcg_schedule_throttle() warn: sleeping in atomic context
> 
> caused by blkcg_schedule_throttle() calling blk_put_queue() in an
> non-sleepable context.
> 
> blk_put_queue() acquired might_sleep() in 63f93fd6fa57 ("block: mark
> blk_put_queue as potentially blocking") which transferred the might_sleep()
> from blk_free_queue().
> 
> blk_free_queue() acquired might_sleep() in e8c7d14ac6c3 ("block: revert back
> to synchronous request_queue removal") while turning request_queue removal
> synchronous. However, this isn't necessary as nothing in the free path
> actually requires sleeping.
> 
> It's pretty unusual to require a sleeping context in a put operation and
> it's not needed in the first place. Let's drop it.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Reported-by: Dan Carpenter <error27@gmail.com>
> Link: https://lkml.kernel.org/r/Y7g3L6fntnTtOm63@kili
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Luis Chamberlain <mcgrof@kernel.org>
> Fixes: e8c7d14ac6c3 ("block: revert back to synchronous request_queue removal") # v5.9+

*tons* has changed since e8c7d14ac6c3 and so the bots might think that
*if* this patch is applied upstream it is justified for older kernels
and I don't think that's yet been verified and doubt it.

And so I think adding a "Fixes" tag is not appropriate here.

First I'd like to hear from Christoph if he agrees with this patch
upstream. For stable, someone would have to do the homework.

  Luis

^ permalink raw reply

* [PATCH block/for-6.2-fixes] block: Drop spurious might_sleep() from blk_put_queue()
From: Tejun Heo @ 2023-01-06 20:34 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Dan Carpenter, Christoph Hellwig, Luis Chamberlain, linux-block,
	linux-kernel
In-Reply-To: <9ac3390c-055b-546c-f1f4-68350dfe04f8@kernel.dk>

Dan reports the following smatch detected the following:

  block/blk-cgroup.c:1863 blkcg_schedule_throttle() warn: sleeping in atomic context

caused by blkcg_schedule_throttle() calling blk_put_queue() in an
non-sleepable context.

blk_put_queue() acquired might_sleep() in 63f93fd6fa57 ("block: mark
blk_put_queue as potentially blocking") which transferred the might_sleep()
from blk_free_queue().

blk_free_queue() acquired might_sleep() in e8c7d14ac6c3 ("block: revert back
to synchronous request_queue removal") while turning request_queue removal
synchronous. However, this isn't necessary as nothing in the free path
actually requires sleeping.

It's pretty unusual to require a sleeping context in a put operation and
it's not needed in the first place. Let's drop it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Dan Carpenter <error27@gmail.com>
Link: https://lkml.kernel.org/r/Y7g3L6fntnTtOm63@kili
Cc: Christoph Hellwig <hch@lst.de>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Fixes: e8c7d14ac6c3 ("block: revert back to synchronous request_queue removal") # v5.9+
--
 block/blk-core.c |    3 ---
 1 file changed, 3 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 9321767470dc..b5098355d8b2 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -283,12 +283,9 @@ static void blk_free_queue(struct request_queue *q)
  *
  * Decrements the refcount of the request_queue and free it when the refcount
  * reaches 0.
- *
- * Context: Can sleep.
  */
 void blk_put_queue(struct request_queue *q)
 {
-	might_sleep();
 	if (refcount_dec_and_test(&q->refs))
 		blk_free_queue(q);
 }

^ permalink raw reply related

* Re: [PATCH v2 1/2] blk-iocost: add refcounting for iocg
From: Tejun Heo @ 2023-01-06 20:18 UTC (permalink / raw)
  To: Yu Kuai
  Cc: hch, josef, axboe, cgroups, linux-block, linux-kernel, yi.zhang,
	yukuai (C)
In-Reply-To: <7dcdaef3-65c1-8175-fea7-53076f39697f@huaweicloud.com>

On Fri, Jan 06, 2023 at 09:08:45AM +0800, Yu Kuai wrote:
> Hi,
> 
> 在 2023/01/06 2:32, Tejun Heo 写道:
> > On Thu, Jan 05, 2023 at 09:14:07AM +0800, Yu Kuai wrote:
> > > 1) is related to blkg, while 2) is not, hence refcnting from blkg can't
> > > fix the problem. refcnting from blkcg_policy_data should be ok, but I
> > > see that bfq already has the similar refcnting, while other policy
> > > doesn't require such refcnting.
> > 
> > Hmm... taking a step back, wouldn't this be solved by moving the first part
> > of ioc_pd_free() to pd_offline_fn()? The ordering is strictly defined there,
> > right?
> > 
> 
> Moving first part to pd_offline_fn() has some requirements, like what I
> did in the other thread:
> 
> iocg can be activated again after pd_offline_fn(), which is possible
> because bio can be dispatched when cgroup is removed. I tried to avoid
> that by:
> 
> 1) dispatch all throttled bio io ioc_pd_offline()
> 2) don't throttle bio after ioc_pd_offline()
> 
> However, you already disagreed with that. 😔

Okay, I was completely wrong while I was replying to your original patch.
Should have looked at the code closer, my apologies.

What I missed is that pd_offline doesn't happen when the cgroup goes
offline. Please take a look at the following two commits:

 59b57717fff8 ("blkcg: delay blkg destruction until after writeback has finished")
 d866dbf61787 ("blkcg: rename blkcg->cgwb_refcnt to ->online_pin and always use it")

After the above two commits, ->pd_offline_fn() is called only after all
possible writebacks are complete, so it shouldn't allow mass escapes to
root. With writebacks out of the picture, it might be that there can be no
further IOs once ->pd_offline_fn() is called too as there can be no tasks
left in it and no dirty pages, but best to confirm that.

So, yeah, the original approach you took should work although I'm not sure
the patches that you added to make offline blkg to bypass are necessary
(that also contributed to my assumption that there will be more IOs on those
blkg's). Have you seen more IOs coming down the pipeline after offline? If
so, can you dump some backtraces and see where they're coming from?

Thanks.

-- 
tejun

^ permalink raw reply

* Re: [LSF/MM/BPF BoF] Session for Zoned Storage 2023
From: Viacheslav Dubeyko @ 2023-01-06 19:30 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Linux FS Devel, linux-block, Javier González,
	Matias Bjørling, Bart Van Assche, Damien Le Moal,
	Adam Manzanares, Hans Holmberg, Viacheslav A. Dubeyko
In-Reply-To: <Y7h0F0w06cNM89hO@bombadil.infradead.org>

> On Jan 6, 2023, at 11:18 AM, Luis Chamberlain <mcgrof@kernel.org> wrote:
> 
> On Fri, Jan 06, 2023 at 11:17:19AM -0800, Viacheslav Dubeyko wrote:
>> Hello,
>> 
>> As far as I can see, I have two topics for discussion.
> 
> What's that?

I am going to share these topics in separate emails. :)

(1) I am going to share SSDFS patchset soon. And topic is:
SSDFS + ZNS SSD: deterministic architecture decreasing TCO cost of data infrastructure.

(2) Second topic is:
How to achieve better lifetime and performance of caching layer with ZNS SSD?

Thanks,
Slava.

^ permalink raw reply

* Re: [LSF/MM/BPF BoF] Session for Zoned Storage 2023
From: Luis Chamberlain @ 2023-01-06 19:18 UTC (permalink / raw)
  To: Viacheslav Dubeyko
  Cc: Linux FS Devel, linux-block, Javier González,
	Matias Bjørling, Bart Van Assche, Damien Le Moal,
	Adam Manzanares, Hans Holmberg, Viacheslav A. Dubeyko
In-Reply-To: <F6BF25E2-FF26-48F2-8378-3CB36E362313@dubeyko.com>

On Fri, Jan 06, 2023 at 11:17:19AM -0800, Viacheslav Dubeyko wrote:
> Hello,
> 
> As far as I can see, I have two topics for discussion.

What's that?

  Luis

^ permalink raw reply

* [LSF/MM/BPF BoF] Session for Zoned Storage 2023
From: Viacheslav Dubeyko @ 2023-01-06 19:17 UTC (permalink / raw)
  To: Linux FS Devel, linux-block
  Cc: Luis Chamberlain, Javier González, Matias Bjørling,
	Bart Van Assche, Damien Le Moal, Adam Manzanares, Hans Holmberg,
	Viacheslav A. Dubeyko

Hello,

I would like to suggest to have ZNS related session again. I think we have a lot of to discuss.
As far as I can see, I have two topics for discussion. And I believe there are multiple other
topics too.

How everybody feels to have a room for ZNS related discussion?

Thanks,
Slava.

^ permalink raw reply

* Re: [bug report] memcontrol: schedule throttling if we are congested
From: Luis Chamberlain @ 2023-01-06 19:09 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Tejun Heo, Dan Carpenter, Christoph Hellwig, linux-block,
	linux-kernel
In-Reply-To: <9ac3390c-055b-546c-f1f4-68350dfe04f8@kernel.dk>

On Fri, Jan 06, 2023 at 11:49:33AM -0700, Jens Axboe wrote:
> On 1/6/23 10:33 AM, Tejun Heo wrote:
> > Hello,
> > 
> > (cc'ing Luis, Christoph and Jens and quoting whole body)
> > 
> > On Fri, Jan 06, 2023 at 05:58:55PM +0300, Dan Carpenter wrote:
> >> Hello Tejun Heo,
> >>
> >> The patch 2cf855837b89: "memcontrol: schedule throttling if we are
> >> congested" from Jul 3, 2018, leads to the following Smatch static
> >> checker warning:
> >>
> >> block/blk-cgroup.c:1863 blkcg_schedule_throttle() warn: sleeping in atomic context
> >>
> >> The call tree looks like:
> >>
> >> ioc_rqos_merge() <- disables preempt
> >> __cgroup_throttle_swaprate() <- disables preempt
> >> -> blkcg_schedule_throttle()
> >>
> >> Here is one of the callers:
> >> mm/swapfile.c
> >>   3657          spin_lock(&swap_avail_lock);
> >>                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
> >> Takes spin lock.
> >>
> >>   3658          plist_for_each_entry_safe(si, next, &swap_avail_heads[nid],
> >>   3659                                    avail_lists[nid]) {
> >>   3660                  if (si->bdev) {
> >>   3661                          blkcg_schedule_throttle(si->bdev->bd_disk, true);
> >>                                 ^^^^^^^^^^^^^^^^^^^^^^^
> >> Calls blkcg_schedule_throttle().
> >>
> >>   3662                          break;
> >>   3663                  }
> >>   3664          }
> >>
> >> block/blk-cgroup.c
> >>   1851  void blkcg_schedule_throttle(struct gendisk *disk, bool use_memdelay)
> >>   1852  {
> >>   1853          struct request_queue *q = disk->queue;
> >>   1854  
> >>   1855          if (unlikely(current->flags & PF_KTHREAD))
> >>   1856                  return;
> >>   1857  
> >>   1858          if (current->throttle_queue != q) {
> >>   1859                  if (!blk_get_queue(q))
> >>   1860                          return;
> >>   1861  
> >>   1862                  if (current->throttle_queue)
> >>   1863                          blk_put_queue(current->throttle_queue);
> >>                                 ^^^^^^^^^^^^^^
> >> Sleeps.
> >>
> >>   1864                  current->throttle_queue = q;
> >>   1865          }
> >>   1866  
> >>   1867          if (use_memdelay)
> >>   1868                  current->use_memdelay = use_memdelay;
> >>   1869          set_notify_resume(current);
> >>   1870  }
> > 
> > In general, it's quite unusual for a put operation to require a sleepable
> > context and I could be missing sth but the actual put / release paths don't
> > seem to actually need might_sleep(). It seems sprious.
> > 
> > The might_sleep() in put was added by Christoph's 63f93fd6fa57 ("block: mark
> > blk_put_queue as potentially blocking") which promoted it from release to
> > put cuz the caller usually can't tell whether its put is the last put.
> > 
> > And that put in release was added by Luis in e8c7d14ac6c3 ("block: revert
> > back to synchronous request_queue removal") while making the release path
> > synchronous, the rationale being 

The rationale was that we reverted exepected userspace expection for
something that was sync to async so broke userspace expectations and
we can't do that.

> > that releasing asynchronously makes dynamic
> > device removal / readdition behaviors unpredictable and it also seems to
> > note that might_sleep() is no longer needed but still kept, which seems a
> > bit odd to me.
> > 
> > Here's my take on it:
> > 
> > * Let's please not require a sleepable context in a put operation. It's
> >   unusual, inconvenient and error-prone, and likely to cause its users to
> >   implement multiple copies of async mechanisms around it.
> > 
> > * A better way to deal with removal / readdition race is flushing release
> >   operaitons either at the end of removal or before trying to add something
> >   (you can get fancy w/ flushing only if there's name collision too), not
> >   making a put path synchronously call release which needs to sleep.
> > 
> > * If might_sleep() is currently not needed, let's please drop it. It just
> >   makes people scratch their head when reading the code.
> 
> I looked over the call path, and I don't think anything in there sleeps.
> So should be fine to remove the might_sleep().

As soon as commit 63f93fd6fa5717 ("block: mark blk_put_queue as
potentially blocking") on v6.2-rc1 it was upgraded to might_sleep()
directly on blk_put_queue(), I can't find a rationale after that
to justify the removal. But since it is not clear if we keep it,
we should document that rationale.

  Luis

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox