public inbox for linux-nvme@lists.infradead.org
* [RFC 0/4] partial sector read support
@ 2022-05-04 16:32 Keith Busch
  2022-05-04 16:32 ` [RFC 1/4] block: export dma_alignment attribute Keith Busch
                   ` (4 more replies)
  0 siblings, 5 replies; 7+ messages in thread
From: Keith Busch @ 2022-05-04 16:32 UTC (permalink / raw)
  To: linux-nvme, linux-block; +Cc: axboe, hch, Keith Busch

From: Keith Busch <kbusch@kernel.org>

Don't you just hate that you must read a full sector when you only cared
about a few bytes? Well, good news! Standardized protocols provide a way
for the host to describe unwanted read data, allowing partial sector
access. This lets applications reduce allocated memory that was used to
hold the unwanted data, while also reducing link traffic.

This series enables this for the NVMe protocol through direct io. With
this, a userspace app can read a single contiguous byte range, subject
to hardware DMA constraints. An in-kernel user could theoretically
construct a bio to read multiple discontiguous ranges if desired.

Keith Busch (4):
  block: export dma_alignment attribute
  block: relax direct io memory alignment
  block: add bit bucket support
  nvme: add bit bucket support

 block/blk-core.c          |  5 ++++
 block/blk-merge.c         |  3 +-
 block/blk-mq.c            |  2 ++
 block/blk-sysfs.c         | 10 +++++++
 block/fops.c              | 61 +++++++++++++++++++++++++++++++++------
 drivers/nvme/host/core.c  |  3 ++
 drivers/nvme/host/nvme.h  |  6 ++++
 drivers/nvme/host/pci.c   | 17 +++++++++--
 fs/direct-io.c            | 11 ++++---
 fs/iomap/direct-io.c      |  3 +-
 include/linux/blk-mq.h    |  2 ++
 include/linux/blk_types.h |  1 +
 include/linux/blkdev.h    | 18 ++++++++++++
 include/linux/nvme.h      |  2 ++
 14 files changed, 127 insertions(+), 17 deletions(-)

-- 
2.30.2




* [RFC 1/4] block: export dma_alignment attribute
  2022-05-04 16:32 [RFC 0/4] partial sector read support Keith Busch
@ 2022-05-04 16:32 ` Keith Busch
  2022-05-04 16:32 ` [RFC 2/4] block: relax direct io memory alignment Keith Busch
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 7+ messages in thread
From: Keith Busch @ 2022-05-04 16:32 UTC (permalink / raw)
  To: linux-nvme, linux-block; +Cc: axboe, hch, Keith Busch

From: Keith Busch <kbusch@kernel.org>

User space may want to know how to align its buffers to avoid
bouncing. Export the queue's dma_alignment attribute.

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 block/blk-sysfs.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 88bd41d4cb59..14607565d781 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -274,6 +274,11 @@ static ssize_t queue_virt_boundary_mask_show(struct request_queue *q, char *page
 	return queue_var_show(q->limits.virt_boundary_mask, page);
 }
 
+static ssize_t queue_dma_alignment_show(struct request_queue *q, char *page)
+{
+	return queue_var_show(queue_dma_alignment(q), page);
+}
+
 #define QUEUE_SYSFS_BIT_FNS(name, flag, neg)				\
 static ssize_t								\
 queue_##name##_show(struct request_queue *q, char *page)		\
@@ -606,6 +611,7 @@ QUEUE_RO_ENTRY(queue_dax, "dax");
 QUEUE_RW_ENTRY(queue_io_timeout, "io_timeout");
 QUEUE_RW_ENTRY(queue_wb_lat, "wbt_lat_usec");
 QUEUE_RO_ENTRY(queue_virt_boundary_mask, "virt_boundary_mask");
+QUEUE_RO_ENTRY(queue_dma_alignment, "dma_alignment");
 
 #ifdef CONFIG_BLK_DEV_THROTTLING_LOW
 QUEUE_RW_ENTRY(blk_throtl_sample_time, "throttle_sample_time");
@@ -667,6 +673,7 @@ static struct attribute *queue_attrs[] = {
 	&blk_throtl_sample_time_entry.attr,
 #endif
 	&queue_virt_boundary_mask_entry.attr,
+	&queue_dma_alignment_entry.attr,
 	NULL,
 };
 
-- 
2.30.2




* [RFC 2/4] block: relax direct io memory alignment
  2022-05-04 16:32 [RFC 0/4] partial sector read support Keith Busch
  2022-05-04 16:32 ` [RFC 1/4] block: export dma_alignment attribute Keith Busch
@ 2022-05-04 16:32 ` Keith Busch
  2022-05-04 16:32 ` [RFC 3/4] block: add bit bucket support Keith Busch
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 7+ messages in thread
From: Keith Busch @ 2022-05-04 16:32 UTC (permalink / raw)
  To: linux-nvme, linux-block; +Cc: axboe, hch, Keith Busch

From: Keith Busch <kbusch@kernel.org>

Use the address alignment requirements from the hardware for direct io
instead of requiring addresses be aligned to the block size. User space
can discover the alignment requirements from the dma_alignment queue
attribute.

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 block/fops.c           | 15 +++++++++------
 fs/direct-io.c         | 11 +++++++----
 fs/iomap/direct-io.c   |  3 ++-
 include/linux/blkdev.h |  5 +++++
 4 files changed, 23 insertions(+), 11 deletions(-)

diff --git a/block/fops.c b/block/fops.c
index 9f2ecec406b0..a6583bce1e7d 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -62,8 +62,9 @@ static ssize_t __blkdev_direct_IO_simple(struct kiocb *iocb,
 	struct bio bio;
 	ssize_t ret;
 
-	if ((pos | iov_iter_alignment(iter)) &
-	    (bdev_logical_block_size(bdev) - 1))
+	if ((pos | iov_iter_count(iter)) & (bdev_logical_block_size(bdev) - 1))
+		return -EINVAL;
+	if (iov_iter_alignment(iter) & bdev_dma_alignment(bdev))
 		return -EINVAL;
 
 	if (nr_pages <= DIO_INLINE_BIO_VECS)
@@ -193,8 +194,9 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
 	loff_t pos = iocb->ki_pos;
 	int ret = 0;
 
-	if ((pos | iov_iter_alignment(iter)) &
-	    (bdev_logical_block_size(bdev) - 1))
+	if ((pos | iov_iter_count(iter)) & (bdev_logical_block_size(bdev) - 1))
+		return -EINVAL;
+	if (iov_iter_alignment(iter) & bdev_dma_alignment(bdev))
 		return -EINVAL;
 
 	bio = bio_alloc_kiocb(iocb, bdev, nr_pages, opf, &blkdev_dio_pool);
@@ -316,8 +318,9 @@ static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb,
 	loff_t pos = iocb->ki_pos;
 	int ret = 0;
 
-	if ((pos | iov_iter_alignment(iter)) &
-	    (bdev_logical_block_size(bdev) - 1))
+	if ((pos | iov_iter_count(iter)) & (bdev_logical_block_size(bdev) - 1))
+		return -EINVAL;
+	if (iov_iter_alignment(iter) & bdev_dma_alignment(bdev))
 		return -EINVAL;
 
 	bio = bio_alloc_kiocb(iocb, bdev, nr_pages, opf, &blkdev_dio_pool);
diff --git a/fs/direct-io.c b/fs/direct-io.c
index aef06e607b40..b3d249d7d91d 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -1132,7 +1132,7 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
 	struct dio_submit sdio = { 0, };
 	struct buffer_head map_bh = { 0, };
 	struct blk_plug plug;
-	unsigned long align = offset | iov_iter_alignment(iter);
+	unsigned long align = iov_iter_alignment(iter);
 
 	/*
 	 * Avoid references to bdev if not absolutely needed to give
@@ -1166,11 +1166,14 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
 		goto fail_dio;
 	}
 
-	if (align & blocksize_mask) {
-		if (bdev)
+	if ((offset | align) & blocksize_mask) {
+		if (bdev) {
 			blkbits = blksize_bits(bdev_logical_block_size(bdev));
+			if (align & bdev_dma_alignment(bdev))
+				goto fail_dio;
+		}
 		blocksize_mask = (1 << blkbits) - 1;
-		if (align & blocksize_mask)
+		if ((offset | count) & blocksize_mask)
 			goto fail_dio;
 	}
 
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index b08f5dc31780..c73b050b7026 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -243,7 +243,8 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
 	size_t copied = 0;
 	size_t orig_count;
 
-	if ((pos | length | align) & ((1 << blkbits) - 1))
+	if ((pos | length) & ((1 << blkbits) - 1) ||
+	    align & bdev_dma_alignment(iomap->bdev))
 		return -EINVAL;
 
 	if (iomap->type == IOMAP_UNWRITTEN) {
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 60d016138997..dba6d411fc1e 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1378,6 +1378,11 @@ static inline int queue_dma_alignment(const struct request_queue *q)
 	return q ? q->dma_alignment : 511;
 }
 
+static inline unsigned int bdev_dma_alignment(struct block_device *bdev)
+{
+	return queue_dma_alignment(bdev_get_queue(bdev));
+}
+
 static inline int blk_rq_aligned(struct request_queue *q, unsigned long addr,
 				 unsigned int len)
 {
-- 
2.30.2




* [RFC 3/4] block: add bit bucket support
  2022-05-04 16:32 [RFC 0/4] partial sector read support Keith Busch
  2022-05-04 16:32 ` [RFC 1/4] block: export dma_alignment attribute Keith Busch
  2022-05-04 16:32 ` [RFC 2/4] block: relax direct io memory alignment Keith Busch
@ 2022-05-04 16:32 ` Keith Busch
  2022-05-04 16:32 ` [RFC 4/4] nvme: " Keith Busch
  2022-05-10 22:41 ` [RFC 0/4] partial sector read support Sagi Grimberg
  4 siblings, 0 replies; 7+ messages in thread
From: Keith Busch @ 2022-05-04 16:32 UTC (permalink / raw)
  To: linux-nvme, linux-block; +Cc: axboe, hch, Keith Busch

From: Keith Busch <kbusch@kernel.org>

Bit buckets allow applications to read partial sectors. Add block
support for partial reads if the request_queue supports it.

This implementation designates a special page for bit buckets. Filling
the holes with this special page should let merging operations continue
as normal. Read data should never be sent to this page; the driver
should instead recognize it and set up its scatter-gather list
accordingly.

The bit_bucket attribute is exported so applications may know if they
can do partial sector reads.

This implementation only works for direct io on raw block devices, and
does not work with pre-registered buffers due to those already coming in
as a bvec.

Requests with bit buckets need to be flagged specially since NVMe needs
to know before walking the segments whether it should construct a bit
bucket SGL instead of a PRP list.

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 block/blk-core.c          |  5 ++++
 block/blk-merge.c         |  3 +-
 block/blk-mq.c            |  2 ++
 block/blk-sysfs.c         |  3 ++
 block/fops.c              | 58 +++++++++++++++++++++++++++++++++------
 include/linux/blk-mq.h    |  2 ++
 include/linux/blk_types.h |  1 +
 include/linux/blkdev.h    | 13 +++++++++
 8 files changed, 77 insertions(+), 10 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 937bb6b86331..a11931857dd9 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -74,6 +74,9 @@ struct kmem_cache *blk_requestq_srcu_cachep;
  */
 static struct workqueue_struct *kblockd_workqueue;
 
+struct page *blk_bb_page;
+EXPORT_SYMBOL_GPL(blk_bb_page);
+
 /**
  * blk_queue_flag_set - atomically set a queue flag
  * @flag: flag to be set
@@ -1309,5 +1312,7 @@ int __init blk_dev_init(void)
 
 	blk_debugfs_root = debugfs_create_dir("block", NULL);
 
+	blk_bb_page = ZERO_PAGE(0);
+
 	return 0;
 }
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 7771dacc99cb..3fde24bf97f3 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -278,7 +278,8 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
 		 * If the queue doesn't support SG gaps and adding this
 		 * offset would create a gap, disallow it.
 		 */
-		if (bvprvp && bvec_gap_to_prev(q, bvprvp, bv.bv_offset))
+		if (!bio_flagged(bio, BIO_BIT_BUCKET) && bvprvp &&
+		    bvec_gap_to_prev(q, bvprvp, bv.bv_offset))
 			goto split;
 
 		if (nsegs < max_segs &&
diff --git a/block/blk-mq.c b/block/blk-mq.c
index c4370d276170..80309d243a09 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2411,6 +2411,8 @@ static void blk_mq_bio_to_request(struct request *rq, struct bio *bio,
 
 	if (bio->bi_opf & REQ_RAHEAD)
 		rq->cmd_flags |= REQ_FAILFAST_MASK;
+	if (bio_flagged(bio, BIO_BIT_BUCKET))
+		rq->rq_flags |= RQF_BIT_BUCKET;
 
 	rq->__sector = bio->bi_iter.bi_sector;
 	blk_rq_bio_prep(rq, bio, nr_segs);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 14607565d781..19c385084aea 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -309,6 +309,7 @@ QUEUE_SYSFS_BIT_FNS(nonrot, NONROT, 1);
 QUEUE_SYSFS_BIT_FNS(random, ADD_RANDOM, 0);
 QUEUE_SYSFS_BIT_FNS(iostats, IO_STAT, 0);
 QUEUE_SYSFS_BIT_FNS(stable_writes, STABLE_WRITES, 0);
+QUEUE_SYSFS_BIT_FNS(bit_bucket, BIT_BUCKET, 0);
 #undef QUEUE_SYSFS_BIT_FNS
 
 static ssize_t queue_zoned_show(struct request_queue *q, char *page)
@@ -627,6 +628,7 @@ QUEUE_RW_ENTRY(queue_nonrot, "rotational");
 QUEUE_RW_ENTRY(queue_iostats, "iostats");
 QUEUE_RW_ENTRY(queue_random, "add_random");
 QUEUE_RW_ENTRY(queue_stable_writes, "stable_writes");
+QUEUE_RW_ENTRY(queue_bit_bucket, "bit_bucket");
 
 static struct attribute *queue_attrs[] = {
 	&queue_requests_entry.attr,
@@ -653,6 +655,7 @@ static struct attribute *queue_attrs[] = {
 	&queue_zone_append_max_entry.attr,
 	&queue_zone_write_granularity_entry.attr,
 	&queue_nonrot_entry.attr,
+	&queue_bit_bucket_entry.attr,
 	&queue_zoned_entry.attr,
 	&queue_nr_zones_entry.attr,
 	&queue_max_open_zones_entry.attr,
diff --git a/block/fops.c b/block/fops.c
index a6583bce1e7d..36ccd52ece03 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -57,13 +57,21 @@ static ssize_t __blkdev_direct_IO_simple(struct kiocb *iocb,
 {
 	struct block_device *bdev = iocb->ki_filp->private_data;
 	struct bio_vec inline_vecs[DIO_INLINE_BIO_VECS], *vecs;
+	unsigned int blksz = bdev_logical_block_size(bdev);
 	loff_t pos = iocb->ki_pos;
 	bool should_dirty = false;
+	u16 skip = 0, trunc = 0;
 	struct bio bio;
 	ssize_t ret;
 
-	if ((pos | iov_iter_count(iter)) & (bdev_logical_block_size(bdev) - 1))
-		return -EINVAL;
+	if ((pos | iov_iter_count(iter)) & (blksz - 1)) {
+		if (iov_iter_rw(iter) != READ || iov_iter_is_bvec(iter) ||
+		    !blk_queue_bb(bdev_get_queue(bdev)))
+			return -EINVAL;
+		skip = pos & (blksz - 1);
+		trunc = blksz - ((pos + iov_iter_count(iter)) & (blksz - 1));
+		nr_pages += !!skip + !!trunc;
+	}
 	if (iov_iter_alignment(iter) & bdev_dma_alignment(bdev))
 		return -EINVAL;
 
@@ -80,6 +88,8 @@ static ssize_t __blkdev_direct_IO_simple(struct kiocb *iocb,
 		bio_init(&bio, bdev, vecs, nr_pages, REQ_OP_READ);
 		if (iter_is_iovec(iter))
 			should_dirty = true;
+		if (skip)
+			blk_add_bb_page(&bio, skip);
 	} else {
 		bio_init(&bio, bdev, vecs, nr_pages, dio_bio_write_op(iocb));
 	}
@@ -91,7 +101,10 @@ static ssize_t __blkdev_direct_IO_simple(struct kiocb *iocb,
 	ret = bio_iov_iter_get_pages(&bio, iter);
 	if (unlikely(ret))
 		goto out;
-	ret = bio.bi_iter.bi_size;
+
+	if (trunc)
+		blk_add_bb_page(&bio, trunc);
+	ret = bio.bi_iter.bi_size - trunc - skip;
 
 	if (iov_iter_rw(iter) == WRITE)
 		task_io_account_write(ret);
@@ -186,16 +199,25 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
 		unsigned int nr_pages)
 {
 	struct block_device *bdev = iocb->ki_filp->private_data;
+	unsigned int blksz = bdev_logical_block_size(bdev);
 	struct blk_plug plug;
 	struct blkdev_dio *dio;
 	struct bio *bio;
 	bool is_read = (iov_iter_rw(iter) == READ), is_sync;
 	unsigned int opf = is_read ? REQ_OP_READ : dio_bio_write_op(iocb);
 	loff_t pos = iocb->ki_pos;
+	u16 skip = 0, trunc = 0, bucket_bytes = 0;
 	int ret = 0;
 
-	if ((pos | iov_iter_count(iter)) & (bdev_logical_block_size(bdev) - 1))
-		return -EINVAL;
+	if ((pos | iov_iter_count(iter)) & (blksz - 1)) {
+		if (iov_iter_rw(iter) != READ || iov_iter_is_bvec(iter) ||
+		    !blk_queue_bb(bdev_get_queue(bdev)))
+			return -EINVAL;
+		skip = pos & (blksz - 1);
+		trunc = blksz - ((pos + iov_iter_count(iter)) & (blksz - 1));
+		bucket_bytes = skip + trunc;
+		nr_pages += !!skip + !!trunc;
+	}
 	if (iov_iter_alignment(iter) & bdev_dma_alignment(bdev))
 		return -EINVAL;
 
@@ -240,6 +262,10 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
 		if (is_read) {
 			if (dio->flags & DIO_SHOULD_DIRTY)
 				bio_set_pages_dirty(bio);
+			if (skip) {
+				blk_add_bb_page(bio, skip);
+				skip = 0;
+			}
 		} else {
 			task_io_account_write(bio->bi_iter.bi_size);
 		}
@@ -251,6 +277,8 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
 
 		nr_pages = bio_iov_vecs_to_alloc(iter, BIO_MAX_VECS);
 		if (!nr_pages) {
+			if (trunc)
+				blk_add_bb_page(bio, trunc);
 			submit_bio(bio);
 			break;
 		}
@@ -275,7 +303,7 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
 	if (!ret)
 		ret = blk_status_to_errno(dio->bio.bi_status);
 	if (likely(!ret))
-		ret = dio->size;
+		ret = dio->size - bucket_bytes;
 
 	bio_put(&dio->bio);
 	return ret;
@@ -311,15 +339,23 @@ static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb,
 					unsigned int nr_pages)
 {
 	struct block_device *bdev = iocb->ki_filp->private_data;
+	unsigned int blksz = bdev_logical_block_size(bdev);
 	bool is_read = iov_iter_rw(iter) == READ;
 	unsigned int opf = is_read ? REQ_OP_READ : dio_bio_write_op(iocb);
 	struct blkdev_dio *dio;
 	struct bio *bio;
 	loff_t pos = iocb->ki_pos;
+	u16 skip = 0, trunc = 0;
 	int ret = 0;
 
-	if ((pos | iov_iter_count(iter)) & (bdev_logical_block_size(bdev) - 1))
-		return -EINVAL;
+	if ((pos | iov_iter_count(iter)) & (blksz - 1)) {
+		if (iov_iter_rw(iter) != READ || iov_iter_is_bvec(iter) ||
+		    !blk_queue_bb(bdev_get_queue(bdev)))
+			return -EINVAL;
+		skip = pos & (blksz - 1);
+		trunc = blksz - ((pos + iov_iter_count(iter)) & (blksz - 1));
+		nr_pages += !!skip + !!trunc;
+	}
 	if (iov_iter_alignment(iter) & bdev_dma_alignment(bdev))
 		return -EINVAL;
 
@@ -340,13 +376,17 @@ static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb,
 		 */
 		bio_iov_bvec_set(bio, iter);
 	} else {
+		if (skip)
+			blk_add_bb_page(bio, skip);
 		ret = bio_iov_iter_get_pages(bio, iter);
 		if (unlikely(ret)) {
 			bio_put(bio);
 			return ret;
 		}
+		if (trunc)
+			blk_add_bb_page(bio, trunc);
 	}
-	dio->size = bio->bi_iter.bi_size;
+	dio->size = bio->bi_iter.bi_size - trunc - skip;
 
 	if (is_read) {
 		if (iter_is_iovec(iter)) {
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 7aa5c54901a9..1a3902c2440f 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -22,6 +22,8 @@ typedef __u32 __bitwise req_flags_t;
 
 /* drive already may have started this one */
 #define RQF_STARTED		((__force req_flags_t)(1 << 1))
+/* request has bit bucket payload */
+#define RQF_BIT_BUCKET         ((__force req_flags_t)(1 << 2))
 /* may not be passed by ioscheduler */
 #define RQF_SOFTBARRIER		((__force req_flags_t)(1 << 3))
 /* request for flush sequence */
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 1973ef9bd40f..f55e194b72a0 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -330,6 +330,7 @@ enum {
 	BIO_REMAPPED,
 	BIO_ZONE_WRITE_LOCKED,	/* Owns a zoned device zone write lock */
 	BIO_PERCPU_CACHE,	/* can participate in per-cpu alloc cache */
+	BIO_BIT_BUCKET,		/* contains one or more bit bucket pages */
 	BIO_FLAG_LAST
 };
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index dba6d411fc1e..5feaa5e7810e 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -44,6 +44,7 @@ struct blk_crypto_profile;
 extern const struct device_type disk_type;
 extern struct device_type part_type;
 extern struct class block_class;
+extern struct page *blk_bb_page;
 
 /* Must be consistent with blk_mq_poll_stats_bkt() */
 #define BLK_MQ_POLL_STATS_BKTS 16
@@ -560,6 +561,7 @@ struct request_queue {
 #define QUEUE_FLAG_RQ_ALLOC_TIME 27	/* record rq->alloc_time_ns */
 #define QUEUE_FLAG_HCTX_ACTIVE	28	/* at least one blk-mq hctx is active */
 #define QUEUE_FLAG_NOWAIT       29	/* device supports NOWAIT */
+#define QUEUE_FLAG_BIT_BUCKET   30	/* device supports read bit buckets */
 
 #define QUEUE_FLAG_MQ_DEFAULT	((1 << QUEUE_FLAG_IO_STAT) |		\
 				 (1 << QUEUE_FLAG_SAME_COMP) |		\
@@ -605,6 +607,7 @@ bool blk_queue_flag_test_and_set(unsigned int flag, struct request_queue *q);
 #define blk_queue_fua(q)	test_bit(QUEUE_FLAG_FUA, &(q)->queue_flags)
 #define blk_queue_registered(q)	test_bit(QUEUE_FLAG_REGISTERED, &(q)->queue_flags)
 #define blk_queue_nowait(q)	test_bit(QUEUE_FLAG_NOWAIT, &(q)->queue_flags)
+#define blk_queue_bb(q)		test_bit(QUEUE_FLAG_BIT_BUCKET, &(q)->queue_flags)
 
 extern void blk_set_pm_only(struct request_queue *q);
 extern void blk_clear_pm_only(struct request_queue *q);
@@ -1588,4 +1591,14 @@ struct io_comp_batch {
 
 #define DEFINE_IO_COMP_BATCH(name)	struct io_comp_batch name = { }
 
+static inline void blk_add_bb_page(struct bio *bio, int len)
+{
+	bio_set_flag(bio, BIO_BIT_BUCKET);
+	get_page(blk_bb_page);
+	bio_add_page(bio, blk_bb_page, len, 0);
+}
+static inline bool blk_is_bit_bucket(struct page *page)
+{
+	return page == blk_bb_page;
+}
 #endif /* _LINUX_BLKDEV_H */
-- 
2.30.2




* [RFC 4/4] nvme: add bit bucket support
  2022-05-04 16:32 [RFC 0/4] partial sector read support Keith Busch
                   ` (2 preceding siblings ...)
  2022-05-04 16:32 ` [RFC 3/4] block: add bit bucket support Keith Busch
@ 2022-05-04 16:32 ` Keith Busch
  2022-05-10 22:41 ` [RFC 0/4] partial sector read support Sagi Grimberg
  4 siblings, 0 replies; 7+ messages in thread
From: Keith Busch @ 2022-05-04 16:32 UTC (permalink / raw)
  To: linux-nvme, linux-block; +Cc: axboe, hch, Keith Busch

From: Keith Busch <kbusch@kernel.org>

Set the queue flag for bit bucket support if the hardware and driver
support it. The NVMe PCI driver will recognize the special bit bucket
page in read commands and set up an appropriate SGL descriptor for it.

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 drivers/nvme/host/core.c |  3 +++
 drivers/nvme/host/nvme.h |  6 ++++++
 drivers/nvme/host/pci.c  | 17 +++++++++++++++--
 include/linux/nvme.h     |  2 ++
 4 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index e1846d04817f..bea054565eed 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3928,6 +3928,9 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid,
 	if (ctrl->ops->flags & NVME_F_PCI_P2PDMA)
 		blk_queue_flag_set(QUEUE_FLAG_PCI_P2PDMA, ns->queue);
 
+	if (nvme_ctrl_sgl_bb_supported(ctrl) && ctrl->ops->flags & NVME_F_BB)
+		blk_queue_flag_set(QUEUE_FLAG_BIT_BUCKET, ns->queue);
+
 	ns->ctrl = ctrl;
 	kref_init(&ns->kref);
 
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index a2b53ca63335..91d75f95fe39 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -495,6 +495,7 @@ struct nvme_ctrl_ops {
 #define NVME_F_FABRICS			(1 << 0)
 #define NVME_F_METADATA_SUPPORTED	(1 << 1)
 #define NVME_F_PCI_P2PDMA		(1 << 2)
+#define NVME_F_BB			(1 << 3)
 	int (*reg_read32)(struct nvme_ctrl *ctrl, u32 off, u32 *val);
 	int (*reg_write32)(struct nvme_ctrl *ctrl, u32 off, u32 val);
 	int (*reg_read64)(struct nvme_ctrl *ctrl, u32 off, u64 *val);
@@ -957,6 +958,11 @@ static inline bool nvme_ctrl_sgl_supported(struct nvme_ctrl *ctrl)
 	return ctrl->sgls & ((1 << 0) | (1 << 1));
 }
 
+static inline bool nvme_ctrl_sgl_bb_supported(struct nvme_ctrl *ctrl)
+{
+	return ctrl->sgls & (1 << 16);
+}
+
 u32 nvme_command_effects(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
 			 u8 opcode);
 int nvme_execute_passthru_rq(struct request *rq);
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 3aacf1c0d5a5..83e057f44867 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -535,6 +535,8 @@ static inline bool nvme_pci_use_sgls(struct nvme_dev *dev, struct request *req)
 
 	avg_seg_size = DIV_ROUND_UP(blk_rq_payload_bytes(req), nseg);
 
+	if (req->rq_flags & RQF_BIT_BUCKET)
+		return true;
 	if (!nvme_ctrl_sgl_supported(&dev->ctrl))
 		return false;
 	if (!iod->nvmeq->qid)
@@ -724,6 +726,13 @@ static void nvme_pci_sgl_set_data(struct nvme_sgl_desc *sge,
 	sge->type = NVME_SGL_FMT_DATA_DESC << 4;
 }
 
+static void nvme_pci_sgl_set_bb(struct nvme_sgl_desc *sge,
+				struct scatterlist *sg)
+{
+	sge->length = cpu_to_le32(sg_dma_len(sg));
+	sge->type = NVME_SGL_FMT_BB_DESC << 4;
+}
+
 static void nvme_pci_sgl_set_seg(struct nvme_sgl_desc *sge,
 		dma_addr_t dma_addr, int entries)
 {
@@ -789,7 +798,10 @@ static blk_status_t nvme_pci_setup_sgls(struct nvme_dev *dev,
 			nvme_pci_sgl_set_seg(link, sgl_dma, entries);
 		}
 
-		nvme_pci_sgl_set_data(&sg_list[i++], sg);
+		if (rq_data_dir(req) == READ && blk_is_bit_bucket(sg_page(sg)))
+			nvme_pci_sgl_set_bb(&sg_list[i++], sg);
+		else
+			nvme_pci_sgl_set_data(&sg_list[i++], sg);
 		sg = sg_next(sg);
 	} while (--entries > 0);
 
@@ -2973,7 +2985,8 @@ static const struct nvme_ctrl_ops nvme_pci_ctrl_ops = {
 	.name			= "pcie",
 	.module			= THIS_MODULE,
 	.flags			= NVME_F_METADATA_SUPPORTED |
-				  NVME_F_PCI_P2PDMA,
+				  NVME_F_PCI_P2PDMA |
+				  NVME_F_BB,
 	.reg_read32		= nvme_pci_reg_read32,
 	.reg_write32		= nvme_pci_reg_write32,
 	.reg_read64		= nvme_pci_reg_read64,
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index f626a445d1a8..27d568633e6e 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -796,6 +796,7 @@ enum {
  *
  * For struct nvme_sgl_desc:
  *   @NVME_SGL_FMT_DATA_DESC:		data block descriptor
+ *   @NVME_SGL_FMT_BB_DESC:		bit bucket descriptor
  *   @NVME_SGL_FMT_SEG_DESC:		sgl segment descriptor
  *   @NVME_SGL_FMT_LAST_SEG_DESC:	last sgl segment descriptor
  *
@@ -807,6 +808,7 @@ enum {
  */
 enum {
 	NVME_SGL_FMT_DATA_DESC		= 0x00,
+	NVME_SGL_FMT_BB_DESC		= 0x01,
 	NVME_SGL_FMT_SEG_DESC		= 0x02,
 	NVME_SGL_FMT_LAST_SEG_DESC	= 0x03,
 	NVME_KEY_SGL_FMT_DATA_DESC	= 0x04,
-- 
2.30.2




* Re: [RFC 0/4] partial sector read support
  2022-05-04 16:32 [RFC 0/4] partial sector read support Keith Busch
                   ` (3 preceding siblings ...)
  2022-05-04 16:32 ` [RFC 4/4] nvme: " Keith Busch
@ 2022-05-10 22:41 ` Sagi Grimberg
  2022-05-11  0:15   ` Keith Busch
  4 siblings, 1 reply; 7+ messages in thread
From: Sagi Grimberg @ 2022-05-10 22:41 UTC (permalink / raw)
  To: Keith Busch, linux-nvme, linux-block; +Cc: axboe, hch, Keith Busch


> From: Keith Busch <kbusch@kernel.org>
> 
> Don't you just hate that you must read a full sector when you only cared
> about a few bytes? Well, good news! Standardized protocols provide a way
> for the host to describe unwanted read data, allowing partial sector
> access. This lets applications reduce allocated memory that was used to
> hold the unwanted data, while also reducing link traffic.
> 
> This series enables this for the NVMe protocol through direct io. With
> this, a userspace app can read a single contiguous byte range, subject
> to hardware DMA constraints. An in-kernel user could theoretically
> construct a bio to read multiple discontiguous ranges if desired.

So userspace needs to look for both bb queue flag and dma alignment to
infer supported range(s)?



* Re: [RFC 0/4] partial sector read support
  2022-05-10 22:41 ` [RFC 0/4] partial sector read support Sagi Grimberg
@ 2022-05-11  0:15   ` Keith Busch
  0 siblings, 0 replies; 7+ messages in thread
From: Keith Busch @ 2022-05-11  0:15 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: Keith Busch, linux-nvme, axboe, hch

On Tue, May 10, 2022 at 03:41:35PM -0700, Sagi Grimberg wrote:
> > From: Keith Busch <kbusch@kernel.org>
> > 
> > Don't you just hate that you must read a full sector when you only cared
> > about a few bytes? Well, good news! Standardized protocols provide a way
> > for the host to describe unwanted read data, allowing partial sector
> > access. This lets applications reduce allocated memory that was used to
> > hold the unwanted data, while also reducing link traffic.
> > 
> > This series enables this for the NVMe protocol through direct io. With
> > this, a userspace app can read a single contiguous byte range, subject
> > to hardware DMA constraints. An in-kernel user could theoretically
> > construct a bio to read multiple discontiguous ranges if desired.
> 
> So userspace needs to look for both bb queue flag and dma alignment to
> infer supported range(s)?

If you don't know ahead of time, then yes, userspace will need to query the
attributes. There are existing attributes that inform userspace of unique
hardware capabilities that affect their access: 'dax' and 'zoned' are such
examples that come to mind.


