linux-scsi.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCHv9 0/7] write hints with nvme fdp, scsi streams
@ 2024-10-25 21:36 Keith Busch
  2024-10-25 21:36 ` [PATCHv9 3/7] block: allow ability to limit partition write hints Keith Busch
                   ` (6 more replies)
  0 siblings, 7 replies; 20+ messages in thread
From: Keith Busch @ 2024-10-25 21:36 UTC (permalink / raw)
  To: linux-block, linux-nvme, linux-scsi, io-uring
  Cc: linux-fsdevel, hch, joshi.k, javier.gonz, bvanassche, Keith Busch

From: Keith Busch <kbusch@kernel.org>

A little something for everyone here.

Upfront, I really didn't get the feedback about setting different flags
for stream vs. temperature support. Who wants to use it, and where and
how is that information used?

Changes from v8:

  Added reviews.

  Removed an unused header.

  Changed "hint" to "streams" in the commit logs.

  Ability to split available hints that a partition can use.

  Dropped all the generic filesystem changes that were defaulting to the
  kiocb write_hint. They are unchanged, which having no functional
  change was really the intention anyway, so let's just not change them.

  The above means we don't need a special fop flag to indicate support
  for the kiocb write_hint. Those filesystems that don't support it
  simply don't read it.

  Added the SCSI support since I had to read the spec anyway, and it is
  just a one-line change.

Kanchan Joshi (2):
  io_uring: enable per-io hinting capability
  nvme: enable FDP support

Keith Busch (5):
  block: use generic u16 for write hints
  block: introduce max_write_hints queue limit
  block: allow ability to limit partition write hints
  block, fs: add write hint to kiocb
  scsi: set permanent stream count in block limits

 Documentation/ABI/stable/sysfs-block |  7 +++
 block/bdev.c                         | 15 +++++
 block/blk-settings.c                 |  3 +
 block/blk-sysfs.c                    |  3 +
 block/fops.c                         | 26 ++++++++-
 block/partitions/core.c              | 46 +++++++++++++++-
 drivers/nvme/host/core.c             | 82 ++++++++++++++++++++++++++++
 drivers/nvme/host/nvme.h             |  5 ++
 drivers/scsi/sd.c                    |  2 +
 include/linux/blk-mq.h               |  3 +-
 include/linux/blk_types.h            |  4 +-
 include/linux/blkdev.h               | 12 ++++
 include/linux/fs.h                   |  1 +
 include/linux/nvme.h                 | 19 +++++++
 include/uapi/linux/io_uring.h        |  4 ++
 io_uring/rw.c                        |  3 +-
 16 files changed, 225 insertions(+), 10 deletions(-)

-- 
2.43.5


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCHv9 3/7] block: allow ability to limit partition write hints
  2024-10-25 21:36 [PATCHv9 0/7] write hints with nvme fdp, scsi streams Keith Busch
@ 2024-10-25 21:36 ` Keith Busch
  2024-10-28 11:58   ` Christoph Hellwig
                     ` (2 more replies)
  2024-10-25 21:36 ` [PATCHv9 4/7] block, fs: add write hint to kiocb Keith Busch
                   ` (5 subsequent siblings)
  6 siblings, 3 replies; 20+ messages in thread
From: Keith Busch @ 2024-10-25 21:36 UTC (permalink / raw)
  To: linux-block, linux-nvme, linux-scsi, io-uring
  Cc: linux-fsdevel, hch, joshi.k, javier.gonz, bvanassche, Keith Busch

From: Keith Busch <kbusch@kernel.org>

When multiple partitions are used, you may want to enforce different
subsets of the available write hints for each partition. Provide a
bitmap attribute of the available write hints, and allow an admin to
write a different mask to set the partition's allowed write hints.

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 block/bdev.c              | 15 +++++++++++++
 block/partitions/core.c   | 46 +++++++++++++++++++++++++++++++++++++--
 include/linux/blk_types.h |  1 +
 3 files changed, 60 insertions(+), 2 deletions(-)

diff --git a/block/bdev.c b/block/bdev.c
index 738e3c8457e7f..5d23648db457b 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -414,6 +414,7 @@ void __init bdev_cache_init(void)
 
 struct block_device *bdev_alloc(struct gendisk *disk, u8 partno)
 {
+	unsigned short max_write_hints;
 	struct block_device *bdev;
 	struct inode *inode;
 
@@ -440,6 +441,20 @@ struct block_device *bdev_alloc(struct gendisk *disk, u8 partno)
 		return NULL;
 	}
 	bdev->bd_disk = disk;
+
+	max_write_hints = bdev_max_write_hints(bdev);
+	if (max_write_hints) {
+		int size = BITS_TO_LONGS(max_write_hints) * sizeof(long);
+
+		bdev->write_hint_mask = kmalloc(size, GFP_KERNEL);
+		if (!bdev->write_hint_mask) {
+			free_percpu(bdev->bd_stats);
+			iput(inode);
+			return NULL;
+		}
+		memset(bdev->write_hint_mask, 0xff, size);
+	}
+
 	return bdev;
 }
 
diff --git a/block/partitions/core.c b/block/partitions/core.c
index 815ed33caa1b8..c0ea0a7b6fa87 100644
--- a/block/partitions/core.c
+++ b/block/partitions/core.c
@@ -203,6 +203,42 @@ static ssize_t part_discard_alignment_show(struct device *dev,
 	return sprintf(buf, "%u\n", bdev_discard_alignment(dev_to_bdev(dev)));
 }
 
+static ssize_t part_write_hint_mask_show(struct device *dev,
+					 struct device_attribute *attr,
+					 char *buf)
+{
+	struct block_device *bdev = dev_to_bdev(dev);
+	unsigned short max_write_hints = bdev_max_write_hints(bdev);
+
+	if (max_write_hints)
+		return sprintf(buf, "%*pb\n", max_write_hints, bdev->write_hint_mask);
+	else
+		return sprintf(buf, "0");
+}
+
+static ssize_t part_write_hint_mask_store(struct device *dev,
+					  struct device_attribute *attr,
+					  const char *buf, size_t count)
+{
+	struct block_device *bdev = dev_to_bdev(dev);
+	unsigned short max_write_hints = bdev_max_write_hints(bdev);
+	unsigned long *new_mask;
+	int size;
+
+	if (!max_write_hints)
+		return count;
+
+	size = BITS_TO_LONGS(max_write_hints) * sizeof(long);
+	new_mask = kzalloc(size, GFP_KERNEL);
+	if (!new_mask)
+		return -ENOMEM;
+
+	bitmap_parse(buf, count, new_mask, max_write_hints);
+	bitmap_copy(bdev->write_hint_mask, new_mask, max_write_hints);
+
+	return count;
+}
+
 static DEVICE_ATTR(partition, 0444, part_partition_show, NULL);
 static DEVICE_ATTR(start, 0444, part_start_show, NULL);
 static DEVICE_ATTR(size, 0444, part_size_show, NULL);
@@ -211,6 +247,8 @@ static DEVICE_ATTR(alignment_offset, 0444, part_alignment_offset_show, NULL);
 static DEVICE_ATTR(discard_alignment, 0444, part_discard_alignment_show, NULL);
 static DEVICE_ATTR(stat, 0444, part_stat_show, NULL);
 static DEVICE_ATTR(inflight, 0444, part_inflight_show, NULL);
+static DEVICE_ATTR(write_hint_mask, 0644, part_write_hint_mask_show,
+		   part_write_hint_mask_store);
 #ifdef CONFIG_FAIL_MAKE_REQUEST
 static struct device_attribute dev_attr_fail =
 	__ATTR(make-it-fail, 0644, part_fail_show, part_fail_store);
@@ -225,6 +263,7 @@ static struct attribute *part_attrs[] = {
 	&dev_attr_discard_alignment.attr,
 	&dev_attr_stat.attr,
 	&dev_attr_inflight.attr,
+	&dev_attr_write_hint_mask.attr,
 #ifdef CONFIG_FAIL_MAKE_REQUEST
 	&dev_attr_fail.attr,
 #endif
@@ -245,8 +284,11 @@ static const struct attribute_group *part_attr_groups[] = {
 
 static void part_release(struct device *dev)
 {
-	put_disk(dev_to_bdev(dev)->bd_disk);
-	bdev_drop(dev_to_bdev(dev));
+	struct block_device *part = dev_to_bdev(dev);
+
+	kfree(part->write_hint_mask);
+	put_disk(part->bd_disk);
+	bdev_drop(part);
 }
 
 static int part_uevent(const struct device *dev, struct kobj_uevent_env *env)
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 6737795220e18..af430e543f7f7 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -73,6 +73,7 @@ struct block_device {
 #ifdef CONFIG_SECURITY
 	void			*bd_security;
 #endif
+	unsigned long		*write_hint_mask;
 	/*
 	 * keep this out-of-line as it's both big and not needed in the fast
 	 * path
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCHv9 4/7] block, fs: add write hint to kiocb
  2024-10-25 21:36 [PATCHv9 0/7] write hints with nvme fdp, scsi streams Keith Busch
  2024-10-25 21:36 ` [PATCHv9 3/7] block: allow ability to limit partition write hints Keith Busch
@ 2024-10-25 21:36 ` Keith Busch
  2024-10-28 11:59   ` Christoph Hellwig
  2024-10-25 21:36 ` [PATCHv9 7/7] scsi: set permanent stream count in block limits Keith Busch
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 20+ messages in thread
From: Keith Busch @ 2024-10-25 21:36 UTC (permalink / raw)
  To: linux-block, linux-nvme, linux-scsi, io-uring
  Cc: linux-fsdevel, hch, joshi.k, javier.gonz, bvanassche, Keith Busch

From: Keith Busch <kbusch@kernel.org>

This prepares for sources other than the inode to provide a write hint.
The block layer will use it for direct IO if the requested hint is
within the block device's capabilities.

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 block/fops.c       | 26 +++++++++++++++++++++++---
 include/linux/fs.h |  1 +
 2 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/block/fops.c b/block/fops.c
index 2d01c90076813..e3f3f1957d86d 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -71,7 +71,7 @@ static ssize_t __blkdev_direct_IO_simple(struct kiocb *iocb,
 		bio_init(&bio, bdev, vecs, nr_pages, dio_bio_write_op(iocb));
 	}
 	bio.bi_iter.bi_sector = pos >> SECTOR_SHIFT;
-	bio.bi_write_hint = file_inode(iocb->ki_filp)->i_write_hint;
+	bio.bi_write_hint = iocb->ki_write_hint;
 	bio.bi_ioprio = iocb->ki_ioprio;
 	if (iocb->ki_flags & IOCB_ATOMIC)
 		bio.bi_opf |= REQ_ATOMIC;
@@ -200,7 +200,7 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
 
 	for (;;) {
 		bio->bi_iter.bi_sector = pos >> SECTOR_SHIFT;
-		bio->bi_write_hint = file_inode(iocb->ki_filp)->i_write_hint;
+		bio->bi_write_hint = iocb->ki_write_hint;
 		bio->bi_private = dio;
 		bio->bi_end_io = blkdev_bio_end_io;
 		bio->bi_ioprio = iocb->ki_ioprio;
@@ -316,7 +316,7 @@ static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb,
 	dio->flags = 0;
 	dio->iocb = iocb;
 	bio->bi_iter.bi_sector = pos >> SECTOR_SHIFT;
-	bio->bi_write_hint = file_inode(iocb->ki_filp)->i_write_hint;
+	bio->bi_write_hint = iocb->ki_write_hint;
 	bio->bi_end_io = blkdev_bio_end_io_async;
 	bio->bi_ioprio = iocb->ki_ioprio;
 
@@ -362,6 +362,23 @@ static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb,
 	return -EIOCBQUEUED;
 }
 
+static u16 blkdev_write_hint(struct kiocb *iocb, struct block_device *bdev)
+{
+	u16 hint = iocb->ki_write_hint;
+
+	if (!hint)
+		return file_inode(iocb->ki_filp)->i_write_hint;
+
+	if (hint > bdev_max_write_hints(bdev))
+		return file_inode(iocb->ki_filp)->i_write_hint;
+
+	if (bdev_is_partition(bdev) &&
+	    !test_bit(hint - 1, bdev->write_hint_mask))
+		return file_inode(iocb->ki_filp)->i_write_hint;
+
+	return hint;
+}
+
 static ssize_t blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 {
 	struct block_device *bdev = I_BDEV(iocb->ki_filp->f_mapping->host);
@@ -373,6 +390,9 @@ static ssize_t blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 	if (blkdev_dio_invalid(bdev, iocb, iter))
 		return -EINVAL;
 
+	if (iov_iter_rw(iter) == WRITE)
+		iocb->ki_write_hint = blkdev_write_hint(iocb, bdev);
+
 	nr_pages = bio_iov_vecs_to_alloc(iter, BIO_MAX_VECS + 1);
 	if (likely(nr_pages <= BIO_MAX_VECS)) {
 		if (is_sync_kiocb(iocb))
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 4b5cad44a1268..1a00accf412e5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -370,6 +370,7 @@ struct kiocb {
 	void			*private;
 	int			ki_flags;
 	u16			ki_ioprio; /* See linux/ioprio.h */
+	u16			ki_write_hint;
 	union {
 		/*
 		 * Only used for async buffered reads, where it denotes the
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCHv9 7/7] scsi: set permanent stream count in block limits
  2024-10-25 21:36 [PATCHv9 0/7] write hints with nvme fdp, scsi streams Keith Busch
  2024-10-25 21:36 ` [PATCHv9 3/7] block: allow ability to limit partition write hints Keith Busch
  2024-10-25 21:36 ` [PATCHv9 4/7] block, fs: add write hint to kiocb Keith Busch
@ 2024-10-25 21:36 ` Keith Busch
  2024-10-28 16:13   ` Bart Van Assche
  2024-10-29  7:10   ` Hannes Reinecke
  2024-10-28 11:49 ` [PATCHv9 0/7] write hints with nvme fdp, scsi streams Christoph Hellwig
                   ` (3 subsequent siblings)
  6 siblings, 2 replies; 20+ messages in thread
From: Keith Busch @ 2024-10-25 21:36 UTC (permalink / raw)
  To: linux-block, linux-nvme, linux-scsi, io-uring
  Cc: linux-fsdevel, hch, joshi.k, javier.gonz, bvanassche, Keith Busch

From: Keith Busch <kbusch@kernel.org>

The block limits exports the number of write hints, so set this limit if
the device reports support for the lifetime hints. Not only does this
inform the user of which hints are possible, it also allows scsi devices
supporting the feature to utilize the full range through raw block
device direct-io.

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 drivers/scsi/sd.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index ca4bc0ac76adc..235dd6e5b6688 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -3768,6 +3768,8 @@ static int sd_revalidate_disk(struct gendisk *disk)
 		sd_config_protection(sdkp, &lim);
 	}
 
+	lim.max_write_hints = sdkp->permanent_stream_count;
+
 	/*
 	 * We now have all cache related info, determine how we deal
 	 * with flush requests.
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCHv9 0/7] write hints with nvme fdp, scsi streams
  2024-10-25 21:36 [PATCHv9 0/7] write hints with nvme fdp, scsi streams Keith Busch
                   ` (2 preceding siblings ...)
  2024-10-25 21:36 ` [PATCHv9 7/7] scsi: set permanent stream count in block limits Keith Busch
@ 2024-10-28 11:49 ` Christoph Hellwig
       [not found] ` <20241025213645.3464331-3-kbusch@meta.com>
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 20+ messages in thread
From: Christoph Hellwig @ 2024-10-28 11:49 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-block, linux-nvme, linux-scsi, io-uring, linux-fsdevel, hch,
	joshi.k, javier.gonz, bvanassche, Keith Busch

On Fri, Oct 25, 2024 at 02:36:38PM -0700, Keith Busch wrote:
> Upfront, I really didn't get the feedback about setting different flags
> for stream vs. temperature support. Who wants to use it, and where and
> how is that information used?

The SBC permanent streams have a 'relative lifetime' associated with
them.  So if you expose them as plain write streams you violate the
contract with the device.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCHv9 2/7] block: introduce max_write_hints queue limit
       [not found] ` <20241025213645.3464331-3-kbusch@meta.com>
@ 2024-10-28 11:51   ` Christoph Hellwig
  2024-10-28 11:52     ` Christoph Hellwig
  0 siblings, 1 reply; 20+ messages in thread
From: Christoph Hellwig @ 2024-10-28 11:51 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-block, linux-nvme, linux-scsi, io-uring, linux-fsdevel, hch,
	joshi.k, javier.gonz, bvanassche, Keith Busch, Hannes Reinecke

On Fri, Oct 25, 2024 at 02:36:40PM -0700, Keith Busch wrote:
> +static inline unsigned short bdev_max_write_hints(struct block_device *bdev)
> +{
> +	return queue_max_write_hints(bdev_get_queue(bdev));
> +}

As pointed out by Bart last time, you can't simply give the write hints
to all block device.  Assume we'd want to wire up the write stream based
separate to f2fs (which btw would be a good demonstration), and you'd
have two different f2fs file systems on separate partitions that'd
now start sharing the write streams if they simply started from stream
1.  Same for our pending XFS data placement work.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCHv9 2/7] block: introduce max_write_hints queue limit
  2024-10-28 11:51   ` [PATCHv9 2/7] block: introduce max_write_hints queue limit Christoph Hellwig
@ 2024-10-28 11:52     ` Christoph Hellwig
  0 siblings, 0 replies; 20+ messages in thread
From: Christoph Hellwig @ 2024-10-28 11:52 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-block, linux-nvme, linux-scsi, io-uring, linux-fsdevel, hch,
	joshi.k, javier.gonz, bvanassche, Keith Busch, Hannes Reinecke

On Mon, Oct 28, 2024 at 12:51:32PM +0100, Christoph Hellwig wrote:
> As pointed out by Bart last time, you can't simply give the write hints
> to all block device.  Assume we'd want to wire up the write stream based
> separate to f2fs (which btw would be a good demonstration), and you'd
> have two different f2fs file systems on separate partitions that'd
> now start sharing the write streams if they simply started from stream
> 1.  Same for our pending XFS data placement work.

And I'm an idiot and should have looked at the next patch patch first.
Sorry for that.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCHv9 3/7] block: allow ability to limit partition write hints
  2024-10-25 21:36 ` [PATCHv9 3/7] block: allow ability to limit partition write hints Keith Busch
@ 2024-10-28 11:58   ` Christoph Hellwig
  2024-10-28 14:49     ` Keith Busch
  2024-10-28 14:40   ` Kanchan Joshi
  2024-10-28 18:27   ` Bart Van Assche
  2 siblings, 1 reply; 20+ messages in thread
From: Christoph Hellwig @ 2024-10-28 11:58 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-block, linux-nvme, linux-scsi, io-uring, linux-fsdevel, hch,
	joshi.k, javier.gonz, bvanassche, Keith Busch

On Fri, Oct 25, 2024 at 02:36:41PM -0700, Keith Busch wrote:
> From: Keith Busch <kbusch@kernel.org>
> 
> When multiple partitions are used, you may want to enforce different
> subsets of the available write hints for each partition. Provide a
> bitmap attribute of the available write hints, and allow an admin to
> write a different mask to set the partition's allowed write hints.

Trying my best Greg impersonator voice: This needs to be documented
in Documentation/ABI/stable/sysfs-block.

That would have also helped me understanding it.  AFAIK the split here
is an opt-in, which means the use case I explained in the previous
case would still not work out of the box, right?

> +	max_write_hints = bdev_max_write_hints(bdev);
> +	if (max_write_hints) {
> +		int size = BITS_TO_LONGS(max_write_hints) * sizeof(long);
> +
> +		bdev->write_hint_mask = kmalloc(size, GFP_KERNEL);
> +		if (!bdev->write_hint_mask) {
> +			free_percpu(bdev->bd_stats);
> +			iput(inode);
> +			return NULL;
> +		}
> +		memset(bdev->write_hint_mask, 0xff, size);
> +	}

This could simply use bitmap_alloc().  Similarly the other uses
would probably benefit from using the bitmap API.

> +	struct block_device *bdev = dev_to_bdev(dev);
> +	unsigned short max_write_hints = bdev_max_write_hints(bdev);
> +
> +	if (max_write_hints)
> +		return sprintf(buf, "%*pb\n", max_write_hints, bdev->write_hint_mask);
> +	else
> +		return sprintf(buf, "0");

No need for the else.  And if you write this as:

	if (!max_write_hints)
		return sprintf(buf, "0");
	return sprintf(buf, "%*pb\n", max_write_hints, bdev->write_hint_mask);

you'd also avoid the overly long line.

> +
> +static ssize_t part_write_hint_mask_store(struct device *dev,
> +					  struct device_attribute *attr,
> +					  const char *buf, size_t count)
> +{
> +	struct block_device *bdev = dev_to_bdev(dev);
> +	unsigned short max_write_hints = bdev_max_write_hints(bdev);
> +	unsigned long *new_mask;
> +	int size;
> +
> +	if (!max_write_hints)
> +		return count;
> +
> +	size = BITS_TO_LONGS(max_write_hints) * sizeof(long);
> +	new_mask = kzalloc(size, GFP_KERNEL);
> +	if (!new_mask)
> +		return -ENOMEM;
> +
> +	bitmap_parse(buf, count, new_mask, max_write_hints);
> +	bitmap_copy(bdev->write_hint_mask, new_mask, max_write_hints);

What protects access to bdev->write_hint_mask?


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCHv9 4/7] block, fs: add write hint to kiocb
  2024-10-25 21:36 ` [PATCHv9 4/7] block, fs: add write hint to kiocb Keith Busch
@ 2024-10-28 11:59   ` Christoph Hellwig
  2024-10-28 14:38     ` Keith Busch
  0 siblings, 1 reply; 20+ messages in thread
From: Christoph Hellwig @ 2024-10-28 11:59 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-block, linux-nvme, linux-scsi, io-uring, linux-fsdevel, hch,
	joshi.k, javier.gonz, bvanassche, Keith Busch

On Fri, Oct 25, 2024 at 02:36:42PM -0700, Keith Busch wrote:
> +static u16 blkdev_write_hint(struct kiocb *iocb, struct block_device *bdev)
> +{
> +	u16 hint = iocb->ki_write_hint;
> +
> +	if (!hint)
> +		return file_inode(iocb->ki_filp)->i_write_hint;
> +
> +	if (hint > bdev_max_write_hints(bdev))
> +		return file_inode(iocb->ki_filp)->i_write_hint;
> +
> +	if (bdev_is_partition(bdev) &&
> +	    !test_bit(hint - 1, bdev->write_hint_mask))
> +		return file_inode(iocb->ki_filp)->i_write_hint;

I would have expected an error when using an invalid stream identifier.

That of course requires telling the application how many are available
through e.g. statx as requested last time.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCHv9 4/7] block, fs: add write hint to kiocb
  2024-10-28 11:59   ` Christoph Hellwig
@ 2024-10-28 14:38     ` Keith Busch
  2024-10-28 16:08       ` Christoph Hellwig
  0 siblings, 1 reply; 20+ messages in thread
From: Keith Busch @ 2024-10-28 14:38 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, linux-block, linux-nvme, linux-scsi, io-uring,
	linux-fsdevel, joshi.k, javier.gonz, bvanassche

On Mon, Oct 28, 2024 at 12:59:32PM +0100, Christoph Hellwig wrote:
> On Fri, Oct 25, 2024 at 02:36:42PM -0700, Keith Busch wrote:
> > +static u16 blkdev_write_hint(struct kiocb *iocb, struct block_device *bdev)
> > +{
> > +	u16 hint = iocb->ki_write_hint;
> > +
> > +	if (!hint)
> > +		return file_inode(iocb->ki_filp)->i_write_hint;
> > +
> > +	if (hint > bdev_max_write_hints(bdev))
> > +		return file_inode(iocb->ki_filp)->i_write_hint;
> > +
> > +	if (bdev_is_partition(bdev) &&
> > +	    !test_bit(hint - 1, bdev->write_hint_mask))
> > +		return file_inode(iocb->ki_filp)->i_write_hint;
> 
> I would have expected an error when using an invalid stream identifier.

It's a hint. fcntl doesn't error if you give an unusable hint, so
neither should this. You get sane default behavior.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCHv9 3/7] block: allow ability to limit partition write hints
  2024-10-25 21:36 ` [PATCHv9 3/7] block: allow ability to limit partition write hints Keith Busch
  2024-10-28 11:58   ` Christoph Hellwig
@ 2024-10-28 14:40   ` Kanchan Joshi
  2024-10-28 18:27   ` Bart Van Assche
  2 siblings, 0 replies; 20+ messages in thread
From: Kanchan Joshi @ 2024-10-28 14:40 UTC (permalink / raw)
  To: Keith Busch, linux-block, linux-nvme, linux-scsi, io-uring
  Cc: linux-fsdevel, hch, javier.gonz, bvanassche, Keith Busch

On 10/26/2024 3:06 AM, Keith Busch wrote:
> +static ssize_t part_write_hint_mask_store(struct device *dev,
> +					  struct device_attribute *attr,
> +					  const char *buf, size_t count)
> +{
> +	struct block_device *bdev = dev_to_bdev(dev);
> +	unsigned short max_write_hints = bdev_max_write_hints(bdev);
> +	unsigned long *new_mask;
> +	int size;
> +
> +	if (!max_write_hints)
> +		return count;
> +
> +	size = BITS_TO_LONGS(max_write_hints) * sizeof(long);
> +	new_mask = kzalloc(size, GFP_KERNEL);
> +	if (!new_mask)
> +		return -ENOMEM;
> +
> +	bitmap_parse(buf, count, new_mask, max_write_hints);
> +	bitmap_copy(bdev->write_hint_mask, new_mask, max_write_hints);

kfree(new_mask) here.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCHv9 3/7] block: allow ability to limit partition write hints
  2024-10-28 11:58   ` Christoph Hellwig
@ 2024-10-28 14:49     ` Keith Busch
  0 siblings, 0 replies; 20+ messages in thread
From: Keith Busch @ 2024-10-28 14:49 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, linux-block, linux-nvme, linux-scsi, io-uring,
	linux-fsdevel, joshi.k, javier.gonz, bvanassche

On Mon, Oct 28, 2024 at 12:58:05PM +0100, Christoph Hellwig wrote:
> On Fri, Oct 25, 2024 at 02:36:41PM -0700, Keith Busch wrote:
> > From: Keith Busch <kbusch@kernel.org>
> > 
> > When multiple partitions are used, you may want to enforce different
> > subsets of the available write hints for each partition. Provide a
> > bitmap attribute of the available write hints, and allow an admin to
> > write a different mask to set the partition's allowed write hints.
> 
> Trying my best Greg impersonator voice: This needs to be documented
> in Documentation/ABI/stable/sysfs-block.
> 
> That would have also helped me understanding it.  AFAIK the split here
> is an opt-in, which means the use case I explained in the previous
> case would still not work out of the box, right?

Right.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCHv9 4/7] block, fs: add write hint to kiocb
  2024-10-28 14:38     ` Keith Busch
@ 2024-10-28 16:08       ` Christoph Hellwig
  0 siblings, 0 replies; 20+ messages in thread
From: Christoph Hellwig @ 2024-10-28 16:08 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Keith Busch, linux-block, linux-nvme,
	linux-scsi, io-uring, linux-fsdevel, joshi.k, javier.gonz,
	bvanassche

On Mon, Oct 28, 2024 at 08:38:05AM -0600, Keith Busch wrote:
> > > +	if (bdev_is_partition(bdev) &&
> > > +	    !test_bit(hint - 1, bdev->write_hint_mask))
> > > +		return file_inode(iocb->ki_filp)->i_write_hint;
> > 
> > I would have expected an error when using an invalid stream identifier.
> 
> It's a hint. fcntl doesn't error if you give an unusable hint, so
> neither should this. You get sane default behavior.

Well, why does it have to be a hint?

If I have a data placement aware application I really want to know
into how many buckets I can sort and adjust my algorithm for it.
And I'd rather have an error checked interface to pass that down
as far as I can.

Same for my file system use case.  I guess you have a use case where
a hint would be enough, at least with good enough knowledge of the
underlying implementation.  But would there be an actual downside
in not having a hint?  Because historically speaking everything we've
done as a not error checked vaguely defined hint has not been all
the useful, but it in hardware or software interfaces.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCHv9 7/7] scsi: set permanent stream count in block limits
  2024-10-25 21:36 ` [PATCHv9 7/7] scsi: set permanent stream count in block limits Keith Busch
@ 2024-10-28 16:13   ` Bart Van Assche
  2024-10-29  7:10   ` Hannes Reinecke
  1 sibling, 0 replies; 20+ messages in thread
From: Bart Van Assche @ 2024-10-28 16:13 UTC (permalink / raw)
  To: Keith Busch, linux-block, linux-nvme, linux-scsi, io-uring
  Cc: linux-fsdevel, hch, joshi.k, javier.gonz, Keith Busch

On 10/25/24 2:36 PM, Keith Busch wrote:
> From: Keith Busch <kbusch@kernel.org>
> 
> The block limits exports the number of write hints, so set this limit if
> the device reports support for the lifetime hints. Not only does this
> inform the user of which hints are possible, it also allows scsi devices
> supporting the feature to utilize the full range through raw block
> device direct-io.
> 
> Signed-off-by: Keith Busch <kbusch@kernel.org>
> ---
>   drivers/scsi/sd.c | 2 ++
>   1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> index ca4bc0ac76adc..235dd6e5b6688 100644
> --- a/drivers/scsi/sd.c
> +++ b/drivers/scsi/sd.c
> @@ -3768,6 +3768,8 @@ static int sd_revalidate_disk(struct gendisk *disk)
>   		sd_config_protection(sdkp, &lim);
>   	}
>   
> +	lim.max_write_hints = sdkp->permanent_stream_count;
> +
>   	/*
>   	 * We now have all cache related info, determine how we deal
>   	 * with flush requests.

Reviewed-by: Bart Van Assche <bvanassche@acm.org>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCHv9 1/7] block: use generic u16 for write hints
       [not found] ` <20241025213645.3464331-2-kbusch@meta.com>
@ 2024-10-28 18:19   ` Bart Van Assche
  2024-10-28 18:38     ` Keith Busch
  0 siblings, 1 reply; 20+ messages in thread
From: Bart Van Assche @ 2024-10-28 18:19 UTC (permalink / raw)
  To: Keith Busch, linux-block, linux-nvme, linux-scsi, io-uring
  Cc: linux-fsdevel, hch, joshi.k, javier.gonz, Keith Busch,
	Hannes Reinecke

On 10/25/24 2:36 PM, Keith Busch wrote:
> This is still backwards compatible with lifetime hints. It just doesn't
> constrain the hints to that definition.

Since struct bio is modified, and since it is important not to increase
the size of struct bio, some comments about whether or not the size of
struct bio is affected would be welcome.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCHv9 3/7] block: allow ability to limit partition write hints
  2024-10-25 21:36 ` [PATCHv9 3/7] block: allow ability to limit partition write hints Keith Busch
  2024-10-28 11:58   ` Christoph Hellwig
  2024-10-28 14:40   ` Kanchan Joshi
@ 2024-10-28 18:27   ` Bart Van Assche
  2024-10-28 19:46     ` Keith Busch
  2 siblings, 1 reply; 20+ messages in thread
From: Bart Van Assche @ 2024-10-28 18:27 UTC (permalink / raw)
  To: Keith Busch, linux-block, linux-nvme, linux-scsi, io-uring
  Cc: linux-fsdevel, hch, joshi.k, javier.gonz, Keith Busch

On 10/25/24 2:36 PM, Keith Busch wrote:
> When multiple partitions are used, you may want to enforce different
> subsets of the available write hints for each partition. Provide a
> bitmap attribute of the available write hints, and allow an admin to
> write a different mask to set the partition's allowed write hints.

After /proc/irq/*/smp_affinity was introduced (a bitmask),
/proc/irq/*/smp_affinity_list (set of ranges) was introduced as a more
user-friendly alternative. Is the same expected to happen with the
write_hint_mask? If so, shouldn't we skip the bitmask user space
interface and directly introduce the more user friendly interface (set
of ranges)?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCHv9 1/7] block: use generic u16 for write hints
  2024-10-28 18:19   ` [PATCHv9 1/7] block: use generic u16 for write hints Bart Van Assche
@ 2024-10-28 18:38     ` Keith Busch
  0 siblings, 0 replies; 20+ messages in thread
From: Keith Busch @ 2024-10-28 18:38 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Keith Busch, linux-block, linux-nvme, linux-scsi, io-uring,
	linux-fsdevel, hch, joshi.k, javier.gonz, Hannes Reinecke

On Mon, Oct 28, 2024 at 11:19:33AM -0700, Bart Van Assche wrote:
> On 10/25/24 2:36 PM, Keith Busch wrote:
> > This is still backwards compatible with lifetime hints. It just doesn't
> > constrain the hints to that definition.
> 
> Since struct bio is modified, and since it is important not to increase
> the size of struct bio, some comments about whether or not the size of
> struct bio is affected would be welcome.

Sure. The type just shrinks a hole from 2 bytes to 1. Total size remains
the same.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCHv9 3/7] block: allow ability to limit partition write hints
  2024-10-28 18:27   ` Bart Van Assche
@ 2024-10-28 19:46     ` Keith Busch
  0 siblings, 0 replies; 20+ messages in thread
From: Keith Busch @ 2024-10-28 19:46 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Keith Busch, linux-block, linux-nvme, linux-scsi, io-uring,
	linux-fsdevel, hch, joshi.k, javier.gonz

On Mon, Oct 28, 2024 at 11:27:33AM -0700, Bart Van Assche wrote:
> On 10/25/24 2:36 PM, Keith Busch wrote:
> > When multiple partitions are used, you may want to enforce different
> > subsets of the available write hints for each partition. Provide a
> > bitmap attribute of the available write hints, and allow an admin to
> > write a different mask to set the partition's allowed write hints.
> 
> After /proc/irq/*/smp_affinity was introduced (a bitmask),
> /proc/irq/*/smp_affinity_list (set of ranges) was introduced as a more
> user-friendly alternative. Is the same expected to happen with the
> write_hint_mask? If so, shouldn't we skip the bitmask user space
> interface and directly introduce the more user friendly interface (set
> of ranges)?

I don't much of have an opinion either way. One thing I like for the
bitmask representation is you write 0 to turn it off vs. the list type
writes a null string. Writing 0 to disable just feels more natural to
me, but not a big deal.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCHv9 7/7] scsi: set permanent stream count in block limits
  2024-10-25 21:36 ` [PATCHv9 7/7] scsi: set permanent stream count in block limits Keith Busch
  2024-10-28 16:13   ` Bart Van Assche
@ 2024-10-29  7:10   ` Hannes Reinecke
  1 sibling, 0 replies; 20+ messages in thread
From: Hannes Reinecke @ 2024-10-29  7:10 UTC (permalink / raw)
  To: Keith Busch, linux-block, linux-nvme, linux-scsi, io-uring
  Cc: linux-fsdevel, hch, joshi.k, javier.gonz, bvanassche, Keith Busch

On 10/25/24 23:36, Keith Busch wrote:
> From: Keith Busch <kbusch@kernel.org>
> 
> The block limits exports the number of write hints, so set this limit if
> the device reports support for the lifetime hints. Not only does this
> inform the user of which hints are possible, it also allows scsi devices
> supporting the feature to utilize the full range through raw block
> device direct-io.
> 
> Signed-off-by: Keith Busch <kbusch@kernel.org>
> ---
>   drivers/scsi/sd.c | 2 ++
>   1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> index ca4bc0ac76adc..235dd6e5b6688 100644
> --- a/drivers/scsi/sd.c
> +++ b/drivers/scsi/sd.c
> @@ -3768,6 +3768,8 @@ static int sd_revalidate_disk(struct gendisk *disk)
>   		sd_config_protection(sdkp, &lim);
>   	}
>   
> +	lim.max_write_hints = sdkp->permanent_stream_count;
> +
>   	/*
>   	 * We now have all cache related info, determine how we deal
>   	 * with flush requests.
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCHv9 5/7] io_uring: enable per-io hinting capability
       [not found] ` <20241025213645.3464331-6-kbusch@meta.com>
@ 2024-10-29 12:46   ` Anuj gupta
  0 siblings, 0 replies; 20+ messages in thread
From: Anuj gupta @ 2024-10-29 12:46 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-block, linux-nvme, linux-scsi, io-uring, linux-fsdevel, hch,
	joshi.k, javier.gonz, bvanassche, Hannes Reinecke, Nitesh Shetty,
	Keith Busch

On Sat, Oct 26, 2024 at 3:13 AM Keith Busch <kbusch@meta.com> wrote:
>
> From: Kanchan Joshi <joshi.k@samsung.com>
>
> With F_SET_RW_HINT fcntl, user can set a hint on the file inode, and
> all the subsequent writes on the file pass that hint value down. This
> can be limiting for block device as all the writes will be tagged with
> only one lifetime hint value. Concurrent writes (with different hint
> values) are hard to manage. Per-IO hinting solves that problem.
>
> Allow userspace to pass additional metadata in the SQE.
>
>         __u16 write_hint;
>
> If the hint is provided, filesystems may optionally use it. A filesytem
> may ignore this field if it does not support per-io hints, or if the
> value is invalid for its backing storage. Just like the inode hints,
> requesting values that are not supported by the hardware are not an
> error.
>
> Reviewed-by: Hannes Reinecke <hare@suse.de>
> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
> Signed-off-by: Keith Busch <kbusch@kernel.org>
> ---
>  include/uapi/linux/io_uring.h | 4 ++++
>  io_uring/rw.c                 | 3 ++-
>  2 files changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
> index 60b9c98595faf..8cdcc461d464c 100644
> --- a/include/uapi/linux/io_uring.h
> +++ b/include/uapi/linux/io_uring.h
> @@ -92,6 +92,10 @@ struct io_uring_sqe {
>                         __u16   addr_len;
>                         __u16   __pad3[1];
>                 };
> +               struct {
> +                       __u16   write_hint;
> +                       __u16   __pad4[1];
> +               };
>         };
>         union {
>                 struct {
> diff --git a/io_uring/rw.c b/io_uring/rw.c
> index 8080ffd6d5712..5a1231bfecc3a 100644
> --- a/io_uring/rw.c
> +++ b/io_uring/rw.c
> @@ -279,7 +279,8 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
>                 rw->kiocb.ki_ioprio = get_current_ioprio();
>         }
>         rw->kiocb.dio_complete = NULL;
> -
> +       if (ddir == ITER_SOURCE)
> +               rw->kiocb.ki_write_hint = READ_ONCE(sqe->write_hint);
>         rw->addr = READ_ONCE(sqe->addr);
>         rw->len = READ_ONCE(sqe->len);
>         rw->flags = READ_ONCE(sqe->rw_flags);
> --
> 2.43.5
>

Since this patch adds a couple of new fields, it makes sense to add
BUILD_BUG_ON() checks in io_uring_init for these fields to assert the
layout of struct io_uring_sqe. And probably a zero check for pad4 in
io_prep_rw.
--
Anuj Gupta

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2024-10-29 12:47 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-10-25 21:36 [PATCHv9 0/7] write hints with nvme fdp, scsi streams Keith Busch
2024-10-25 21:36 ` [PATCHv9 3/7] block: allow ability to limit partition write hints Keith Busch
2024-10-28 11:58   ` Christoph Hellwig
2024-10-28 14:49     ` Keith Busch
2024-10-28 14:40   ` Kanchan Joshi
2024-10-28 18:27   ` Bart Van Assche
2024-10-28 19:46     ` Keith Busch
2024-10-25 21:36 ` [PATCHv9 4/7] block, fs: add write hint to kiocb Keith Busch
2024-10-28 11:59   ` Christoph Hellwig
2024-10-28 14:38     ` Keith Busch
2024-10-28 16:08       ` Christoph Hellwig
2024-10-25 21:36 ` [PATCHv9 7/7] scsi: set permanent stream count in block limits Keith Busch
2024-10-28 16:13   ` Bart Van Assche
2024-10-29  7:10   ` Hannes Reinecke
2024-10-28 11:49 ` [PATCHv9 0/7] write hints with nvme fdp, scsi streams Christoph Hellwig
     [not found] ` <20241025213645.3464331-3-kbusch@meta.com>
2024-10-28 11:51   ` [PATCHv9 2/7] block: introduce max_write_hints queue limit Christoph Hellwig
2024-10-28 11:52     ` Christoph Hellwig
     [not found] ` <20241025213645.3464331-2-kbusch@meta.com>
2024-10-28 18:19   ` [PATCHv9 1/7] block: use generic u16 for write hints Bart Van Assche
2024-10-28 18:38     ` Keith Busch
     [not found] ` <20241025213645.3464331-6-kbusch@meta.com>
2024-10-29 12:46   ` [PATCHv9 5/7] io_uring: enable per-io hinting capability Anuj gupta

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).