linux-block.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/11] Add support for write life time hints
@ 2017-06-13 17:15 Jens Axboe
  2017-06-13 17:15 ` [PATCH 01/11] block: add support for carrying stream information in a bio Jens Axboe
                   ` (11 more replies)
  0 siblings, 12 replies; 40+ messages in thread
From: Jens Axboe @ 2017-06-13 17:15 UTC (permalink / raw)
  To: linux-fsdevel, linux-block

A new iteration of this patchset, previously known as write streams.
Instead of exposing numeric values for streams, I've previously
advocated for just doing a set of hints that makes sense instead. See
the coverage from the LSFMM summit this year:

https://lwn.net/Articles/717755/

This patchset attempts to do that. We define 4 flags for the pwritev2
system call:

RWF_WRITE_LIFE_SHORT	Data written with this flag is expected to have
			a high overwrite rate, or life time.

RWF_WRITE_LIFE_MEDIUM	Longer life time than SHORT

RWF_WRITE_LIFE_LONG	Longer life time than MEDIUM

RWF_WRITE_LIFE_EXTREME	Longer life time than LONG

The idea is that these are relative values, so an application can
use them as they see fit. The underlying device can then place
data appropriately, or be free to ignore the hint. It's just a hint.

Comments appreciated. A branch based on current master can be pulled
from here:

git://git.kernel.dk/linux-block write-stream.1

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH 01/11] block: add support for carrying stream information in a bio
  2017-06-13 17:15 [PATCH 0/11] Add support for write life time hints Jens Axboe
@ 2017-06-13 17:15 ` Jens Axboe
  2017-06-13 19:01   ` Andreas Dilger
  2017-06-13 17:15 ` [PATCH 02/11] block: add definition for support data write life times Jens Axboe
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 40+ messages in thread
From: Jens Axboe @ 2017-06-13 17:15 UTC (permalink / raw)
  To: linux-fsdevel, linux-block; +Cc: Jens Axboe

No functional changes in this patch, but it allows each bio to
carry a stream ID and disallows merging of bio's with different
stream IDs.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/bio.c               |  2 ++
 block/blk-merge.c         | 14 ++++++++++++++
 include/linux/blk_types.h | 17 +++++++++++++++++
 3 files changed, 33 insertions(+)

diff --git a/block/bio.c b/block/bio.c
index 888e7801c638..77f4be1f7428 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -595,6 +595,7 @@ void __bio_clone_fast(struct bio *bio, struct bio *bio_src)
 	bio->bi_opf = bio_src->bi_opf;
 	bio->bi_iter = bio_src->bi_iter;
 	bio->bi_io_vec = bio_src->bi_io_vec;
+	bio->bi_stream = bio_src->bi_stream;
 
 	bio_clone_blkcg_association(bio, bio_src);
 }
@@ -678,6 +679,7 @@ struct bio *bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
 	bio->bi_opf		= bio_src->bi_opf;
 	bio->bi_iter.bi_sector	= bio_src->bi_iter.bi_sector;
 	bio->bi_iter.bi_size	= bio_src->bi_iter.bi_size;
+	bio->bi_stream		= bio_src->bi_stream;
 
 	switch (bio_op(bio)) {
 	case REQ_OP_DISCARD:
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 3990ae406341..28998acdd675 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -693,6 +693,13 @@ static struct request *attempt_merge(struct request_queue *q,
 		return NULL;
 
 	/*
+	 * Don't allow merge of different streams, or for a stream with
+	 * non-stream IO.
+	 */
+	if (req->bio->bi_stream != next->bio->bi_stream)
+		return NULL;
+
+	/*
 	 * If we are allowed to merge, then append bio list
 	 * from next to rq and release next. merge_requests_fn
 	 * will have updated segment counts, update sector
@@ -811,6 +818,13 @@ bool blk_rq_merge_ok(struct request *rq, struct bio *bio)
 	    !blk_write_same_mergeable(rq->bio, bio))
 		return false;
 
+	/*
+	 * Don't allow merge of different streams, or for a stream with
+	 * non-stream IO.
+	 */
+	if (rq->bio->bi_stream != bio->bi_stream)
+		return false;
+
 	return true;
 }
 
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 61339bc44400..1940876a7fa7 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -36,6 +36,8 @@ struct bio {
 	unsigned short		bi_flags;	/* status, etc and bvec pool number */
 	unsigned short		bi_ioprio;
 
+	unsigned int		bi_stream;	/* write life time hint */
+
 	struct bvec_iter	bi_iter;
 
 	/* Number of segments in this BIO after
@@ -312,4 +314,19 @@ struct blk_rq_stat {
 	u64 batch;
 };
 
+static inline void bio_set_streamid(struct bio *bio, unsigned int stream)
+{
+	bio->bi_stream = stream;
+}
+
+static inline bool bio_stream_valid(struct bio *bio)
+{
+	return bio->bi_stream != 0;
+}
+
+static inline unsigned int bio_stream(struct bio *bio)
+{
+	return bio->bi_stream;
+}
+
 #endif /* __LINUX_BLK_TYPES_H */
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 02/11] block: add definition for support data write life times
  2017-06-13 17:15 [PATCH 0/11] Add support for write life time hints Jens Axboe
  2017-06-13 17:15 ` [PATCH 01/11] block: add support for carrying stream information in a bio Jens Axboe
@ 2017-06-13 17:15 ` Jens Axboe
  2017-06-13 19:07   ` Andreas Dilger
  2017-06-13 17:15 ` [PATCH 03/11] blk-mq: expose stream write stats through debugfs Jens Axboe
                   ` (9 subsequent siblings)
  11 siblings, 1 reply; 40+ messages in thread
From: Jens Axboe @ 2017-06-13 17:15 UTC (permalink / raw)
  To: linux-fsdevel, linux-block; +Cc: Jens Axboe

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/blk_types.h | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 1940876a7fa7..16c4e657807f 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -314,6 +314,19 @@ struct blk_rq_stat {
 	u64 batch;
 };
 
+/*
+ * Data life time definitions for writes. These are relative within a
+ * given device, no absolute meaning is attached to them.
+ */
+enum {
+	WRITE_LIFE_UNKNOWN	= 0,
+	WRITE_LIFE_SHORT,
+	WRITE_LIFE_MEDIUM,
+	WRITE_LIFE_LONG,
+	WRITE_LIFE_EXTREME,
+	WRITE_LIFE_NR,
+};
+
 static inline void bio_set_streamid(struct bio *bio, unsigned int stream)
 {
 	bio->bi_stream = stream;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 03/11] blk-mq: expose stream write stats through debugfs
  2017-06-13 17:15 [PATCH 0/11] Add support for write life time hints Jens Axboe
  2017-06-13 17:15 ` [PATCH 01/11] block: add support for carrying stream information in a bio Jens Axboe
  2017-06-13 17:15 ` [PATCH 02/11] block: add definition for support data write life times Jens Axboe
@ 2017-06-13 17:15 ` Jens Axboe
  2017-06-13 19:21   ` Andreas Dilger
  2017-06-13 17:15 ` [PATCH 04/11] fs: add support for an inode to carry stream related data Jens Axboe
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 40+ messages in thread
From: Jens Axboe @ 2017-06-13 17:15 UTC (permalink / raw)
  To: linux-fsdevel, linux-block; +Cc: Jens Axboe

Useful to verify that things are working the way they should.
Reading the file will return number of kb written to each
stream. Writing the file will reset the statistics. No care
is taken to ensure that we don't race on updates.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/blk-core.c       |  2 ++
 block/blk-mq-debugfs.c | 24 ++++++++++++++++++++++++
 include/linux/blkdev.h |  2 ++
 3 files changed, 28 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index a7421b772d0e..12e402e4e741 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2057,6 +2057,8 @@ blk_qc_t generic_make_request(struct bio *bio)
 	do {
 		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
 
+		q->stream_writes[bio_stream(bio)] += bio->bi_iter.bi_size >> 9;
+
 		if (likely(blk_queue_enter(q, false) == 0)) {
 			struct bio_list lower, same;
 
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 803aed4d7221..a0a480a0b373 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -133,6 +133,29 @@ static void print_stat(struct seq_file *m, struct blk_rq_stat *stat)
 	}
 }
 
+static int queue_streams_show(void *data, struct seq_file *m)
+{
+	struct request_queue *q = data;
+	int i;
+
+	for (i = 0; i < WRITE_LIFE_NR; i++)
+		seq_printf(m, "stream%d: %llu\n", i, q->stream_writes[i]);
+
+	return 0;
+}
+
+static ssize_t queue_streams_store(void *data, const char __user *buf,
+				   size_t count, loff_t *ppos)
+{
+	struct request_queue *q = data;
+	int i;
+
+	for (i = 0; i < WRITE_LIFE_NR; i++)
+		q->stream_writes[i] = 0;
+
+	return count;
+}
+
 static int queue_poll_stat_show(void *data, struct seq_file *m)
 {
 	struct request_queue *q = data;
@@ -656,6 +679,7 @@ const struct file_operations blk_mq_debugfs_fops = {
 static const struct blk_mq_debugfs_attr blk_mq_debugfs_queue_attrs[] = {
 	{"poll_stat", 0400, queue_poll_stat_show},
 	{"state", 0600, queue_state_show, queue_state_write},
+	{"streams", 0600, queue_streams_show, queue_streams_store},
 	{},
 };
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index ab92c4ea138b..ab623deb20b2 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -586,6 +586,8 @@ struct request_queue {
 
 	size_t			cmd_size;
 	void			*rq_alloc_data;
+
+	u64			stream_writes[WRITE_LIFE_NR];
 };
 
 #define QUEUE_FLAG_QUEUED	1	/* uses generic tag queueing */
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 04/11] fs: add support for an inode to carry stream related data
  2017-06-13 17:15 [PATCH 0/11] Add support for write life time hints Jens Axboe
                   ` (2 preceding siblings ...)
  2017-06-13 17:15 ` [PATCH 03/11] blk-mq: expose stream write stats through debugfs Jens Axboe
@ 2017-06-13 17:15 ` Jens Axboe
  2017-06-13 19:24   ` [PATCH 04/11] " Andreas Dilger
  2017-06-13 17:15 ` [PATCH 05/11] fs: add support for allowing applications to pass in write life time hints Jens Axboe
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 40+ messages in thread
From: Jens Axboe @ 2017-06-13 17:15 UTC (permalink / raw)
  To: linux-fsdevel, linux-block; +Cc: Jens Axboe

No functional changes in this patch, just in preparation for
allowing applications to pass in hints about data life times
for writes.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/inode.c         |  1 +
 include/linux/fs.h | 10 ++++++++++
 2 files changed, 11 insertions(+)

diff --git a/fs/inode.c b/fs/inode.c
index db5914783a71..c911a2b9b5a4 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -149,6 +149,7 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
 	inode->i_blocks = 0;
 	inode->i_bytes = 0;
 	inode->i_generation = 0;
+	inode->i_stream = WRITE_LIFE_UNKNOWN;
 	inode->i_pipe = NULL;
 	inode->i_bdev = NULL;
 	inode->i_cdev = NULL;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 803e5a9b2654..bb8c246eeda8 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -30,6 +30,7 @@
 #include <linux/percpu-rwsem.h>
 #include <linux/workqueue.h>
 #include <linux/delayed_call.h>
+#include <linux/blk_types.h>
 
 #include <asm/byteorder.h>
 #include <uapi/linux/fs.h>
@@ -629,6 +630,7 @@ struct inode {
 #ifdef CONFIG_IMA
 	atomic_t		i_readcount; /* struct files open RO */
 #endif
+	unsigned int		i_stream;
 	const struct file_operations	*i_fop;	/* former ->i_op->default_file_ops */
 	struct file_lock_context	*i_flctx;
 	struct address_space	i_data;
@@ -655,6 +657,14 @@ struct inode {
 	void			*i_private; /* fs or device private pointer */
 };
 
+static inline unsigned int inode_streamid(struct inode *inode)
+{
+	if (inode)
+		return inode->i_stream;
+
+	return 0;
+}
+
 static inline unsigned int i_blocksize(const struct inode *node)
 {
 	return (1 << node->i_blkbits);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 05/11] fs: add support for allowing applications to pass in write life time hints
  2017-06-13 17:15 [PATCH 0/11] Add support for write life time hints Jens Axboe
                   ` (3 preceding siblings ...)
  2017-06-13 17:15 ` [PATCH 04/11] fs: add support for an inode to carry stream related data Jens Axboe
@ 2017-06-13 17:15 ` Jens Axboe
  2017-06-13 19:36   ` [PATCH 05/11] " Andreas Dilger
  2017-06-13 17:15 ` [PATCH 06/11] fs: add O_DIRECT support for sending down bio stream information Jens Axboe
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 40+ messages in thread
From: Jens Axboe @ 2017-06-13 17:15 UTC (permalink / raw)
  To: linux-fsdevel, linux-block; +Cc: Jens Axboe

Add four flags for the pwritev2(2) system call, allowing an application
to give the kernel a hint about what on-media life times can be
expected from a given write.

The intent is for these values to be relative to each other, no
absolute meaning should be attached to these flag names.

Define IOCB flags to carry this information over, and finally
transform them into the block defined stream values.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/read_write.c         | 17 ++++++++++++++++-
 include/linux/fs.h      | 18 ++++++++++++++++++
 include/uapi/linux/fs.h |  4 ++++
 3 files changed, 38 insertions(+), 1 deletion(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 47c1d4484df9..1734728aa48b 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -678,7 +678,9 @@ static ssize_t do_iter_readv_writev(struct file *filp, struct iov_iter *iter,
 	struct kiocb kiocb;
 	ssize_t ret;
 
-	if (flags & ~(RWF_HIPRI | RWF_DSYNC | RWF_SYNC))
+	if (flags & ~(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_WRITE_LIFE_SHORT |
+			RWF_WRITE_LIFE_MEDIUM | RWF_WRITE_LIFE_LONG |
+			RWF_WRITE_LIFE_EXTREME))
 		return -EOPNOTSUPP;
 
 	init_sync_kiocb(&kiocb, filp);
@@ -688,6 +690,19 @@ static ssize_t do_iter_readv_writev(struct file *filp, struct iov_iter *iter,
 		kiocb.ki_flags |= IOCB_DSYNC;
 	if (flags & RWF_SYNC)
 		kiocb.ki_flags |= (IOCB_DSYNC | IOCB_SYNC);
+	if (flags & RWF_WRITE_LIFE_SHORT) {
+		kiocb.ki_flags |= IOCB_WRITE_LIFE_SHORT;
+		file_inode(filp)->i_stream = WRITE_LIFE_SHORT;
+	} else if (flags & RWF_WRITE_LIFE_MEDIUM) {
+		kiocb.ki_flags |= IOCB_WRITE_LIFE_MEDIUM;
+		file_inode(filp)->i_stream = WRITE_LIFE_MEDIUM;
+	} else if (flags & RWF_WRITE_LIFE_LONG) {
+		kiocb.ki_flags |= IOCB_WRITE_LIFE_LONG;
+		file_inode(filp)->i_stream = WRITE_LIFE_LONG;
+	} else if (flags & RWF_WRITE_LIFE_EXTREME) {
+		kiocb.ki_flags |= IOCB_WRITE_LIFE_EXTREME;
+		file_inode(filp)->i_stream = WRITE_LIFE_EXTREME;
+	}
 	kiocb.ki_pos = *ppos;
 
 	if (type == READ)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index bb8c246eeda8..7f542e0b0e17 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -269,6 +269,10 @@ struct writeback_control;
 #define IOCB_DSYNC		(1 << 4)
 #define IOCB_SYNC		(1 << 5)
 #define IOCB_WRITE		(1 << 6)
+#define IOCB_WRITE_LIFE_SHORT	(1 << 7)
+#define IOCB_WRITE_LIFE_MEDIUM	(1 << 8)
+#define IOCB_WRITE_LIFE_LONG	(1 << 9)
+#define IOCB_WRITE_LIFE_EXTREME	(1 << 10)
 
 struct kiocb {
 	struct file		*ki_filp;
@@ -293,6 +297,20 @@ static inline void init_sync_kiocb(struct kiocb *kiocb, struct file *filp)
 	};
 }
 
+static inline int iocb_streamid(const struct kiocb *iocb)
+{
+	if (iocb->ki_flags & IOCB_WRITE_LIFE_SHORT)
+		return WRITE_LIFE_SHORT;
+	else if (iocb->ki_flags & IOCB_WRITE_LIFE_MEDIUM)
+		return WRITE_LIFE_MEDIUM;
+	else if (iocb->ki_flags & IOCB_WRITE_LIFE_LONG)
+		return WRITE_LIFE_LONG;
+	else if (iocb->ki_flags & IOCB_WRITE_LIFE_EXTREME)
+		return WRITE_LIFE_EXTREME;
+
+	return WRITE_LIFE_UNKNOWN;
+}
+
 /*
  * "descriptor" for what we're up to with a read.
  * This allows us to use the same read code yet
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 24e61a54feaa..34145b29657e 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -360,5 +360,9 @@ struct fscrypt_key {
 #define RWF_HIPRI			0x00000001 /* high priority request, poll if possible */
 #define RWF_DSYNC			0x00000002 /* per-IO O_DSYNC */
 #define RWF_SYNC			0x00000004 /* per-IO O_SYNC */
+#define RWF_WRITE_LIFE_SHORT		0x00000008 /* short life time write */
+#define RWF_WRITE_LIFE_MEDIUM		0x00000010 /* medium life time write */
+#define RWF_WRITE_LIFE_LONG		0x00000020 /* long life time write */
+#define RWF_WRITE_LIFE_EXTREME		0x00000040 /* extremely long life time write */
 
 #endif /* _UAPI_LINUX_FS_H */
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 06/11] fs: add O_DIRECT support for sending down bio stream information
  2017-06-13 17:15 [PATCH 0/11] Add support for write life time hints Jens Axboe
                   ` (4 preceding siblings ...)
  2017-06-13 17:15 ` [PATCH 05/11] fs: add support for allowing applications to pass in write life time hints Jens Axboe
@ 2017-06-13 17:15 ` Jens Axboe
  2017-06-13 19:38   ` Andreas Dilger
  2017-06-13 17:15 ` [PATCH 07/11] fs: add support for buffered writeback to pass down " Jens Axboe
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 40+ messages in thread
From: Jens Axboe @ 2017-06-13 17:15 UTC (permalink / raw)
  To: linux-fsdevel, linux-block; +Cc: Jens Axboe

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/block_dev.c | 2 ++
 fs/direct-io.c | 2 ++
 fs/iomap.c     | 1 +
 3 files changed, 5 insertions(+)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 519599dddd36..284b8a786283 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -227,6 +227,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
 	bio.bi_iter.bi_sector = pos >> 9;
 	bio.bi_private = current;
 	bio.bi_end_io = blkdev_bio_end_io_simple;
+	bio.bi_stream = iocb_streamid(iocb);
 
 	ret = bio_iov_iter_get_pages(&bio, iter);
 	if (unlikely(ret))
@@ -360,6 +361,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
 		bio->bi_iter.bi_sector = pos >> 9;
 		bio->bi_private = dio;
 		bio->bi_end_io = blkdev_bio_end_io;
+		bio->bi_stream = iocb_streamid(iocb);
 
 		ret = bio_iov_iter_get_pages(bio, iter);
 		if (unlikely(ret)) {
diff --git a/fs/direct-io.c b/fs/direct-io.c
index a04ebea77de8..c9c8b9fd4329 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -386,6 +386,8 @@ dio_bio_alloc(struct dio *dio, struct dio_submit *sdio,
 	else
 		bio->bi_end_io = dio_bio_end_io;
 
+	bio->bi_stream = iocb_streamid(dio->iocb);
+
 	sdio->bio = bio;
 	sdio->logical_offset_in_bio = sdio->cur_page_fs_offset;
 }
diff --git a/fs/iomap.c b/fs/iomap.c
index 4b10892967a5..fa7d29632fbc 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -804,6 +804,7 @@ iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
 
 		if (dio->flags & IOMAP_DIO_WRITE) {
 			bio_set_op_attrs(bio, REQ_OP_WRITE, REQ_SYNC | REQ_IDLE);
+			bio_set_streamid(bio, inode_streamid(inode));
 			task_io_account_write(bio->bi_iter.bi_size);
 		} else {
 			bio_set_op_attrs(bio, REQ_OP_READ, 0);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 07/11] fs: add support for buffered writeback to pass down stream information
  2017-06-13 17:15 [PATCH 0/11] Add support for write life time hints Jens Axboe
                   ` (5 preceding siblings ...)
  2017-06-13 17:15 ` [PATCH 06/11] fs: add O_DIRECT support for sending down bio stream information Jens Axboe
@ 2017-06-13 17:15 ` Jens Axboe
  2017-06-13 19:39   ` [PATCH 07/11] " Andreas Dilger
  2017-06-13 17:15 ` [PATCH 08/11] ext4: add support for passing in stream information for buffered writes Jens Axboe
                   ` (4 subsequent siblings)
  11 siblings, 1 reply; 40+ messages in thread
From: Jens Axboe @ 2017-06-13 17:15 UTC (permalink / raw)
  To: linux-fsdevel, linux-block; +Cc: Jens Axboe

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/buffer.c | 14 +++++++++-----
 fs/mpage.c  |  1 +
 2 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 161be58c5cb0..8324c24751ca 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -49,7 +49,7 @@
 
 static int fsync_buffers_list(spinlock_t *lock, struct list_head *list);
 static int submit_bh_wbc(int op, int op_flags, struct buffer_head *bh,
-			 struct writeback_control *wbc);
+			 unsigned int stream, struct writeback_control *wbc);
 
 #define BH_ENTRY(list) list_entry((list), struct buffer_head, b_assoc_buffers)
 
@@ -1829,7 +1829,8 @@ int __block_write_full_page(struct inode *inode, struct page *page,
 	do {
 		struct buffer_head *next = bh->b_this_page;
 		if (buffer_async_write(bh)) {
-			submit_bh_wbc(REQ_OP_WRITE, write_flags, bh, wbc);
+			submit_bh_wbc(REQ_OP_WRITE, write_flags, bh,
+					inode_streamid(inode), wbc);
 			nr_underway++;
 		}
 		bh = next;
@@ -1883,7 +1884,8 @@ int __block_write_full_page(struct inode *inode, struct page *page,
 		struct buffer_head *next = bh->b_this_page;
 		if (buffer_async_write(bh)) {
 			clear_buffer_dirty(bh);
-			submit_bh_wbc(REQ_OP_WRITE, write_flags, bh, wbc);
+			submit_bh_wbc(REQ_OP_WRITE, write_flags, bh,
+					inode_streamid(inode), wbc);
 			nr_underway++;
 		}
 		bh = next;
@@ -3091,7 +3093,7 @@ void guard_bio_eod(int op, struct bio *bio)
 }
 
 static int submit_bh_wbc(int op, int op_flags, struct buffer_head *bh,
-			 struct writeback_control *wbc)
+			 unsigned int stream, struct writeback_control *wbc)
 {
 	struct bio *bio;
 
@@ -3130,6 +3132,8 @@ static int submit_bh_wbc(int op, int op_flags, struct buffer_head *bh,
 	/* Take care of bh's that straddle the end of the device */
 	guard_bio_eod(op, bio);
 
+	bio_set_streamid(bio, stream);
+
 	if (buffer_meta(bh))
 		op_flags |= REQ_META;
 	if (buffer_prio(bh))
@@ -3142,7 +3146,7 @@ static int submit_bh_wbc(int op, int op_flags, struct buffer_head *bh,
 
 int submit_bh(int op, int op_flags, struct buffer_head *bh)
 {
-	return submit_bh_wbc(op, op_flags, bh, NULL);
+	return submit_bh_wbc(op, op_flags, bh, 0, NULL);
 }
 EXPORT_SYMBOL(submit_bh);
 
diff --git a/fs/mpage.c b/fs/mpage.c
index baff8f820c29..a9d40c0c053e 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -614,6 +614,7 @@ static int __mpage_writepage(struct page *page, struct writeback_control *wbc,
 			goto confused;
 
 		wbc_init_bio(wbc, bio);
+		bio_set_streamid(bio, inode_streamid(inode));
 	}
 
 	/*
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 08/11] ext4: add support for passing in stream information for buffered writes
  2017-06-13 17:15 [PATCH 0/11] Add support for write life time hints Jens Axboe
                   ` (6 preceding siblings ...)
  2017-06-13 17:15 ` [PATCH 07/11] fs: add support for buffered writeback to pass down " Jens Axboe
@ 2017-06-13 17:15 ` Jens Axboe
  2017-06-13 19:40   ` Andreas Dilger
  2017-06-13 17:15 ` [PATCH 09/11] xfs: " Jens Axboe
                   ` (3 subsequent siblings)
  11 siblings, 1 reply; 40+ messages in thread
From: Jens Axboe @ 2017-06-13 17:15 UTC (permalink / raw)
  To: linux-fsdevel, linux-block; +Cc: Jens Axboe

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/ext4/page-io.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index 1a82138ba739..033b5bfa4e0b 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -350,6 +350,7 @@ void ext4_io_submit(struct ext4_io_submit *io)
 		int io_op_flags = io->io_wbc->sync_mode == WB_SYNC_ALL ?
 				  REQ_SYNC : 0;
 		bio_set_op_attrs(io->io_bio, REQ_OP_WRITE, io_op_flags);
+		bio_set_streamid(io->io_bio, inode_streamid(io->io_end->inode));
 		submit_bio(io->io_bio);
 	}
 	io->io_bio = NULL;
@@ -396,6 +397,7 @@ static int io_submit_add_bh(struct ext4_io_submit *io,
 		ret = io_submit_init_bio(io, bh);
 		if (ret)
 			return ret;
+		bio_set_streamid(io->io_bio, inode_streamid(inode));
 	}
 	ret = bio_add_page(io->io_bio, page, bh->b_size, bh_offset(bh));
 	if (ret != bh->b_size)
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 09/11] xfs: add support for passing in stream information for buffered writes
  2017-06-13 17:15 [PATCH 0/11] Add support for write life time hints Jens Axboe
                   ` (7 preceding siblings ...)
  2017-06-13 17:15 ` [PATCH 08/11] ext4: add support for passing in stream information for buffered writes Jens Axboe
@ 2017-06-13 17:15 ` Jens Axboe
  2017-06-13 19:40   ` Andreas Dilger
  2017-06-13 17:15 ` [PATCH 10/11] btrfs: " Jens Axboe
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 40+ messages in thread
From: Jens Axboe @ 2017-06-13 17:15 UTC (permalink / raw)
  To: linux-fsdevel, linux-block; +Cc: Jens Axboe

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/xfs/xfs_aops.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 09af0f7cd55e..9770be0140ad 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -505,6 +505,7 @@ xfs_submit_ioend(
 		return status;
 	}
 
+	bio_set_streamid(ioend->io_bio, inode_streamid(ioend->io_inode));
 	submit_bio(ioend->io_bio);
 	return 0;
 }
@@ -564,6 +565,7 @@ xfs_chain_bio(
 	bio_chain(ioend->io_bio, new);
 	bio_get(ioend->io_bio);		/* for xfs_destroy_ioend */
 	ioend->io_bio->bi_opf = REQ_OP_WRITE | wbc_to_write_flags(wbc);
+	bio_set_streamid(ioend->io_bio, inode_streamid(ioend->io_inode));
 	submit_bio(ioend->io_bio);
 	ioend->io_bio = new;
 }
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 10/11] btrfs: add support for passing in stream information for buffered writes
  2017-06-13 17:15 [PATCH 0/11] Add support for write life time hints Jens Axboe
                   ` (8 preceding siblings ...)
  2017-06-13 17:15 ` [PATCH 09/11] xfs: " Jens Axboe
@ 2017-06-13 17:15 ` Jens Axboe
  2017-06-13 19:41   ` Andreas Dilger
  2017-06-13 17:15 ` [PATCH 11/11] nvme: add support for streams and directives Jens Axboe
  2017-06-13 18:04 ` [PATCH 0/11] Add support for write life time hints Andreas Dilger
  11 siblings, 1 reply; 40+ messages in thread
From: Jens Axboe @ 2017-06-13 17:15 UTC (permalink / raw)
  To: linux-fsdevel, linux-block; +Cc: Jens Axboe

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/btrfs/extent_io.c | 1 +
 fs/btrfs/inode.c     | 1 +
 2 files changed, 2 insertions(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index d3619e010005..b245085e8f10 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2827,6 +2827,7 @@ static int submit_extent_page(int op, int op_flags, struct extent_io_tree *tree,
 	bio->bi_end_io = end_io_func;
 	bio->bi_private = tree;
 	bio_set_op_attrs(bio, op, op_flags);
+	bio_set_streamid(bio, inode_streamid(page->mapping->host));
 	if (wbc) {
 		wbc_init_bio(wbc, bio);
 		wbc_account_io(wbc, page, page_size);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index ef3c98c527c1..db0558a19f65 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8608,6 +8608,7 @@ static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode,
 	atomic_set(&dip->pending_bios, 0);
 	btrfs_bio = btrfs_io_bio(io_bio);
 	btrfs_bio->logical = file_offset;
+	bio_set_streamid(io_bio, bio_stream(dio_bio));
 
 	if (write) {
 		io_bio->bi_end_io = btrfs_endio_direct_write;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 11/11] nvme: add support for streams and directives
  2017-06-13 17:15 [PATCH 0/11] Add support for write life time hints Jens Axboe
                   ` (9 preceding siblings ...)
  2017-06-13 17:15 ` [PATCH 10/11] btrfs: " Jens Axboe
@ 2017-06-13 17:15 ` Jens Axboe
  2017-06-13 19:47   ` Andreas Dilger
  2017-06-13 21:12   ` Andreas Dilger
  2017-06-13 18:04 ` [PATCH 0/11] Add support for write life time hints Andreas Dilger
  11 siblings, 2 replies; 40+ messages in thread
From: Jens Axboe @ 2017-06-13 17:15 UTC (permalink / raw)
  To: linux-fsdevel, linux-block; +Cc: Jens Axboe

This adds support for Directives in NVMe, particular for the Streams
directive. We default to allocating 4 streams per name space, but
it is configurable with the 'streams_per_ns' module option.

If a write stream is set in a write, flag is as such before
sending it to the device.

Some debug stuff in this patch, dumping streams ID params when
we load nvme.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 drivers/nvme/host/core.c | 124 +++++++++++++++++++++++++++++++++++++++++++++++
 drivers/nvme/host/nvme.h |   1 +
 include/linux/nvme.h     |  48 ++++++++++++++++++
 3 files changed, 173 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 903d5813023a..81225e7d4176 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -65,6 +65,10 @@ static bool force_apst;
 module_param(force_apst, bool, 0644);
 MODULE_PARM_DESC(force_apst, "allow APST for newly enumerated devices even if quirked off");
 
+static char streams_per_ns = 4;
+module_param(streams_per_ns, byte, 0644);
+MODULE_PARM_DESC(streams_per_ns, "if available, allocate this many streams per NS");
+
 static LIST_HEAD(nvme_ctrl_list);
 static DEFINE_SPINLOCK(dev_list_lock);
 
@@ -351,6 +355,15 @@ static inline void nvme_setup_rw(struct nvme_ns *ns, struct request *req,
 	cmnd->rw.slba = cpu_to_le64(nvme_block_nr(ns, blk_rq_pos(req)));
 	cmnd->rw.length = cpu_to_le16((blk_rq_bytes(req) >> ns->lba_shift) - 1);
 
+	if (req_op(req) == REQ_OP_WRITE) {
+		if (bio_stream_valid(req->bio) && ns->streams) {
+			unsigned stream = bio_stream(req->bio) & 0xffff;
+
+			control |= NVME_RW_DTYPE_STREAMS;
+			dsmgmt |= (stream << 16);
+		}
+	}
+
 	if (ns->ms) {
 		switch (ns->pi_type) {
 		case NVME_NS_DPS_PI_TYPE3:
@@ -1073,6 +1086,109 @@ static int nvme_revalidate_disk(struct gendisk *disk)
 	return 0;
 }
 
+static int nvme_enable_streams(struct nvme_ns *ns)
+{
+	struct nvme_command c;
+
+	memset(&c, 0, sizeof(c));
+
+	c.directive.opcode = nvme_admin_directive_send;
+	c.directive.nsid = cpu_to_le32(ns->ns_id);
+	c.directive.doper = NVME_DIR_SND_ID_OP_ENABLE;
+	c.directive.dtype = NVME_DIR_IDENTIFY;
+	c.directive.tdtype = NVME_DIR_STREAMS;
+	c.directive.endir = NVME_DIR_ENDIR;
+
+	return nvme_submit_sync_cmd(ns->ctrl->admin_q, &c, NULL, 0);
+}
+
+static int nvme_streams_params(struct nvme_ns *ns)
+{
+	struct nvme_ctrl *ctrl = ns->ctrl;
+	struct streams_directive_params s;
+	struct nvme_command c;
+	int ret;
+
+	memset(&c, 0, sizeof(c));
+	memset(&s, 0, sizeof(s));
+
+	c.directive.opcode = nvme_admin_directive_recv;
+	c.directive.nsid = cpu_to_le32(ns->ns_id);
+	c.directive.numd = sizeof(s);
+	c.directive.doper = NVME_DIR_RCV_ST_OP_PARAM;
+	c.directive.dtype = NVME_DIR_STREAMS;
+
+	ret = nvme_submit_sync_cmd(ctrl->admin_q, &c, &s, sizeof(s));
+	if (ret)
+		return ret;
+
+	s.msl = le16_to_cpu(s.msl);
+	s.nssa = le16_to_cpu(s.nssa);
+	s.nsso = le16_to_cpu(s.nsso);
+	s.sws = le32_to_cpu(s.sws);
+	s.sgs = le16_to_cpu(s.sgs);
+	s.nsa = le16_to_cpu(s.nsa);
+	s.nso = le16_to_cpu(s.nso);
+
+	dev_info(ctrl->device, "streams: msl=%u, nssa=%u, nsso=%u, sws=%u "
+				"sgs=%u, nsa=%u, nso=%u\n", s.msl, s.nssa,
+				s.nsso, s.sws, s.sgs, s.nsa, s.nso);
+	return 0;
+}
+
+static int nvme_streams_allocate(struct nvme_ns *ns, unsigned int streams)
+{
+	struct nvme_command c;
+
+	memset(&c, 0, sizeof(c));
+
+	c.directive.opcode = nvme_admin_directive_recv;
+	c.directive.nsid = cpu_to_le32(ns->ns_id);
+	c.directive.doper = NVME_DIR_RCV_ST_OP_RESOURCE;
+	c.directive.dtype = NVME_DIR_STREAMS;
+	c.directive.endir = streams;
+
+	return nvme_submit_sync_cmd(ns->ctrl->admin_q, &c, NULL, 0);
+}
+
+static int nvme_streams_deallocate(struct nvme_ns *ns)
+{
+	struct nvme_command c;
+
+	memset(&c, 0, sizeof(c));
+
+	c.directive.opcode = nvme_admin_directive_send;
+	c.directive.nsid = cpu_to_le32(ns->ns_id);
+	c.directive.doper = NVME_DIR_SND_ST_OP_REL_RSC;
+	c.directive.dtype = NVME_DIR_STREAMS;
+
+	return nvme_submit_sync_cmd(ns->ctrl->admin_q, &c, NULL, 0);
+}
+
+static void nvme_config_streams(struct nvme_ns *ns)
+{
+	int ret;
+
+	ret = nvme_enable_streams(ns);
+	if (ret)
+		return;
+
+	ret = nvme_streams_params(ns);
+	if (ret)
+		return;
+
+	ret = nvme_streams_allocate(ns, streams_per_ns);
+	if (ret)
+		return;
+
+	ret = nvme_streams_params(ns);
+	if (ret)
+		return;
+
+	ns->streams = true;
+	dev_info(ns->ctrl->device, "successfully enabled streams\n");
+}
+
 static char nvme_pr_type(enum pr_type type)
 {
 	switch (type) {
@@ -1606,6 +1722,9 @@ int nvme_init_identify(struct nvme_ctrl *ctrl)
 	ctrl->sgls = le32_to_cpu(id->sgls);
 	ctrl->kas = le16_to_cpu(id->kas);
 
+	if (ctrl->oacs & NVME_CTRL_OACS_DIRECTIVES)
+		dev_info(ctrl->dev, "supports directives\n");
+
 	ctrl->npss = id->npss;
 	prev_apsta = ctrl->apsta;
 	if (ctrl->quirks & NVME_QUIRK_NO_APST) {
@@ -2060,6 +2179,9 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 		goto out_free_id;
 	}
 
+	if (ctrl->oacs & NVME_CTRL_OACS_DIRECTIVES)
+		nvme_config_streams(ns);
+
 	disk = alloc_disk_node(0, node);
 	if (!disk)
 		goto out_free_id;
@@ -2112,6 +2234,8 @@ static void nvme_ns_remove(struct nvme_ns *ns)
 					&nvme_ns_attr_group);
 		if (ns->ndev)
 			nvme_nvm_unregister_sysfs(ns);
+		if (ns->streams)
+			nvme_streams_deallocate(ns);
 		del_gendisk(ns->disk);
 		blk_cleanup_queue(ns->queue);
 	}
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 9d6a070d4391..c2d8d23c90de 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -195,6 +195,7 @@ struct nvme_ns {
 	int lba_shift;
 	u16 ms;
 	bool ext;
+	bool streams;
 	u8 pi_type;
 	unsigned long flags;
 
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index b625bacf37ef..8b2f5b140134 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -245,6 +245,7 @@ enum {
 	NVME_CTRL_ONCS_WRITE_ZEROES		= 1 << 3,
 	NVME_CTRL_VWC_PRESENT			= 1 << 0,
 	NVME_CTRL_OACS_SEC_SUPP                 = 1 << 0,
+	NVME_CTRL_OACS_DIRECTIVES		= 1 << 5,
 	NVME_CTRL_OACS_DBBUF_SUPP		= 1 << 7,
 };
 
@@ -295,6 +296,19 @@ enum {
 };
 
 enum {
+	NVME_DIR_IDENTIFY		= 0x00,
+	NVME_DIR_STREAMS		= 0x01,
+	NVME_DIR_SND_ID_OP_ENABLE	= 0x01,
+	NVME_DIR_SND_ST_OP_REL_ID	= 0x01,
+	NVME_DIR_SND_ST_OP_REL_RSC	= 0x02,
+	NVME_DIR_RCV_ID_OP_PARAM	= 0x01,
+	NVME_DIR_RCV_ST_OP_PARAM	= 0x01,
+	NVME_DIR_RCV_ST_OP_STATUS	= 0x02,
+	NVME_DIR_RCV_ST_OP_RESOURCE	= 0x03,
+	NVME_DIR_ENDIR			= 0x01,
+};
+
+enum {
 	NVME_NS_FEAT_THIN	= 1 << 0,
 	NVME_NS_FLBAS_LBA_MASK	= 0xf,
 	NVME_NS_FLBAS_META_EXT	= 0x10,
@@ -535,6 +549,7 @@ enum {
 	NVME_RW_PRINFO_PRCHK_APP	= 1 << 11,
 	NVME_RW_PRINFO_PRCHK_GUARD	= 1 << 12,
 	NVME_RW_PRINFO_PRACT		= 1 << 13,
+	NVME_RW_DTYPE_STREAMS		= 1 << 4,
 };
 
 struct nvme_dsm_cmd {
@@ -604,6 +619,8 @@ enum nvme_admin_opcode {
 	nvme_admin_download_fw		= 0x11,
 	nvme_admin_ns_attach		= 0x15,
 	nvme_admin_keep_alive		= 0x18,
+	nvme_admin_directive_send	= 0x19,
+	nvme_admin_directive_recv	= 0x1a,
 	nvme_admin_dbbuf		= 0x7C,
 	nvme_admin_format_nvm		= 0x80,
 	nvme_admin_security_send	= 0x81,
@@ -756,6 +773,24 @@ struct nvme_get_log_page_command {
 	__u32			rsvd14[2];
 };
 
+struct nvme_directive_cmd {
+	__u8			opcode;
+	__u8			flags;
+	__u16			command_id;
+	__le32			nsid;
+	__u64			rsvd2[2];
+	union nvme_data_ptr	dptr;
+	__le32			numd;
+	__u8			doper;
+	__u8			dtype;
+	__le16			dspec;
+	__u8			endir;
+	__u8			tdtype;
+	__u16			rsvd15;
+
+	__u32			rsvd16[3];
+};
+
 /*
  * Fabrics subcommands.
  */
@@ -886,6 +921,18 @@ struct nvme_dbbuf {
 	__u32			rsvd12[6];
 };
 
+struct streams_directive_params {
+	__u16	msl;
+	__u16	nssa;
+	__u16	nsso;
+	__u8	rsvd[10];
+	__u32	sws;
+	__u16	sgs;
+	__u16	nsa;
+	__u16	nso;
+	__u8	rsvd2[6];
+};
+
 struct nvme_command {
 	union {
 		struct nvme_common_command common;
@@ -906,6 +953,7 @@ struct nvme_command {
 		struct nvmf_property_set_command prop_set;
 		struct nvmf_property_get_command prop_get;
 		struct nvme_dbbuf dbbuf;
+		struct nvme_directive_cmd directive;
 	};
 };
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/11] Add support for write life time hints
  2017-06-13 17:15 [PATCH 0/11] Add support for write life time hints Jens Axboe
                   ` (10 preceding siblings ...)
  2017-06-13 17:15 ` [PATCH 11/11] nvme: add support for streams and directives Jens Axboe
@ 2017-06-13 18:04 ` Andreas Dilger
  2017-06-13 18:26   ` Jens Axboe
  11 siblings, 1 reply; 40+ messages in thread
From: Andreas Dilger @ 2017-06-13 18:04 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-fsdevel, linux-block

[-- Attachment #1: Type: text/plain, Size: 1496 bytes --]

On Jun 13, 2017, at 11:15 AM, Jens Axboe <axboe@kernel.dk> wrote:
> 
> A new iteration of this patchset, previously known as write streams.
> Instead of exposing numeric values for streams, I've previously
> advocated for just doing a set of hints that makes sense instead. See
> the coverage from the LSFMM summit this year:
> 
> https://lwn.net/Articles/717755/
> 
> This patchset attempts to do that. We define 4 flags for the pwritev2
> system call:
> 
> RWF_WRITE_LIFE_SHORT	Data written with this flag is expected to have
> 			a high overwrite rate, or life time.
> 
> RWF_WRITE_LIFE_MEDIUM	Longer life time than SHORT
> 
> RWF_WRITE_LIFE_LONG	Longer life time than MEDIUM
> 
> RWF_WRITE_LIFE_EXTREME	Longer life time than LONG
> 
> The idea is that these are relative values, so an application can
> use them as they see fit. The underlying device can then place
> data appropriately, or be free to ignore the hint. It's just a hint.
> 
> Comments appreciated.

I thought that one of the major attractions of numeric stream IDs was
that they had no semantic meanings, just "N is similar to N" and "M is
similar to M", and it is up to userspace to define what these mean?

That allows userspace to use the IDs for lifetimes (as above), but
also/instead use them for allocation sizes, different applications,
different users, etc.

> A branch based on current master can be pulled from here:
> 
> git://git.kernel.dk/linux-block write-stream.1
> 
> --
> Jens Axboe
> 


Cheers, Andreas






[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/11] Add support for write life time hints
  2017-06-13 18:04 ` [PATCH 0/11] Add support for write life time hints Andreas Dilger
@ 2017-06-13 18:26   ` Jens Axboe
  2017-06-13 19:21     ` Andreas Dilger
  0 siblings, 1 reply; 40+ messages in thread
From: Jens Axboe @ 2017-06-13 18:26 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: linux-fsdevel, linux-block

On 06/13/2017 12:04 PM, Andreas Dilger wrote:
> On Jun 13, 2017, at 11:15 AM, Jens Axboe <axboe@kernel.dk> wrote:
>>
>> A new iteration of this patchset, previously known as write streams.
>> Instead of exposing numeric values for streams, I've previously
>> advocated for just doing a set of hints that makes sense instead. See
>> the coverage from the LSFMM summit this year:
>>
>> https://lwn.net/Articles/717755/
>>
>> This patchset attempts to do that. We define 4 flags for the pwritev2
>> system call:
>>
>> RWF_WRITE_LIFE_SHORT	Data written with this flag is expected to have
>> 			a high overwrite rate, or life time.
>>
>> RWF_WRITE_LIFE_MEDIUM	Longer life time than SHORT
>>
>> RWF_WRITE_LIFE_LONG	Longer life time than MEDIUM
>>
>> RWF_WRITE_LIFE_EXTREME	Longer life time than LONG
>>
>> The idea is that these are relative values, so an application can
>> use them as they see fit. The underlying device can then place
>> data appropriately, or be free to ignore the hint. It's just a hint.
>>
>> Comments appreciated.
> 
> I thought that one of the major attractions of numeric stream IDs was
> that they had no semantic meanings, just "N is similar to N" and "M is
> similar to M", and it is up to userspace to define what these mean?
> 
> That allows userspace to use the IDs for lifetimes (as above), but
> also/instead use them for allocation sizes, different applications,
> different users, etc.

Right, that is indeed the intent. But we have to attach some naming
to them. Userspace could in theory use these totally randomly, and
things like NVMe would not care. But the semantic meaning of "short"
vs "long" is important on caching infrastructure where you might
want to use the hint for data placement.

I think the important part here is that no absolute meaning is
attached to them, only relative.


-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 01/11] block: add support for carrying stream information in a bio
  2017-06-13 17:15 ` [PATCH 01/11] block: add support for carrying stream information in a bio Jens Axboe
@ 2017-06-13 19:01   ` Andreas Dilger
  0 siblings, 0 replies; 40+ messages in thread
From: Andreas Dilger @ 2017-06-13 19:01 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-fsdevel, linux-block

[-- Attachment #1: Type: text/plain, Size: 3201 bytes --]

On Jun 13, 2017, at 11:15 AM, Jens Axboe <axboe@kernel.dk> wrote:
> 
> No functional changes in this patch, but it allows each bio to
> carry a stream ID and disallows merging of bio's with different
> stream IDs.
> 
> Signed-off-by: Jens Axboe <axboe@kernel.dk>

Reviewed-by: Andreas Dilger <adilger@dilger.ca>

> ---
> block/bio.c               |  2 ++
> block/blk-merge.c         | 14 ++++++++++++++
> include/linux/blk_types.h | 17 +++++++++++++++++
> 3 files changed, 33 insertions(+)
> 
> diff --git a/block/bio.c b/block/bio.c
> index 888e7801c638..77f4be1f7428 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -595,6 +595,7 @@ void __bio_clone_fast(struct bio *bio, struct bio *bio_src)
> 	bio->bi_opf = bio_src->bi_opf;
> 	bio->bi_iter = bio_src->bi_iter;
> 	bio->bi_io_vec = bio_src->bi_io_vec;
> +	bio->bi_stream = bio_src->bi_stream;
> 
> 	bio_clone_blkcg_association(bio, bio_src);
> }
> @@ -678,6 +679,7 @@ struct bio *bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
> 	bio->bi_opf		= bio_src->bi_opf;
> 	bio->bi_iter.bi_sector	= bio_src->bi_iter.bi_sector;
> 	bio->bi_iter.bi_size	= bio_src->bi_iter.bi_size;
> +	bio->bi_stream		= bio_src->bi_stream;
> 
> 	switch (bio_op(bio)) {
> 	case REQ_OP_DISCARD:
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 3990ae406341..28998acdd675 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -693,6 +693,13 @@ static struct request *attempt_merge(struct request_queue *q,
> 		return NULL;
> 
> 	/*
> +	 * Don't allow merge of different streams, or for a stream with
> +	 * non-stream IO.
> +	 */
> +	if (req->bio->bi_stream != next->bio->bi_stream)
> +		return NULL;
> +
> +	/*
> 	 * If we are allowed to merge, then append bio list
> 	 * from next to rq and release next. merge_requests_fn
> 	 * will have updated segment counts, update sector
> @@ -811,6 +818,13 @@ bool blk_rq_merge_ok(struct request *rq, struct bio *bio)
> 	    !blk_write_same_mergeable(rq->bio, bio))
> 		return false;
> 
> +	/*
> +	 * Don't allow merge of different streams, or for a stream with
> +	 * non-stream IO.
> +	 */
> +	if (rq->bio->bi_stream != bio->bi_stream)
> +		return false;
> +
> 	return true;
> }
> 
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index 61339bc44400..1940876a7fa7 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -36,6 +36,8 @@ struct bio {
> 	unsigned short		bi_flags;	/* status, etc and bvec pool number */
> 	unsigned short		bi_ioprio;
> 
> +	unsigned int		bi_stream;	/* write life time hint */
> +
> 	struct bvec_iter	bi_iter;
> 
> 	/* Number of segments in this BIO after
> @@ -312,4 +314,19 @@ struct blk_rq_stat {
> 	u64 batch;
> };
> 
> +static inline void bio_set_streamid(struct bio *bio, unsigned int stream)
> +{
> +	bio->bi_stream = stream;
> +}
> +
> +static inline bool bio_stream_valid(struct bio *bio)
> +{
> +	return bio->bi_stream != 0;
> +}
> +
> +static inline unsigned int bio_stream(struct bio *bio)
> +{
> +	return bio->bi_stream;
> +}
> +
> #endif /* __LINUX_BLK_TYPES_H */
> --
> 2.7.4
> 


Cheers, Andreas






[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 02/11] block: add definition for support data write life times
  2017-06-13 17:15 ` [PATCH 02/11] block: add definition for support data write life times Jens Axboe
@ 2017-06-13 19:07   ` Andreas Dilger
  2017-06-13 20:10     ` Jens Axboe
  0 siblings, 1 reply; 40+ messages in thread
From: Andreas Dilger @ 2017-06-13 19:07 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-fsdevel, linux-block

[-- Attachment #1: Type: text/plain, Size: 1390 bytes --]

On Jun 13, 2017, at 11:15 AM, Jens Axboe <axboe@kernel.dk> wrote:
> 
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
> include/linux/blk_types.h | 13 +++++++++++++
> 1 file changed, 13 insertions(+)
> 
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index 1940876a7fa7..16c4e657807f 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -314,6 +314,19 @@ struct blk_rq_stat {
> 	u64 batch;
> };
> 
> +/*
> + * Data life time definitions for writes. These are relative within a
> + * given device, no absolute meaning is attached to them.
> + */
> +enum {
> +	WRITE_LIFE_UNKNOWN	= 0,
> +	WRITE_LIFE_SHORT,
> +	WRITE_LIFE_MEDIUM,
> +	WRITE_LIFE_LONG,
> +	WRITE_LIFE_EXTREME,
> +	WRITE_LIFE_NR,
> +};

With the exception of these enums, there is actually enough space in the
various structures to pass at least a 4-bit stream ID through the whole
stack (actually, a full 32-bit stream ID, except for the debugfs stats).

I'm not necessarily against having some pre-defined meanings for these
values, but with a few trivial changes to the code it would be possible
to make this functionality much more flexible.

Comments inline on each of the patches where this matters.

Cheers, Andreas

> +
> static inline void bio_set_streamid(struct bio *bio, unsigned int stream)
> {
> 	bio->bi_stream = stream;
> --
> 2.7.4
> 


Cheers, Andreas






[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 03/11] blk-mq: expose stream write stats through debugfs
  2017-06-13 17:15 ` [PATCH 03/11] blk-mq: expose stream write stats through debugfs Jens Axboe
@ 2017-06-13 19:21   ` Andreas Dilger
  2017-06-13 20:11     ` Jens Axboe
  0 siblings, 1 reply; 40+ messages in thread
From: Andreas Dilger @ 2017-06-13 19:21 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-fsdevel, linux-block

[-- Attachment #1: Type: text/plain, Size: 3375 bytes --]

On Jun 13, 2017, at 11:15 AM, Jens Axboe <axboe@kernel.dk> wrote:
> 
> Useful to verify that things are working the way they should.
> Reading the file will return number of kb written to each
> stream. Writing the file will reset the statistics. No care
> is taken to ensure that we don't race on updates.

This is one of the few places where there is an arbitrary limit on the
number of stream IDs, which is strange because it doesn't correspond to
any limit in a struct field or the physical hardware or external protocol...

Bumping the stream_writes[] stats array up to 16 wouldn't make any
significant difference to the size of struct request_queue.

> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
> block/blk-core.c       |  2 ++
> block/blk-mq-debugfs.c | 24 ++++++++++++++++++++++++
> include/linux/blkdev.h |  2 ++
> 3 files changed, 28 insertions(+)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index a7421b772d0e..12e402e4e741 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2057,6 +2057,8 @@ blk_qc_t generic_make_request(struct bio *bio)
> 	do {
> 		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
> 
> +		q->stream_writes[bio_stream(bio)] += bio->bi_iter.bi_size >> 9;

This should probably limit the value returned from bio_stream(), and put values
>= WRITE_LIFE_NR into an "overflow" bucket or similar?  At some point in the
future this could be made more flexible, based on the actual usage, but at
least the limitation on the implementation isn't something arbitrary like stats.

> 		if (likely(blk_queue_enter(q, false) == 0)) {
> 			struct bio_list lower, same;
> 
> diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> index 803aed4d7221..a0a480a0b373 100644
> --- a/block/blk-mq-debugfs.c
> +++ b/block/blk-mq-debugfs.c
> @@ -133,6 +133,29 @@ static void print_stat(struct seq_file *m, struct blk_rq_stat *stat)
> 	}
> }
> 
> +static int queue_streams_show(void *data, struct seq_file *m)
> +{
> +	struct request_queue *q = data;
> +	int i;
> +
> +	for (i = 0; i < WRITE_LIFE_NR; i++)
> +		seq_printf(m, "stream%d: %llu\n", i, q->stream_writes[i]);
> +
> +	return 0;
> +}
> +
> +static ssize_t queue_streams_store(void *data, const char __user *buf,
> +				   size_t count, loff_t *ppos)
> +{
> +	struct request_queue *q = data;
> +	int i;
> +
> +	for (i = 0; i < WRITE_LIFE_NR; i++)
> +		q->stream_writes[i] = 0;
> +
> +	return count;
> +}
> +
> static int queue_poll_stat_show(void *data, struct seq_file *m)
> {
> 	struct request_queue *q = data;
> @@ -656,6 +679,7 @@ const struct file_operations blk_mq_debugfs_fops = {
> static const struct blk_mq_debugfs_attr blk_mq_debugfs_queue_attrs[] = {
> 	{"poll_stat", 0400, queue_poll_stat_show},
> 	{"state", 0600, queue_state_show, queue_state_write},
> +	{"streams", 0600, queue_streams_show, queue_streams_store},
> 	{},
> };
> 
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index ab92c4ea138b..ab623deb20b2 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -586,6 +586,8 @@ struct request_queue {
> 
> 	size_t			cmd_size;
> 	void			*rq_alloc_data;
> +
> +	u64			stream_writes[WRITE_LIFE_NR];
> };
> 
> #define QUEUE_FLAG_QUEUED	1	/* uses generic tag queueing */
> --
> 2.7.4
> 


Cheers, Andreas






[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/11] Add support for write life time hints
  2017-06-13 18:26   ` Jens Axboe
@ 2017-06-13 19:21     ` Andreas Dilger
  2017-06-13 20:13       ` Jens Axboe
  0 siblings, 1 reply; 40+ messages in thread
From: Andreas Dilger @ 2017-06-13 19:21 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-fsdevel, linux-block

[-- Attachment #1: Type: text/plain, Size: 2323 bytes --]

On Jun 13, 2017, at 12:26 PM, Jens Axboe <axboe@kernel.dk> wrote:
> 
> On 06/13/2017 12:04 PM, Andreas Dilger wrote:
>> On Jun 13, 2017, at 11:15 AM, Jens Axboe <axboe@kernel.dk> wrote:
>>> 
>>> A new iteration of this patchset, previously known as write streams.
>>> Instead of exposing numeric values for streams, I've previously
>>> advocated for just doing a set of hints that makes sense instead. See
>>> the coverage from the LSFMM summit this year:
>>> 
>>> https://lwn.net/Articles/717755/
>>> 
>>> This patchset attempts to do that. We define 4 flags for the pwritev2
>>> system call:
>>> 
>>> RWF_WRITE_LIFE_SHORT	Data written with this flag is expected to have
>>> 			a high overwrite rate, or life time.
>>> 
>>> RWF_WRITE_LIFE_MEDIUM	Longer life time than SHORT
>>> 
>>> RWF_WRITE_LIFE_LONG	Longer life time than MEDIUM
>>> 
>>> RWF_WRITE_LIFE_EXTREME	Longer life time than LONG
>>> 
>>> The idea is that these are relative values, so an application can
>>> use them as they see fit. The underlying device can then place
>>> data appropriately, or be free to ignore the hint. It's just a hint.
>>> 
>>> Comments appreciated.
>> 
>> I thought that one of the major attractions of numeric stream IDs was
>> that they had no semantic meanings, just "N is similar to N" and "M is
>> similar to M", and it is up to userspace to define what these mean?
>> 
>> That allows userspace to use the IDs for lifetimes (as above), but
>> also/instead use them for allocation sizes, different applications,
>> different users, etc.
> 
> Right, that is indeed the intent. But we have to attach some naming
> to them. Userspace could in theory use these totally randomly, and
> things like NVMe would not care. But the semantic meaning of "short"
> vs "long" is important on caching infrastructure where you might
> want to use the hint for data placement.
> 
> I think the important part here is that no absolute meaning is
> attached to them, only relative.

In both IOCB_WRITE_LIFE_* and RWF_WRITE_LIFE_* this is consuming 4 bits of
space (which is itself fine) for only 4 different stream IDs.  Why not just
shift a 4-bit arbitrary stream ID to the appropriate offset in those fields,
rather than treating them as 4 individual bits and allowing only one of
them to be passed down the stack at a time?

Cheers, Andreas






[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 04/11] add support for an inode to carry stream related data
  2017-06-13 17:15 ` [PATCH 04/11] fs: add support for an inode to carry stream related data Jens Axboe
@ 2017-06-13 19:24   ` Andreas Dilger
  2017-06-13 20:14     ` Jens Axboe
  0 siblings, 1 reply; 40+ messages in thread
From: Andreas Dilger @ 2017-06-13 19:24 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-fsdevel, linux-block

[-- Attachment #1: Type: text/plain, Size: 2169 bytes --]

On Jun 13, 2017, at 11:15 AM, Jens Axboe <axboe@kernel.dk> wrote:
> 
> No functional changes in this patch, just in preparation for
> allowing applications to pass in hints about data life times
> for writes.
> 
> Signed-off-by: Jens Axboe <axboe@kernel.dk>

This is fine from the POV of passing the stream ID unchanged down the stack.
If we really wanted to limit the ID to a 16-bit value, the i_stream could fit
into a few 16-bit holes in struct inode, to avoid increasing its size.

Reviewed-by: Andreas Dilger <adilger@dilger.ca>

> ---
> fs/inode.c         |  1 +
> include/linux/fs.h | 10 ++++++++++
> 2 files changed, 11 insertions(+)
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index db5914783a71..c911a2b9b5a4 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -149,6 +149,7 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
> 	inode->i_blocks = 0;
> 	inode->i_bytes = 0;
> 	inode->i_generation = 0;
> +	inode->i_stream = WRITE_LIFE_UNKNOWN;
> 	inode->i_pipe = NULL;
> 	inode->i_bdev = NULL;
> 	inode->i_cdev = NULL;
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 803e5a9b2654..bb8c246eeda8 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -30,6 +30,7 @@
> #include <linux/percpu-rwsem.h>
> #include <linux/workqueue.h>
> #include <linux/delayed_call.h>
> +#include <linux/blk_types.h>
> 
> #include <asm/byteorder.h>
> #include <uapi/linux/fs.h>
> @@ -629,6 +630,7 @@ struct inode {
> #ifdef CONFIG_IMA
> 	atomic_t		i_readcount; /* struct files open RO */
> #endif
> +	unsigned int		i_stream;
> 	const struct file_operations	*i_fop;	/* former ->i_op->default_file_ops */
> 	struct file_lock_context	*i_flctx;
> 	struct address_space	i_data;
> @@ -655,6 +657,14 @@ struct inode {
> 	void			*i_private; /* fs or device private pointer */
> };
> 
> +static inline unsigned int inode_streamid(struct inode *inode)
> +{
> +	if (inode)
> +		return inode->i_stream;
> +
> +	return 0;
> +}
> +
> static inline unsigned int i_blocksize(const struct inode *node)
> {
> 	return (1 << node->i_blkbits);
> --
> 2.7.4
> 


Cheers, Andreas






[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 05/11] add support for allowing applications to pass in write life time hints
  2017-06-13 17:15 ` [PATCH 05/11] fs: add support for allowing applications to pass in write life time hints Jens Axboe
@ 2017-06-13 19:36   ` Andreas Dilger
  2017-06-13 20:21     ` Jens Axboe
  0 siblings, 1 reply; 40+ messages in thread
From: Andreas Dilger @ 2017-06-13 19:36 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-fsdevel, linux-block

[-- Attachment #1: Type: text/plain, Size: 5200 bytes --]

On Jun 13, 2017, at 11:15 AM, Jens Axboe <axboe@kernel.dk> wrote:
> 
> Add four flags for the pwritev2(2) system call, allowing an application
> to give the kernel a hint about what on-media life times can be
> expected from a given write.
> 
> The intent is for these values to be relative to each other, no
> absolute meaning should be attached to these flag names.
> 
> Define IOCB flags to carry this information over, and finally
> transform them into the block defined stream values.
> 
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
> fs/read_write.c         | 17 ++++++++++++++++-
> include/linux/fs.h      | 18 ++++++++++++++++++
> include/uapi/linux/fs.h |  4 ++++
> 3 files changed, 38 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 47c1d4484df9..1734728aa48b 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -678,7 +678,9 @@ static ssize_t do_iter_readv_writev(struct file *filp, struct iov_iter *iter,
> 	struct kiocb kiocb;
> 	ssize_t ret;
> 
> -	if (flags & ~(RWF_HIPRI | RWF_DSYNC | RWF_SYNC))
> +	if (flags & ~(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_WRITE_LIFE_SHORT |
> +			RWF_WRITE_LIFE_MEDIUM | RWF_WRITE_LIFE_LONG |
> +			RWF_WRITE_LIFE_EXTREME))
> 		return -EOPNOTSUPP;

Since this is essentially just a 4-bit mask, which also works fine if the stream
ID is an arbitrary value, it would be more clear to just define a mask like
RWF_WRITE_LIFE_MASK that covers these bits instead of spelling out individual bits.

> 
> 	init_sync_kiocb(&kiocb, filp);
> @@ -688,6 +690,19 @@ static ssize_t do_iter_readv_writev(struct file *filp, struct iov_iter *iter,
> 		kiocb.ki_flags |= IOCB_DSYNC;
> 	if (flags & RWF_SYNC)
> 		kiocb.ki_flags |= (IOCB_DSYNC | IOCB_SYNC);
> +	if (flags & RWF_WRITE_LIFE_SHORT) {
> +		kiocb.ki_flags |= IOCB_WRITE_LIFE_SHORT;
> +		file_inode(filp)->i_stream = WRITE_LIFE_SHORT;
> +	} else if (flags & RWF_WRITE_LIFE_MEDIUM) {
> +		kiocb.ki_flags |= IOCB_WRITE_LIFE_MEDIUM;
> +		file_inode(filp)->i_stream = WRITE_LIFE_MEDIUM;
> +	} else if (flags & RWF_WRITE_LIFE_LONG) {
> +		kiocb.ki_flags |= IOCB_WRITE_LIFE_LONG;
> +		file_inode(filp)->i_stream = WRITE_LIFE_LONG;
> +	} else if (flags & RWF_WRITE_LIFE_EXTREME) {
> +		kiocb.ki_flags |= IOCB_WRITE_LIFE_EXTREME;
> +		file_inode(filp)->i_stream = WRITE_LIFE_EXTREME;
> +	}

This should just pass all of the bits through:

	if (flags & RWF_WRITE_LIFE_MASK) {
		file_inode(filp)->i_stream =
			((flags & RWF_WRITE_LIFE_MASK) >> RWF_WRITE_LIFE_SHIFT;
		kiocb.ki_flags |=
			file_inode(filp)->i_stream << IOCB_WRITE_LIFE_SHIFT;
	}

> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index bb8c246eeda8..7f542e0b0e17 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -269,6 +269,10 @@ struct writeback_control;
> #define IOCB_DSYNC		(1 << 4)
> #define IOCB_SYNC		(1 << 5)
> #define IOCB_WRITE		(1 << 6)
> +#define IOCB_WRITE_LIFE_SHORT	(1 << 7)
> +#define IOCB_WRITE_LIFE_MEDIUM	(1 << 8)
> +#define IOCB_WRITE_LIFE_LONG	(1 << 9)
> +#define IOCB_WRITE_LIFE_EXTREME	(1 << 10)

#define IOCB_WRITE_LIFE_SHIFT	7
#define IOCB_WRITE_LIFE_MASK	(0xf << IOCB_WRITE_LIFE_SHIFT)

or if you want to make it clear which bits are actually used, either:

#define IOCB_WRITE_LIFE_MASK	((1 << 7) | (1 << 8) | (1 << 9) | (1 <<10))

or

#define IOCB_WRITE_LIFE_MASK	(BIT(7) | BIT(8) | BIT(9) | BIT(10))

> @@ -293,6 +297,20 @@ static inline void init_sync_kiocb(struct kiocb *kiocb, struct file *filp)
> 	};
> }
> 
> +static inline int iocb_streamid(const struct kiocb *iocb)
> +{
> +	if (iocb->ki_flags & IOCB_WRITE_LIFE_SHORT)
> +		return WRITE_LIFE_SHORT;
> +	else if (iocb->ki_flags & IOCB_WRITE_LIFE_MEDIUM)
> +		return WRITE_LIFE_MEDIUM;
> +	else if (iocb->ki_flags & IOCB_WRITE_LIFE_LONG)
> +		return WRITE_LIFE_LONG;
> +	else if (iocb->ki_flags & IOCB_WRITE_LIFE_EXTREME)
> +		return WRITE_LIFE_EXTREME;
> +
> +	return WRITE_LIFE_UNKNOWN;
> +}
> +
> /*
>  * "descriptor" for what we're up to with a read.
>  * This allows us to use the same read code yet
> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index 24e61a54feaa..34145b29657e 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -360,5 +360,9 @@ struct fscrypt_key {
> #define RWF_HIPRI			0x00000001 /* high priority request, poll if possible */
> #define RWF_DSYNC			0x00000002 /* per-IO O_DSYNC */
> #define RWF_SYNC			0x00000004 /* per-IO O_SYNC */
> +#define RWF_WRITE_LIFE_SHORT		0x00000008 /* short life time write */
> +#define RWF_WRITE_LIFE_MEDIUM		0x00000010 /* medium life time write */
> +#define RWF_WRITE_LIFE_LONG		0x00000020 /* long life time write */
> +#define RWF_WRITE_LIFE_EXTREME		0x00000040 /* extremely long life time write */

#define RWF_WRITE_LIFE_SHIFT		3
#define RWF_WRITE_LIFE_MASK		(0xf << RWF_WRITE_LIFE_SHIFT)

or

#define RWF_WRITE_LIFE_MASK		0x00000078

Alternately, just leave a hole in the RWF flags and use

#define RWF_WRITE_LIFE_SHIFT		4
#define	RWF_WRITE_LIFE_MASK		0x000000f0

> 
> #endif /* _UAPI_LINUX_FS_H */
> --
> 2.7.4
> 


Cheers, Andreas






[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 06/11] fs: add O_DIRECT support for sending down bio stream information
  2017-06-13 17:15 ` [PATCH 06/11] fs: add O_DIRECT support for sending down bio stream information Jens Axboe
@ 2017-06-13 19:38   ` Andreas Dilger
  0 siblings, 0 replies; 40+ messages in thread
From: Andreas Dilger @ 2017-06-13 19:38 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-fsdevel, linux-block

[-- Attachment #1: Type: text/plain, Size: 2153 bytes --]


> On Jun 13, 2017, at 11:15 AM, Jens Axboe <axboe@kernel.dk> wrote:
> 
> Signed-off-by: Jens Axboe <axboe@kernel.dk>

This patch is itself fine, if iocb_streamid() is fixed to pass the ID
through without selecting only a single bit of the ID.

Reviewed-by: Andreas Dilger <adilger@dilger.ca>

> ---
> fs/block_dev.c | 2 ++
> fs/direct-io.c | 2 ++
> fs/iomap.c     | 1 +
> 3 files changed, 5 insertions(+)
> 
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 519599dddd36..284b8a786283 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -227,6 +227,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
> 	bio.bi_iter.bi_sector = pos >> 9;
> 	bio.bi_private = current;
> 	bio.bi_end_io = blkdev_bio_end_io_simple;
> +	bio.bi_stream = iocb_streamid(iocb);
> 
> 	ret = bio_iov_iter_get_pages(&bio, iter);
> 	if (unlikely(ret))
> @@ -360,6 +361,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
> 		bio->bi_iter.bi_sector = pos >> 9;
> 		bio->bi_private = dio;
> 		bio->bi_end_io = blkdev_bio_end_io;
> +		bio->bi_stream = iocb_streamid(iocb);
> 
> 		ret = bio_iov_iter_get_pages(bio, iter);
> 		if (unlikely(ret)) {
> diff --git a/fs/direct-io.c b/fs/direct-io.c
> index a04ebea77de8..c9c8b9fd4329 100644
> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -386,6 +386,8 @@ dio_bio_alloc(struct dio *dio, struct dio_submit *sdio,
> 	else
> 		bio->bi_end_io = dio_bio_end_io;
> 
> +	bio->bi_stream = iocb_streamid(dio->iocb);
> +
> 	sdio->bio = bio;
> 	sdio->logical_offset_in_bio = sdio->cur_page_fs_offset;
> }
> diff --git a/fs/iomap.c b/fs/iomap.c
> index 4b10892967a5..fa7d29632fbc 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -804,6 +804,7 @@ iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
> 
> 		if (dio->flags & IOMAP_DIO_WRITE) {
> 			bio_set_op_attrs(bio, REQ_OP_WRITE, REQ_SYNC | REQ_IDLE);
> +			bio_set_streamid(bio, inode_streamid(inode));
> 			task_io_account_write(bio->bi_iter.bi_size);
> 		} else {
> 			bio_set_op_attrs(bio, REQ_OP_READ, 0);
> --
> 2.7.4
> 


Cheers, Andreas






[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 07/11] add support for buffered writeback to pass down stream information
  2017-06-13 17:15 ` [PATCH 07/11] fs: add support for buffered writeback to pass down " Jens Axboe
@ 2017-06-13 19:39   ` Andreas Dilger
  0 siblings, 0 replies; 40+ messages in thread
From: Andreas Dilger @ 2017-06-13 19:39 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-fsdevel, linux-block

[-- Attachment #1: Type: text/plain, Size: 2862 bytes --]

On Jun 13, 2017, at 11:15 AM, Jens Axboe <axboe@kernel.dk> wrote:
> 
> Signed-off-by: Jens Axboe <axboe@kernel.dk>

Reviewed-by: Andreas Dilger <adilger@dilger.ca>

> ---
> fs/buffer.c | 14 +++++++++-----
> fs/mpage.c  |  1 +
> 2 files changed, 10 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 161be58c5cb0..8324c24751ca 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -49,7 +49,7 @@
> 
> static int fsync_buffers_list(spinlock_t *lock, struct list_head *list);
> static int submit_bh_wbc(int op, int op_flags, struct buffer_head *bh,
> -			 struct writeback_control *wbc);
> +			 unsigned int stream, struct writeback_control *wbc);
> 
> #define BH_ENTRY(list) list_entry((list), struct buffer_head, b_assoc_buffers)
> 
> @@ -1829,7 +1829,8 @@ int __block_write_full_page(struct inode *inode, struct page *page,
> 	do {
> 		struct buffer_head *next = bh->b_this_page;
> 		if (buffer_async_write(bh)) {
> -			submit_bh_wbc(REQ_OP_WRITE, write_flags, bh, wbc);
> +			submit_bh_wbc(REQ_OP_WRITE, write_flags, bh,
> +					inode_streamid(inode), wbc);
> 			nr_underway++;
> 		}
> 		bh = next;
> @@ -1883,7 +1884,8 @@ int __block_write_full_page(struct inode *inode, struct page *page,
> 		struct buffer_head *next = bh->b_this_page;
> 		if (buffer_async_write(bh)) {
> 			clear_buffer_dirty(bh);
> -			submit_bh_wbc(REQ_OP_WRITE, write_flags, bh, wbc);
> +			submit_bh_wbc(REQ_OP_WRITE, write_flags, bh,
> +					inode_streamid(inode), wbc);
> 			nr_underway++;
> 		}
> 		bh = next;
> @@ -3091,7 +3093,7 @@ void guard_bio_eod(int op, struct bio *bio)
> }
> 
> static int submit_bh_wbc(int op, int op_flags, struct buffer_head *bh,
> -			 struct writeback_control *wbc)
> +			 unsigned int stream, struct writeback_control *wbc)
> {
> 	struct bio *bio;
> 
> @@ -3130,6 +3132,8 @@ static int submit_bh_wbc(int op, int op_flags, struct buffer_head *bh,
> 	/* Take care of bh's that straddle the end of the device */
> 	guard_bio_eod(op, bio);
> 
> +	bio_set_streamid(bio, stream);
> +
> 	if (buffer_meta(bh))
> 		op_flags |= REQ_META;
> 	if (buffer_prio(bh))
> @@ -3142,7 +3146,7 @@ static int submit_bh_wbc(int op, int op_flags, struct buffer_head *bh,
> 
> int submit_bh(int op, int op_flags, struct buffer_head *bh)
> {
> -	return submit_bh_wbc(op, op_flags, bh, NULL);
> +	return submit_bh_wbc(op, op_flags, bh, 0, NULL);
> }
> EXPORT_SYMBOL(submit_bh);
> 
> diff --git a/fs/mpage.c b/fs/mpage.c
> index baff8f820c29..a9d40c0c053e 100644
> --- a/fs/mpage.c
> +++ b/fs/mpage.c
> @@ -614,6 +614,7 @@ static int __mpage_writepage(struct page *page, struct writeback_control *wbc,
> 			goto confused;
> 
> 		wbc_init_bio(wbc, bio);
> +		bio_set_streamid(bio, inode_streamid(inode));
> 	}
> 
> 	/*
> --
> 2.7.4
> 


Cheers, Andreas






[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 08/11] ext4: add support for passing in stream information for buffered writes
  2017-06-13 17:15 ` [PATCH 08/11] ext4: add support for passing in stream information for buffered writes Jens Axboe
@ 2017-06-13 19:40   ` Andreas Dilger
  0 siblings, 0 replies; 40+ messages in thread
From: Andreas Dilger @ 2017-06-13 19:40 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-fsdevel, linux-block

[-- Attachment #1: Type: text/plain, Size: 1061 bytes --]

On Jun 13, 2017, at 11:15 AM, Jens Axboe <axboe@kernel.dk> wrote:
> 
> Signed-off-by: Jens Axboe <axboe@kernel.dk>

Reviewed-by: Andreas Dilger <adilger@dilger.ca>

> ---
> fs/ext4/page-io.c | 2 ++
> 1 file changed, 2 insertions(+)
> 
> diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
> index 1a82138ba739..033b5bfa4e0b 100644
> --- a/fs/ext4/page-io.c
> +++ b/fs/ext4/page-io.c
> @@ -350,6 +350,7 @@ void ext4_io_submit(struct ext4_io_submit *io)
> 		int io_op_flags = io->io_wbc->sync_mode == WB_SYNC_ALL ?
> 				  REQ_SYNC : 0;
> 		bio_set_op_attrs(io->io_bio, REQ_OP_WRITE, io_op_flags);
> +		bio_set_streamid(io->io_bio, inode_streamid(io->io_end->inode));
> 		submit_bio(io->io_bio);
> 	}
> 	io->io_bio = NULL;
> @@ -396,6 +397,7 @@ static int io_submit_add_bh(struct ext4_io_submit *io,
> 		ret = io_submit_init_bio(io, bh);
> 		if (ret)
> 			return ret;
> +		bio_set_streamid(io->io_bio, inode_streamid(inode));
> 	}
> 	ret = bio_add_page(io->io_bio, page, bh->b_size, bh_offset(bh));
> 	if (ret != bh->b_size)
> --
> 2.7.4
> 


Cheers, Andreas






[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 09/11] xfs: add support for passing in stream information for buffered writes
  2017-06-13 17:15 ` [PATCH 09/11] xfs: " Jens Axboe
@ 2017-06-13 19:40   ` Andreas Dilger
  0 siblings, 0 replies; 40+ messages in thread
From: Andreas Dilger @ 2017-06-13 19:40 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-fsdevel, linux-block

[-- Attachment #1: Type: text/plain, Size: 927 bytes --]

On Jun 13, 2017, at 11:15 AM, Jens Axboe <axboe@kernel.dk> wrote:
> 
> Signed-off-by: Jens Axboe <axboe@kernel.dk>

Reviewed-by: Andreas Dilger <adilger@dilger.ca>

> ---
> fs/xfs/xfs_aops.c | 2 ++
> 1 file changed, 2 insertions(+)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 09af0f7cd55e..9770be0140ad 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -505,6 +505,7 @@ xfs_submit_ioend(
> 		return status;
> 	}
> 
> +	bio_set_streamid(ioend->io_bio, inode_streamid(ioend->io_inode));
> 	submit_bio(ioend->io_bio);
> 	return 0;
> }
> @@ -564,6 +565,7 @@ xfs_chain_bio(
> 	bio_chain(ioend->io_bio, new);
> 	bio_get(ioend->io_bio);		/* for xfs_destroy_ioend */
> 	ioend->io_bio->bi_opf = REQ_OP_WRITE | wbc_to_write_flags(wbc);
> +	bio_set_streamid(ioend->io_bio, inode_streamid(ioend->io_inode));
> 	submit_bio(ioend->io_bio);
> 	ioend->io_bio = new;
> }
> --
> 2.7.4
> 


Cheers, Andreas






[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 10/11] btrfs: add support for passing in stream information for buffered writes
  2017-06-13 17:15 ` [PATCH 10/11] btrfs: " Jens Axboe
@ 2017-06-13 19:41   ` Andreas Dilger
  0 siblings, 0 replies; 40+ messages in thread
From: Andreas Dilger @ 2017-06-13 19:41 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-fsdevel, linux-block

[-- Attachment #1: Type: text/plain, Size: 1337 bytes --]

On Jun 13, 2017, at 11:15 AM, Jens Axboe <axboe@kernel.dk> wrote:
> 
> Signed-off-by: Jens Axboe <axboe@kernel.dk>

Reviewed-by: Andreas Dilger <adilger@dilger.ca>

> ---
> fs/btrfs/extent_io.c | 1 +
> fs/btrfs/inode.c     | 1 +
> 2 files changed, 2 insertions(+)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index d3619e010005..b245085e8f10 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -2827,6 +2827,7 @@ static int submit_extent_page(int op, int op_flags, struct extent_io_tree *tree,
> 	bio->bi_end_io = end_io_func;
> 	bio->bi_private = tree;
> 	bio_set_op_attrs(bio, op, op_flags);
> +	bio_set_streamid(bio, inode_streamid(page->mapping->host));
> 	if (wbc) {
> 		wbc_init_bio(wbc, bio);
> 		wbc_account_io(wbc, page, page_size);
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index ef3c98c527c1..db0558a19f65 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -8608,6 +8608,7 @@ static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode,
> 	atomic_set(&dip->pending_bios, 0);
> 	btrfs_bio = btrfs_io_bio(io_bio);
> 	btrfs_bio->logical = file_offset;
> +	bio_set_streamid(io_bio, bio_stream(dio_bio));
> 
> 	if (write) {
> 		io_bio->bi_end_io = btrfs_endio_direct_write;
> --
> 2.7.4
> 


Cheers, Andreas






[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 11/11] nvme: add support for streams and directives
  2017-06-13 17:15 ` [PATCH 11/11] nvme: add support for streams and directives Jens Axboe
@ 2017-06-13 19:47   ` Andreas Dilger
  2017-06-13 20:25     ` Jens Axboe
  2017-06-13 21:12   ` Andreas Dilger
  1 sibling, 1 reply; 40+ messages in thread
From: Andreas Dilger @ 2017-06-13 19:47 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-fsdevel, linux-block

[-- Attachment #1: Type: text/plain, Size: 9843 bytes --]

On Jun 13, 2017, at 11:15 AM, Jens Axboe <axboe@kernel.dk> wrote:
> 
> This adds support for Directives in NVMe, particular for the Streams
> directive. We default to allocating 4 streams per name space, but
> it is configurable with the 'streams_per_ns' module option.
> 
> If a write stream is set in a write, flag is as such before
> sending it to the device.
> 
> Some debug stuff in this patch, dumping streams ID params when
> we load nvme.
> 
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
> drivers/nvme/host/core.c | 124 +++++++++++++++++++++++++++++++++++++++++++++++
> drivers/nvme/host/nvme.h |   1 +
> include/linux/nvme.h     |  48 ++++++++++++++++++
> 3 files changed, 173 insertions(+)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 903d5813023a..81225e7d4176 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -65,6 +65,10 @@ static bool force_apst;
> module_param(force_apst, bool, 0644);
> MODULE_PARM_DESC(force_apst, "allow APST for newly enumerated devices even if quirked off");
> 
> +static char streams_per_ns = 4;
> +module_param(streams_per_ns, byte, 0644);
> +MODULE_PARM_DESC(streams_per_ns, "if available, allocate this many streams per NS");

Are there any limits here?  For example, has to be a power-of-two value, or maximum
number per namespace, per device, etc?

> static LIST_HEAD(nvme_ctrl_list);
> static DEFINE_SPINLOCK(dev_list_lock);
> 
> @@ -351,6 +355,15 @@ static inline void nvme_setup_rw(struct nvme_ns *ns, struct request *req,
> 	cmnd->rw.slba = cpu_to_le64(nvme_block_nr(ns, blk_rq_pos(req)));
> 	cmnd->rw.length = cpu_to_le16((blk_rq_bytes(req) >> ns->lba_shift) - 1);
> 
> +	if (req_op(req) == REQ_OP_WRITE) {
> +		if (bio_stream_valid(req->bio) && ns->streams) {
> +			unsigned stream = bio_stream(req->bio) & 0xffff;
> +
> +			control |= NVME_RW_DTYPE_STREAMS;
> +			dsmgmt |= (stream << 16);

It isn't really clear how this is converted from the 2^16 stream IDs masked
out of 2^32 possible streams in the bio into streams_per_ns = 4 streams per
NVMe device namespace.  This appears to be implicitly dependent on upper
layers submitting only 4 distinct WRITE_FILE_* stream IDs, but IMHO that
should be handled transparently at this level.  There isn't really any value
to mask the returned bio_stream with 0xffff, since either it will internally
be truncated by streams_per_ns (in which case we don't need to do anything),
or it should be explicitly folded into the accepted range, like:

			dsmgmt |= (bio_stream(req->bio) % streams_per_ns) << 16;

or if we want to avoid a 32-bit modulus here, we could pre-compute a mask from
streams_per_ns and enforce that this be a power-of-two value?

Cheers, Andreas

> +		}
> +	}
> +
> 	if (ns->ms) {
> 		switch (ns->pi_type) {
> 		case NVME_NS_DPS_PI_TYPE3:
> @@ -1073,6 +1086,109 @@ static int nvme_revalidate_disk(struct gendisk *disk)
> 	return 0;
> }
> 
> +static int nvme_enable_streams(struct nvme_ns *ns)
> +{
> +	struct nvme_command c;
> +
> +	memset(&c, 0, sizeof(c));
> +
> +	c.directive.opcode = nvme_admin_directive_send;
> +	c.directive.nsid = cpu_to_le32(ns->ns_id);
> +	c.directive.doper = NVME_DIR_SND_ID_OP_ENABLE;
> +	c.directive.dtype = NVME_DIR_IDENTIFY;
> +	c.directive.tdtype = NVME_DIR_STREAMS;
> +	c.directive.endir = NVME_DIR_ENDIR;
> +
> +	return nvme_submit_sync_cmd(ns->ctrl->admin_q, &c, NULL, 0);
> +}
> +
> +static int nvme_streams_params(struct nvme_ns *ns)
> +{
> +	struct nvme_ctrl *ctrl = ns->ctrl;
> +	struct streams_directive_params s;
> +	struct nvme_command c;
> +	int ret;
> +
> +	memset(&c, 0, sizeof(c));
> +	memset(&s, 0, sizeof(s));
> +
> +	c.directive.opcode = nvme_admin_directive_recv;
> +	c.directive.nsid = cpu_to_le32(ns->ns_id);
> +	c.directive.numd = sizeof(s);
> +	c.directive.doper = NVME_DIR_RCV_ST_OP_PARAM;
> +	c.directive.dtype = NVME_DIR_STREAMS;
> +
> +	ret = nvme_submit_sync_cmd(ctrl->admin_q, &c, &s, sizeof(s));
> +	if (ret)
> +		return ret;
> +
> +	s.msl = le16_to_cpu(s.msl);
> +	s.nssa = le16_to_cpu(s.nssa);
> +	s.nsso = le16_to_cpu(s.nsso);
> +	s.sws = le32_to_cpu(s.sws);
> +	s.sgs = le16_to_cpu(s.sgs);
> +	s.nsa = le16_to_cpu(s.nsa);
> +	s.nso = le16_to_cpu(s.nso);
> +
> +	dev_info(ctrl->device, "streams: msl=%u, nssa=%u, nsso=%u, sws=%u "
> +				"sgs=%u, nsa=%u, nso=%u\n", s.msl, s.nssa,
> +				s.nsso, s.sws, s.sgs, s.nsa, s.nso);
> +	return 0;
> +}
> +
> +static int nvme_streams_allocate(struct nvme_ns *ns, unsigned int streams)
> +{
> +	struct nvme_command c;
> +
> +	memset(&c, 0, sizeof(c));
> +
> +	c.directive.opcode = nvme_admin_directive_recv;
> +	c.directive.nsid = cpu_to_le32(ns->ns_id);
> +	c.directive.doper = NVME_DIR_RCV_ST_OP_RESOURCE;
> +	c.directive.dtype = NVME_DIR_STREAMS;
> +	c.directive.endir = streams;
> +
> +	return nvme_submit_sync_cmd(ns->ctrl->admin_q, &c, NULL, 0);
> +}
> +
> +static int nvme_streams_deallocate(struct nvme_ns *ns)
> +{
> +	struct nvme_command c;
> +
> +	memset(&c, 0, sizeof(c));
> +
> +	c.directive.opcode = nvme_admin_directive_send;
> +	c.directive.nsid = cpu_to_le32(ns->ns_id);
> +	c.directive.doper = NVME_DIR_SND_ST_OP_REL_RSC;
> +	c.directive.dtype = NVME_DIR_STREAMS;
> +
> +	return nvme_submit_sync_cmd(ns->ctrl->admin_q, &c, NULL, 0);
> +}
> +
> +static void nvme_config_streams(struct nvme_ns *ns)
> +{
> +	int ret;
> +
> +	ret = nvme_enable_streams(ns);
> +	if (ret)
> +		return;
> +
> +	ret = nvme_streams_params(ns);
> +	if (ret)
> +		return;
> +
> +	ret = nvme_streams_allocate(ns, streams_per_ns);
> +	if (ret)
> +		return;
> +
> +	ret = nvme_streams_params(ns);
> +	if (ret)
> +		return;
> +
> +	ns->streams = true;
> +	dev_info(ns->ctrl->device, "successfully enabled streams\n");
> +}
> +
> static char nvme_pr_type(enum pr_type type)
> {
> 	switch (type) {
> @@ -1606,6 +1722,9 @@ int nvme_init_identify(struct nvme_ctrl *ctrl)
> 	ctrl->sgls = le32_to_cpu(id->sgls);
> 	ctrl->kas = le16_to_cpu(id->kas);
> 
> +	if (ctrl->oacs & NVME_CTRL_OACS_DIRECTIVES)
> +		dev_info(ctrl->dev, "supports directives\n");
> +
> 	ctrl->npss = id->npss;
> 	prev_apsta = ctrl->apsta;
> 	if (ctrl->quirks & NVME_QUIRK_NO_APST) {
> @@ -2060,6 +2179,9 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
> 		goto out_free_id;
> 	}
> 
> +	if (ctrl->oacs & NVME_CTRL_OACS_DIRECTIVES)
> +		nvme_config_streams(ns);
> +
> 	disk = alloc_disk_node(0, node);
> 	if (!disk)
> 		goto out_free_id;
> @@ -2112,6 +2234,8 @@ static void nvme_ns_remove(struct nvme_ns *ns)
> 					&nvme_ns_attr_group);
> 		if (ns->ndev)
> 			nvme_nvm_unregister_sysfs(ns);
> +		if (ns->streams)
> +			nvme_streams_deallocate(ns);
> 		del_gendisk(ns->disk);
> 		blk_cleanup_queue(ns->queue);
> 	}
> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> index 9d6a070d4391..c2d8d23c90de 100644
> --- a/drivers/nvme/host/nvme.h
> +++ b/drivers/nvme/host/nvme.h
> @@ -195,6 +195,7 @@ struct nvme_ns {
> 	int lba_shift;
> 	u16 ms;
> 	bool ext;
> +	bool streams;
> 	u8 pi_type;
> 	unsigned long flags;
> 
> diff --git a/include/linux/nvme.h b/include/linux/nvme.h
> index b625bacf37ef..8b2f5b140134 100644
> --- a/include/linux/nvme.h
> +++ b/include/linux/nvme.h
> @@ -245,6 +245,7 @@ enum {
> 	NVME_CTRL_ONCS_WRITE_ZEROES		= 1 << 3,
> 	NVME_CTRL_VWC_PRESENT			= 1 << 0,
> 	NVME_CTRL_OACS_SEC_SUPP                 = 1 << 0,
> +	NVME_CTRL_OACS_DIRECTIVES		= 1 << 5,
> 	NVME_CTRL_OACS_DBBUF_SUPP		= 1 << 7,
> };
> 
> @@ -295,6 +296,19 @@ enum {
> };
> 
> enum {
> +	NVME_DIR_IDENTIFY		= 0x00,
> +	NVME_DIR_STREAMS		= 0x01,
> +	NVME_DIR_SND_ID_OP_ENABLE	= 0x01,
> +	NVME_DIR_SND_ST_OP_REL_ID	= 0x01,
> +	NVME_DIR_SND_ST_OP_REL_RSC	= 0x02,
> +	NVME_DIR_RCV_ID_OP_PARAM	= 0x01,
> +	NVME_DIR_RCV_ST_OP_PARAM	= 0x01,
> +	NVME_DIR_RCV_ST_OP_STATUS	= 0x02,
> +	NVME_DIR_RCV_ST_OP_RESOURCE	= 0x03,
> +	NVME_DIR_ENDIR			= 0x01,
> +};
> +
> +enum {
> 	NVME_NS_FEAT_THIN	= 1 << 0,
> 	NVME_NS_FLBAS_LBA_MASK	= 0xf,
> 	NVME_NS_FLBAS_META_EXT	= 0x10,
> @@ -535,6 +549,7 @@ enum {
> 	NVME_RW_PRINFO_PRCHK_APP	= 1 << 11,
> 	NVME_RW_PRINFO_PRCHK_GUARD	= 1 << 12,
> 	NVME_RW_PRINFO_PRACT		= 1 << 13,
> +	NVME_RW_DTYPE_STREAMS		= 1 << 4,
> };
> 
> struct nvme_dsm_cmd {
> @@ -604,6 +619,8 @@ enum nvme_admin_opcode {
> 	nvme_admin_download_fw		= 0x11,
> 	nvme_admin_ns_attach		= 0x15,
> 	nvme_admin_keep_alive		= 0x18,
> +	nvme_admin_directive_send	= 0x19,
> +	nvme_admin_directive_recv	= 0x1a,
> 	nvme_admin_dbbuf		= 0x7C,
> 	nvme_admin_format_nvm		= 0x80,
> 	nvme_admin_security_send	= 0x81,
> @@ -756,6 +773,24 @@ struct nvme_get_log_page_command {
> 	__u32			rsvd14[2];
> };
> 
> +struct nvme_directive_cmd {
> +	__u8			opcode;
> +	__u8			flags;
> +	__u16			command_id;
> +	__le32			nsid;
> +	__u64			rsvd2[2];
> +	union nvme_data_ptr	dptr;
> +	__le32			numd;
> +	__u8			doper;
> +	__u8			dtype;
> +	__le16			dspec;
> +	__u8			endir;
> +	__u8			tdtype;
> +	__u16			rsvd15;
> +
> +	__u32			rsvd16[3];
> +};
> +
> /*
>  * Fabrics subcommands.
>  */
> @@ -886,6 +921,18 @@ struct nvme_dbbuf {
> 	__u32			rsvd12[6];
> };
> 
> +struct streams_directive_params {
> +	__u16	msl;
> +	__u16	nssa;
> +	__u16	nsso;
> +	__u8	rsvd[10];
> +	__u32	sws;
> +	__u16	sgs;
> +	__u16	nsa;
> +	__u16	nso;
> +	__u8	rsvd2[6];
> +};
> +
> struct nvme_command {
> 	union {
> 		struct nvme_common_command common;
> @@ -906,6 +953,7 @@ struct nvme_command {
> 		struct nvmf_property_set_command prop_set;
> 		struct nvmf_property_get_command prop_get;
> 		struct nvme_dbbuf dbbuf;
> +		struct nvme_directive_cmd directive;
> 	};
> };
> 
> --
> 2.7.4
> 


Cheers, Andreas






[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 02/11] block: add definition for support data write life times
  2017-06-13 19:07   ` Andreas Dilger
@ 2017-06-13 20:10     ` Jens Axboe
  0 siblings, 0 replies; 40+ messages in thread
From: Jens Axboe @ 2017-06-13 20:10 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: linux-fsdevel, linux-block

On 06/13/2017 01:07 PM, Andreas Dilger wrote:
> On Jun 13, 2017, at 11:15 AM, Jens Axboe <axboe@kernel.dk> wrote:
>>
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>> ---
>> include/linux/blk_types.h | 13 +++++++++++++
>> 1 file changed, 13 insertions(+)
>>
>> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
>> index 1940876a7fa7..16c4e657807f 100644
>> --- a/include/linux/blk_types.h
>> +++ b/include/linux/blk_types.h
>> @@ -314,6 +314,19 @@ struct blk_rq_stat {
>> 	u64 batch;
>> };
>>
>> +/*
>> + * Data life time definitions for writes. These are relative within a
>> + * given device, no absolute meaning is attached to them.
>> + */
>> +enum {
>> +	WRITE_LIFE_UNKNOWN	= 0,
>> +	WRITE_LIFE_SHORT,
>> +	WRITE_LIFE_MEDIUM,
>> +	WRITE_LIFE_LONG,
>> +	WRITE_LIFE_EXTREME,
>> +	WRITE_LIFE_NR,
>> +};
> 
> With the exception of these enums, there is actually enough space in the
> various structures to pass at least a 4-bit stream ID through the whole
> stack (actually, a full 32-bit stream ID, except for the debugfs stats).
> 
> I'm not necessarily against having some pre-defined meanings for these
> values, but with a few trivial changes to the code it would be possible
> to make this functionality much more flexible.

One thing I'm envisioning, down the line, is to have sets of streams. We
have the four streams, should be good enough for any app. But for multiple
apps, it becomes more difficult. So we could essentially shift the stream
ID up and share this space, and also utilize more streams on the back end
for that. For software caching solution, they can already factor in the
application, but for hardware devices, we could make use of that.

So I don't think there's anything wrong with the current approach. The
whole point was not to over-complicate or over-engineer this.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 03/11] blk-mq: expose stream write stats through debugfs
  2017-06-13 19:21   ` Andreas Dilger
@ 2017-06-13 20:11     ` Jens Axboe
  0 siblings, 0 replies; 40+ messages in thread
From: Jens Axboe @ 2017-06-13 20:11 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: linux-fsdevel, linux-block

On 06/13/2017 01:21 PM, Andreas Dilger wrote:
> On Jun 13, 2017, at 11:15 AM, Jens Axboe <axboe@kernel.dk> wrote:
>>
>> Useful to verify that things are working the way they should.
>> Reading the file will return number of kb written to each
>> stream. Writing the file will reset the statistics. No care
>> is taken to ensure that we don't race on updates.
> 
> This is one of the few places where there is an arbitrary limit on the
> number of stream IDs, which is strange because it doesn't correspond to
> any limit in a struct field or the physical hardware or external protocol...
> 
> Bumping the stream_writes[] stats array up to 16 wouldn't make any
> significant difference to the size of struct request_queue.

For my testing, it was actually 16. Until we enable the use of more
than 4 streams, I think we should limit it to just 4. But yes, cheap
to change.

>> @@ -2057,6 +2057,8 @@ blk_qc_t generic_make_request(struct bio *bio)
>> 	do {
>> 		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
>>
>> +		q->stream_writes[bio_stream(bio)] += bio->bi_iter.bi_size >> 9;
> 
> This should probably limit the value returned from bio_stream(), and put values
>> = WRITE_LIFE_NR into an "overflow" bucket or similar?  At some point in the
> future this could be made more flexible, based on the actual usage, but at
> least the limitation on the implementation isn't something arbitrary like stats.

Agree, we need to either check here or ensure that bio_stream() never returns
anything above WRITE_LIFE_EXTREME. I'll make that change.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/11] Add support for write life time hints
  2017-06-13 19:21     ` Andreas Dilger
@ 2017-06-13 20:13       ` Jens Axboe
  2017-06-13 20:45         ` Andreas Dilger
  0 siblings, 1 reply; 40+ messages in thread
From: Jens Axboe @ 2017-06-13 20:13 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: linux-fsdevel, linux-block

On 06/13/2017 01:21 PM, Andreas Dilger wrote:
> On Jun 13, 2017, at 12:26 PM, Jens Axboe <axboe@kernel.dk> wrote:
>>
>> On 06/13/2017 12:04 PM, Andreas Dilger wrote:
>>> On Jun 13, 2017, at 11:15 AM, Jens Axboe <axboe@kernel.dk> wrote:
>>>>
>>>> A new iteration of this patchset, previously known as write streams.
>>>> Instead of exposing numeric values for streams, I've previously
>>>> advocated for just doing a set of hints that makes sense instead. See
>>>> the coverage from the LSFMM summit this year:
>>>>
>>>> https://lwn.net/Articles/717755/
>>>>
>>>> This patchset attempts to do that. We define 4 flags for the pwritev2
>>>> system call:
>>>>
>>>> RWF_WRITE_LIFE_SHORT	Data written with this flag is expected to have
>>>> 			a high overwrite rate, or life time.
>>>>
>>>> RWF_WRITE_LIFE_MEDIUM	Longer life time than SHORT
>>>>
>>>> RWF_WRITE_LIFE_LONG	Longer life time than MEDIUM
>>>>
>>>> RWF_WRITE_LIFE_EXTREME	Longer life time than LONG
>>>>
>>>> The idea is that these are relative values, so an application can
>>>> use them as they see fit. The underlying device can then place
>>>> data appropriately, or be free to ignore the hint. It's just a hint.
>>>>
>>>> Comments appreciated.
>>>
>>> I thought that one of the major attractions of numeric stream IDs was
>>> that they had no semantic meanings, just "N is similar to N" and "M is
>>> similar to M", and it is up to userspace to define what these mean?
>>>
>>> That allows userspace to use the IDs for lifetimes (as above), but
>>> also/instead use them for allocation sizes, different applications,
>>> different users, etc.
>>
>> Right, that is indeed the intent. But we have to attach some naming
>> to them. Userspace could in theory use these totally randomly, and
>> things like NVMe would not care. But the semantic meaning of "short"
>> vs "long" is important on caching infrastructure where you might
>> want to use the hint for data placement.
>>
>> I think the important part here is that no absolute meaning is
>> attached to them, only relative.
> 
> In both IOCB_WRITE_LIFE_* and RWF_WRITE_LIFE_* this is consuming 4 bits of
> space (which is itself fine) for only 4 different stream IDs.  Why not just
> shift a 4-bit arbitrary stream ID to the appropriate offset in those fields,
> rather than treating them as 4 individual bits and allowing only one of
> them to be passed down the stack at a time?

I did think about that, and I'm a bit split on it. It turns a bit mask
into a hybrid beast, with bits and sets of bits for values.

For utilization of the space, yes, we could just use 2 bits instead of
the 4. Or use the 4 bits and potentially have the app pass in up to 16
values. For the latter, I'm still very much in favor of keeping the app
interface super simple and just retaining the 4 life time types.

If folks feel strongly about the wasted space, and I can definitely
revisit and just pack it.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 04/11] add support for an inode to carry stream related data
  2017-06-13 19:24   ` [PATCH 04/11] " Andreas Dilger
@ 2017-06-13 20:14     ` Jens Axboe
  0 siblings, 0 replies; 40+ messages in thread
From: Jens Axboe @ 2017-06-13 20:14 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: linux-fsdevel, linux-block

On 06/13/2017 01:24 PM, Andreas Dilger wrote:
> On Jun 13, 2017, at 11:15 AM, Jens Axboe <axboe@kernel.dk> wrote:
>>
>> No functional changes in this patch, just in preparation for
>> allowing applications to pass in hints about data life times
>> for writes.
>>
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> 
> This is fine from the POV of passing the stream ID unchanged down the stack.
> If we really wanted to limit the ID to a 16-bit value, the i_stream could fit
> into a few 16-bit holes in struct inode, to avoid increasing its size.

Thanks, I'll change it to a short and pack the hole instead. Even with
everything else in there, having 64k streams is more than we need, might
as well do that.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 05/11] add support for allowing applications to pass in write life time hints
  2017-06-13 19:36   ` [PATCH 05/11] " Andreas Dilger
@ 2017-06-13 20:21     ` Jens Axboe
  0 siblings, 0 replies; 40+ messages in thread
From: Jens Axboe @ 2017-06-13 20:21 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: linux-fsdevel, linux-block

On 06/13/2017 01:36 PM, Andreas Dilger wrote:
> On Jun 13, 2017, at 11:15 AM, Jens Axboe <axboe@kernel.dk> wrote:
>>
>> Add four flags for the pwritev2(2) system call, allowing an application
>> to give the kernel a hint about what on-media life times can be
>> expected from a given write.
>>
>> The intent is for these values to be relative to each other, no
>> absolute meaning should be attached to these flag names.
>>
>> Define IOCB flags to carry this information over, and finally
>> transform them into the block defined stream values.
>>
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>> ---
>> fs/read_write.c         | 17 ++++++++++++++++-
>> include/linux/fs.h      | 18 ++++++++++++++++++
>> include/uapi/linux/fs.h |  4 ++++
>> 3 files changed, 38 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/read_write.c b/fs/read_write.c
>> index 47c1d4484df9..1734728aa48b 100644
>> --- a/fs/read_write.c
>> +++ b/fs/read_write.c
>> @@ -678,7 +678,9 @@ static ssize_t do_iter_readv_writev(struct file *filp, struct iov_iter *iter,
>> 	struct kiocb kiocb;
>> 	ssize_t ret;
>>
>> -	if (flags & ~(RWF_HIPRI | RWF_DSYNC | RWF_SYNC))
>> +	if (flags & ~(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_WRITE_LIFE_SHORT |
>> +			RWF_WRITE_LIFE_MEDIUM | RWF_WRITE_LIFE_LONG |
>> +			RWF_WRITE_LIFE_EXTREME))
>> 		return -EOPNOTSUPP;
> 
> Since this is essentially just a 4-bit mask, which also works fine if the stream
> ID is an arbitrary value, it would be more clear to just define a mask like
> RWF_WRITE_LIFE_MASK that covers these bits instead of spelling out individual bits.

I almost defined a RWF_VALID_FLAGS mask to cover them all. How's that? I don't think
we should split these out separately, if we don't have to.

>> @@ -688,6 +690,19 @@ static ssize_t do_iter_readv_writev(struct file *filp, struct iov_iter *iter,
>> 		kiocb.ki_flags |= IOCB_DSYNC;
>> 	if (flags & RWF_SYNC)
>> 		kiocb.ki_flags |= (IOCB_DSYNC | IOCB_SYNC);
>> +	if (flags & RWF_WRITE_LIFE_SHORT) {
>> +		kiocb.ki_flags |= IOCB_WRITE_LIFE_SHORT;
>> +		file_inode(filp)->i_stream = WRITE_LIFE_SHORT;
>> +	} else if (flags & RWF_WRITE_LIFE_MEDIUM) {
>> +		kiocb.ki_flags |= IOCB_WRITE_LIFE_MEDIUM;
>> +		file_inode(filp)->i_stream = WRITE_LIFE_MEDIUM;
>> +	} else if (flags & RWF_WRITE_LIFE_LONG) {
>> +		kiocb.ki_flags |= IOCB_WRITE_LIFE_LONG;
>> +		file_inode(filp)->i_stream = WRITE_LIFE_LONG;
>> +	} else if (flags & RWF_WRITE_LIFE_EXTREME) {
>> +		kiocb.ki_flags |= IOCB_WRITE_LIFE_EXTREME;
>> +		file_inode(filp)->i_stream = WRITE_LIFE_EXTREME;
>> +	}
> 
> This should just pass all of the bits through:
> 
> 	if (flags & RWF_WRITE_LIFE_MASK) {
> 		file_inode(filp)->i_stream =
> 			((flags & RWF_WRITE_LIFE_MASK) >> RWF_WRITE_LIFE_SHIFT;
> 		kiocb.ki_flags |=
> 			file_inode(filp)->i_stream << IOCB_WRITE_LIFE_SHIFT;
> 	}

Agree, that's the nice benefit of rolling them into one, it'll make the code
more efficient too. OK, let me get that done...


-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 11/11] nvme: add support for streams and directives
  2017-06-13 19:47   ` Andreas Dilger
@ 2017-06-13 20:25     ` Jens Axboe
  0 siblings, 0 replies; 40+ messages in thread
From: Jens Axboe @ 2017-06-13 20:25 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: linux-fsdevel, linux-block

On 06/13/2017 01:47 PM, Andreas Dilger wrote:
> On Jun 13, 2017, at 11:15 AM, Jens Axboe <axboe@kernel.dk> wrote:
>>
>> This adds support for Directives in NVMe, particular for the Streams
>> directive. We default to allocating 4 streams per name space, but
>> it is configurable with the 'streams_per_ns' module option.
>>
>> If a write stream is set in a write, flag is as such before
>> sending it to the device.
>>
>> Some debug stuff in this patch, dumping streams ID params when
>> we load nvme.
>>
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>> ---
>> drivers/nvme/host/core.c | 124 +++++++++++++++++++++++++++++++++++++++++++++++
>> drivers/nvme/host/nvme.h |   1 +
>> include/linux/nvme.h     |  48 ++++++++++++++++++
>> 3 files changed, 173 insertions(+)
>>
>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>> index 903d5813023a..81225e7d4176 100644
>> --- a/drivers/nvme/host/core.c
>> +++ b/drivers/nvme/host/core.c
>> @@ -65,6 +65,10 @@ static bool force_apst;
>> module_param(force_apst, bool, 0644);
>> MODULE_PARM_DESC(force_apst, "allow APST for newly enumerated devices even if quirked off");
>>
>> +static char streams_per_ns = 4;
>> +module_param(streams_per_ns, byte, 0644);
>> +MODULE_PARM_DESC(streams_per_ns, "if available, allocate this many streams per NS");
> 
> Are there any limits here?  For example, has to be a power-of-two value, or maximum
> number per namespace, per device, etc?

It doesn't have to be a power-of-two. Basically the device has X number of
streams available, and you can assign 1..X to a name space. Since we used
four for the API, I just open four here. It's mostly for testing, in reality
we should probably just split the available streams between namespaces and
be done with it. Either we want to use streams, and then all of them, or none
at all.

>> @@ -351,6 +355,15 @@ static inline void nvme_setup_rw(struct nvme_ns *ns, struct request *req,
>> 	cmnd->rw.slba = cpu_to_le64(nvme_block_nr(ns, blk_rq_pos(req)));
>> 	cmnd->rw.length = cpu_to_le16((blk_rq_bytes(req) >> ns->lba_shift) - 1);
>>
>> +	if (req_op(req) == REQ_OP_WRITE) {
>> +		if (bio_stream_valid(req->bio) && ns->streams) {
>> +			unsigned stream = bio_stream(req->bio) & 0xffff;
>> +
>> +			control |= NVME_RW_DTYPE_STREAMS;
>> +			dsmgmt |= (stream << 16);
> 
> It isn't really clear how this is converted from the 2^16 stream IDs masked
> out of 2^32 possible streams in the bio into streams_per_ns = 4 streams per
> NVMe device namespace.  This appears to be implicitly dependent on upper
> layers submitting only 4 distinct WRITE_FILE_* stream IDs, but IMHO that
> should be handled transparently at this level.  There isn't really any value
> to mask the returned bio_stream with 0xffff, since either it will internally
> be truncated by streams_per_ns (in which case we don't need to do anything),
> or it should be explicitly folded into the accepted range, like:
> 
> 			dsmgmt |= (bio_stream(req->bio) % streams_per_ns) << 16;
> 
> or if we want to avoid a 32-bit modulus here, we could pre-compute a mask from
> streams_per_ns and enforce that this be a power-of-two value?

Right, and I think we should handle this in the block layer. Have the device
flag how many streams it supports, and just multiplex if we have to. Then the
device can rely on the fact that it never receives a stream ID that's larger
than it supports. Or just MOD it with streams_per_ns like you suggest. I'll
do the latter for now.

The 0xffff mask is typically done in drivers to inform the reader that the
spec has 16-bits set aside for this.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/11] Add support for write life time hints
  2017-06-13 20:13       ` Jens Axboe
@ 2017-06-13 20:45         ` Andreas Dilger
  2017-06-13 20:56           ` Jens Axboe
  0 siblings, 1 reply; 40+ messages in thread
From: Andreas Dilger @ 2017-06-13 20:45 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-fsdevel, linux-block

[-- Attachment #1: Type: text/plain, Size: 3957 bytes --]

On Jun 13, 2017, at 2:13 PM, Jens Axboe <axboe@kernel.dk> wrote:
> 
> On 06/13/2017 01:21 PM, Andreas Dilger wrote:
>> On Jun 13, 2017, at 12:26 PM, Jens Axboe <axboe@kernel.dk> wrote:
>>> 
>>> On 06/13/2017 12:04 PM, Andreas Dilger wrote:
>>>> On Jun 13, 2017, at 11:15 AM, Jens Axboe <axboe@kernel.dk> wrote:
>>>>> 
>>>>> A new iteration of this patchset, previously known as write streams.
>>>>> Instead of exposing numeric values for streams, I've previously
>>>>> advocated for just doing a set of hints that makes sense instead. See
>>>>> the coverage from the LSFMM summit this year:
>>>>> 
>>>>> https://lwn.net/Articles/717755/
>>>>> 
>>>>> This patchset attempts to do that. We define 4 flags for the pwritev2
>>>>> system call:
>>>>> 
>>>>> RWF_WRITE_LIFE_SHORT	Data written with this flag is expected to have
>>>>> 			a high overwrite rate, or life time.
>>>>> 
>>>>> RWF_WRITE_LIFE_MEDIUM	Longer life time than SHORT
>>>>> 
>>>>> RWF_WRITE_LIFE_LONG	Longer life time than MEDIUM
>>>>> 
>>>>> RWF_WRITE_LIFE_EXTREME	Longer life time than LONG
>>>>> 
>>>>> The idea is that these are relative values, so an application can
>>>>> use them as they see fit. The underlying device can then place
>>>>> data appropriately, or be free to ignore the hint. It's just a hint.
>>>>> 
>>>>> Comments appreciated.
>>>> 
>>>> I thought that one of the major attractions of numeric stream IDs was
>>>> that they had no semantic meanings, just "N is similar to N" and "M is
>>>> similar to M", and it is up to userspace to define what these mean?
>>>> 
>>>> That allows userspace to use the IDs for lifetimes (as above), but
>>>> also/instead use them for allocation sizes, different applications,
>>>> different users, etc.
>>> 
>>> Right, that is indeed the intent. But we have to attach some naming
>>> to them. Userspace could in theory use these totally randomly, and
>>> things like NVMe would not care. But the semantic meaning of "short"
>>> vs "long" is important on caching infrastructure where you might
>>> want to use the hint for data placement.
>>> 
>>> I think the important part here is that no absolute meaning is
>>> attached to them, only relative.
>> 
>> In both IOCB_WRITE_LIFE_* and RWF_WRITE_LIFE_* this is consuming 4 bits of
>> space (which is itself fine) for only 4 different stream IDs.  Why not just
>> shift a 4-bit arbitrary stream ID to the appropriate offset in those fields,
>> rather than treating them as 4 individual bits and allowing only one of
>> them to be passed down the stack at a time?
> 
> I did think about that, and I'm a bit split on it. It turns a bit mask
> into a hybrid beast, with bits and sets of bits for values.

I don't think that is too confusing for anyone.

> For utilization of the space, yes, we could just use 2 bits instead of
> the 4. Or use the 4 bits and potentially have the app pass in up to 16
> values. For the latter, I'm still very much in favor of keeping the app
> interface super simple and just retaining the 4 life time types.

I don't think anyone cares too much about 4 bits.  Strictly speaking,
the current implementation can't fit into 2 bits because it has 5 values
if one includes "no ID".

It is a bit of overhead to use 32 bits in most of the structs where this
field is actually stored.  I suspect that using only 8 or 16 bits in the
structs is better, and even if it doesn't find an existing hole ("pahole"
is your friend here) it will allow something else to use the remaining
space in the future.

For userspace, I think it is fine to define the WRITE_LIFE values as they
are today, and most users will use them, but IMHO it makes sense to
allow arbitrary IDs as they see fit, especially if the underlying hardware
doesn't care much about the actual values.

> If folks feel strongly about the wasted space, and I can definitely
> revisit and just pack it.

Cheers, Andreas






[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/11] Add support for write life time hints
  2017-06-13 20:45         ` Andreas Dilger
@ 2017-06-13 20:56           ` Jens Axboe
  2017-06-13 21:53             ` Andreas Dilger
  0 siblings, 1 reply; 40+ messages in thread
From: Jens Axboe @ 2017-06-13 20:56 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: linux-fsdevel, linux-block

On 06/13/2017 02:45 PM, Andreas Dilger wrote:
>>>>> I thought that one of the major attractions of numeric stream IDs was
>>>>> that they had no semantic meanings, just "N is similar to N" and "M is
>>>>> similar to M", and it is up to userspace to define what these mean?
>>>>>
>>>>> That allows userspace to use the IDs for lifetimes (as above), but
>>>>> also/instead use them for allocation sizes, different applications,
>>>>> different users, etc.
>>>>
>>>> Right, that is indeed the intent. But we have to attach some naming
>>>> to them. Userspace could in theory use these totally randomly, and
>>>> things like NVMe would not care. But the semantic meaning of "short"
>>>> vs "long" is important on caching infrastructure where you might
>>>> want to use the hint for data placement.
>>>>
>>>> I think the important part here is that no absolute meaning is
>>>> attached to them, only relative.
>>>
>>> In both IOCB_WRITE_LIFE_* and RWF_WRITE_LIFE_* this is consuming 4 bits of
>>> space (which is itself fine) for only 4 different stream IDs.  Why not just
>>> shift a 4-bit arbitrary stream ID to the appropriate offset in those fields,
>>> rather than treating them as 4 individual bits and allowing only one of
>>> them to be passed down the stack at a time?
>>
>> I did think about that, and I'm a bit split on it. It turns a bit mask
>> into a hybrid beast, with bits and sets of bits for values.
> 
> I don't think that is too confusing for anyone.

Agree, especially because it's a win on multiple fronts. I've made that
change (see the write-stream.2 branch).

>> For utilization of the space, yes, we could just use 2 bits instead of
>> the 4. Or use the 4 bits and potentially have the app pass in up to 16
>> values. For the latter, I'm still very much in favor of keeping the app
>> interface super simple and just retaining the 4 life time types.
> 
> I don't think anyone cares too much about 4 bits.  Strictly speaking,
> the current implementation can't fit into 2 bits because it has 5 values
> if one includes "no ID".

Right, I ended up taking 3 bits. That allows 7 valid stream IDs, and one
for unknown.

> It is a bit of overhead to use 32 bits in most of the structs where this
> field is actually stored.  I suspect that using only 8 or 16 bits in the
> structs is better, and even if it doesn't find an existing hole ("pahole"
> is your friend here) it will allow something else to use the remaining
> space in the future.

I packed the i_stream into a hole, and the bio part is filling a hole
too. So while that's not completely free, it's better than taking more
space.

> For userspace, I think it is fine to define the WRITE_LIFE values as they
> are today, and most users will use them, but IMHO it makes sense to
> allow arbitrary IDs as they see fit, especially if the underlying hardware
> doesn't care much about the actual values.

I change the WRITE_LIFT values to just be 1, 2, 3, 4, but shifted up.
Might as well keep them as values, now we've changed the space in the
flags field.

I'll throw this out for more commenting soon. Meanwhile, I'd appreciate
if you could just take a look at the write-stream.2 branch:

http://git.kernel.dk/cgit/linux-block/log/?h=write-stream.2

as it should address all your concerns so far. Thanks! Until I post it,
I may rebase it... I'm going to throw the basic testing at it to ensure
it still works as designed before posting again.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 11/11] nvme: add support for streams and directives
  2017-06-13 17:15 ` [PATCH 11/11] nvme: add support for streams and directives Jens Axboe
  2017-06-13 19:47   ` Andreas Dilger
@ 2017-06-13 21:12   ` Andreas Dilger
  2017-06-13 21:18     ` Jens Axboe
  1 sibling, 1 reply; 40+ messages in thread
From: Andreas Dilger @ 2017-06-13 21:12 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-fsdevel, linux-block

[-- Attachment #1: Type: text/plain, Size: 8513 bytes --]

On Jun 13, 2017, at 11:15 AM, Jens Axboe <axboe@kernel.dk> wrote:
> This adds support for Directives in NVMe, particular for the Streams
> directive. We default to allocating 4 streams per name space, but
> it is configurable with the 'streams_per_ns' module option.
> 
> If a write stream is set in a write, flag is as such before
> sending it to the device.
> 
> Some debug stuff in this patch, dumping streams ID params when
> we load nvme.
> 
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
> drivers/nvme/host/core.c | 124 +++++++++++++++++++++++++++++++++++++++++++++++
> drivers/nvme/host/nvme.h |   1 +
> include/linux/nvme.h     |  48 ++++++++++++++++++
> 3 files changed, 173 insertions(+)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 903d5813023a..81225e7d4176 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -1073,6 +1086,109 @@ static int nvme_revalidate_disk(struct gendisk *disk)
> 	return 0;
> }
> 
> +static int nvme_enable_streams(struct nvme_ns *ns)
> +{
> +	struct nvme_command c;
> +
> +	memset(&c, 0, sizeof(c));
> +
> +	c.directive.opcode = nvme_admin_directive_send;
> +	c.directive.nsid = cpu_to_le32(ns->ns_id);
> +	c.directive.doper = NVME_DIR_SND_ID_OP_ENABLE;
> +	c.directive.dtype = NVME_DIR_IDENTIFY;
> +	c.directive.tdtype = NVME_DIR_STREAMS;
> +	c.directive.endir = NVME_DIR_ENDIR;
> +
> +	return nvme_submit_sync_cmd(ns->ctrl->admin_q, &c, NULL, 0);
> +}
> +
> +static int nvme_streams_params(struct nvme_ns *ns)
> +{
> +	struct nvme_ctrl *ctrl = ns->ctrl;
> +	struct streams_directive_params s;
> +	struct nvme_command c;
> +	int ret;
> +
> +	memset(&c, 0, sizeof(c));
> +	memset(&s, 0, sizeof(s));

Minor nit in a few places in this code - is it more efficient to initialize
these structs at declaration time instead of calling memset(), like:

	struct streams_directive_params s = {};
	struct nvme_command c = {};

or alternately initializing them fully at declaration time to avoid zeroing
fields that are immediately being set:

	struct nvme_command c =
	{
		.directive.opcode = nvme_admin_directive_recv;
		.directive.nsid = cpu_to_le32(ns->ns_id);
		.directive.numd = sizeof(s);
		.directive.doper = NVME_DIR_RCV_ST_OP_PARAM;
		.directive.dtype = NVME_DIR_STREAMS;
	};

Same for the other functions here.

Cheers, Andreas

> +
> +	c.directive.opcode = nvme_admin_directive_recv;
> +	c.directive.nsid = cpu_to_le32(ns->ns_id);
> +	c.directive.numd = sizeof(s);
> +	c.directive.doper = NVME_DIR_RCV_ST_OP_PARAM;
> +	c.directive.dtype = NVME_DIR_STREAMS;
> +
> +	ret = nvme_submit_sync_cmd(ctrl->admin_q, &c, &s, sizeof(s));
> +	if (ret)
> +		return ret;
> +
> +	s.msl = le16_to_cpu(s.msl);
> +	s.nssa = le16_to_cpu(s.nssa);
> +	s.nsso = le16_to_cpu(s.nsso);
> +	s.sws = le32_to_cpu(s.sws);
> +	s.sgs = le16_to_cpu(s.sgs);
> +	s.nsa = le16_to_cpu(s.nsa);
> +	s.nso = le16_to_cpu(s.nso);
> +
> +	dev_info(ctrl->device, "streams: msl=%u, nssa=%u, nsso=%u, sws=%u "
> +				"sgs=%u, nsa=%u, nso=%u\n", s.msl, s.nssa,
> +				s.nsso, s.sws, s.sgs, s.nsa, s.nso);
> +	return 0;
> +}
> +
> +static int nvme_streams_allocate(struct nvme_ns *ns, unsigned int streams)
> +{
> +	struct nvme_command c;
> +
> +	memset(&c, 0, sizeof(c));
> +
> +	c.directive.opcode = nvme_admin_directive_recv;
> +	c.directive.nsid = cpu_to_le32(ns->ns_id);
> +	c.directive.doper = NVME_DIR_RCV_ST_OP_RESOURCE;
> +	c.directive.dtype = NVME_DIR_STREAMS;
> +	c.directive.endir = streams;
> +
> +	return nvme_submit_sync_cmd(ns->ctrl->admin_q, &c, NULL, 0);
> +}
> +
> +static int nvme_streams_deallocate(struct nvme_ns *ns)
> +{
> +	struct nvme_command c;
> +
> +	memset(&c, 0, sizeof(c));
> +
> +	c.directive.opcode = nvme_admin_directive_send;
> +	c.directive.nsid = cpu_to_le32(ns->ns_id);
> +	c.directive.doper = NVME_DIR_SND_ST_OP_REL_RSC;
> +	c.directive.dtype = NVME_DIR_STREAMS;
> +
> +	return nvme_submit_sync_cmd(ns->ctrl->admin_q, &c, NULL, 0);
> +}
> +
> +static void nvme_config_streams(struct nvme_ns *ns)
> +{
> +	int ret;
> +
> +	ret = nvme_enable_streams(ns);
> +	if (ret)
> +		return;
> +
> +	ret = nvme_streams_params(ns);
> +	if (ret)
> +		return;
> +
> +	ret = nvme_streams_allocate(ns, streams_per_ns);
> +	if (ret)
> +		return;
> +
> +	ret = nvme_streams_params(ns);
> +	if (ret)
> +		return;
> +
> +	ns->streams = true;
> +	dev_info(ns->ctrl->device, "successfully enabled streams\n");
> +}
> +
> static char nvme_pr_type(enum pr_type type)
> {
> 	switch (type) {
> @@ -1606,6 +1722,9 @@ int nvme_init_identify(struct nvme_ctrl *ctrl)
> 	ctrl->sgls = le32_to_cpu(id->sgls);
> 	ctrl->kas = le16_to_cpu(id->kas);
> 
> +	if (ctrl->oacs & NVME_CTRL_OACS_DIRECTIVES)
> +		dev_info(ctrl->dev, "supports directives\n");
> +
> 	ctrl->npss = id->npss;
> 	prev_apsta = ctrl->apsta;
> 	if (ctrl->quirks & NVME_QUIRK_NO_APST) {
> @@ -2060,6 +2179,9 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
> 		goto out_free_id;
> 	}
> 
> +	if (ctrl->oacs & NVME_CTRL_OACS_DIRECTIVES)
> +		nvme_config_streams(ns);
> +
> 	disk = alloc_disk_node(0, node);
> 	if (!disk)
> 		goto out_free_id;
> @@ -2112,6 +2234,8 @@ static void nvme_ns_remove(struct nvme_ns *ns)
> 					&nvme_ns_attr_group);
> 		if (ns->ndev)
> 			nvme_nvm_unregister_sysfs(ns);
> +		if (ns->streams)
> +			nvme_streams_deallocate(ns);
> 		del_gendisk(ns->disk);
> 		blk_cleanup_queue(ns->queue);
> 	}
> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> index 9d6a070d4391..c2d8d23c90de 100644
> --- a/drivers/nvme/host/nvme.h
> +++ b/drivers/nvme/host/nvme.h
> @@ -195,6 +195,7 @@ struct nvme_ns {
> 	int lba_shift;
> 	u16 ms;
> 	bool ext;
> +	bool streams;
> 	u8 pi_type;
> 	unsigned long flags;
> 
> diff --git a/include/linux/nvme.h b/include/linux/nvme.h
> index b625bacf37ef..8b2f5b140134 100644
> --- a/include/linux/nvme.h
> +++ b/include/linux/nvme.h
> @@ -245,6 +245,7 @@ enum {
> 	NVME_CTRL_ONCS_WRITE_ZEROES		= 1 << 3,
> 	NVME_CTRL_VWC_PRESENT			= 1 << 0,
> 	NVME_CTRL_OACS_SEC_SUPP                 = 1 << 0,
> +	NVME_CTRL_OACS_DIRECTIVES		= 1 << 5,
> 	NVME_CTRL_OACS_DBBUF_SUPP		= 1 << 7,
> };
> 
> @@ -295,6 +296,19 @@ enum {
> };
> 
> enum {
> +	NVME_DIR_IDENTIFY		= 0x00,
> +	NVME_DIR_STREAMS		= 0x01,
> +	NVME_DIR_SND_ID_OP_ENABLE	= 0x01,
> +	NVME_DIR_SND_ST_OP_REL_ID	= 0x01,
> +	NVME_DIR_SND_ST_OP_REL_RSC	= 0x02,
> +	NVME_DIR_RCV_ID_OP_PARAM	= 0x01,
> +	NVME_DIR_RCV_ST_OP_PARAM	= 0x01,
> +	NVME_DIR_RCV_ST_OP_STATUS	= 0x02,
> +	NVME_DIR_RCV_ST_OP_RESOURCE	= 0x03,
> +	NVME_DIR_ENDIR			= 0x01,
> +};
> +
> +enum {
> 	NVME_NS_FEAT_THIN	= 1 << 0,
> 	NVME_NS_FLBAS_LBA_MASK	= 0xf,
> 	NVME_NS_FLBAS_META_EXT	= 0x10,
> @@ -535,6 +549,7 @@ enum {
> 	NVME_RW_PRINFO_PRCHK_APP	= 1 << 11,
> 	NVME_RW_PRINFO_PRCHK_GUARD	= 1 << 12,
> 	NVME_RW_PRINFO_PRACT		= 1 << 13,
> +	NVME_RW_DTYPE_STREAMS		= 1 << 4,
> };
> 
> struct nvme_dsm_cmd {
> @@ -604,6 +619,8 @@ enum nvme_admin_opcode {
> 	nvme_admin_download_fw		= 0x11,
> 	nvme_admin_ns_attach		= 0x15,
> 	nvme_admin_keep_alive		= 0x18,
> +	nvme_admin_directive_send	= 0x19,
> +	nvme_admin_directive_recv	= 0x1a,
> 	nvme_admin_dbbuf		= 0x7C,
> 	nvme_admin_format_nvm		= 0x80,
> 	nvme_admin_security_send	= 0x81,
> @@ -756,6 +773,24 @@ struct nvme_get_log_page_command {
> 	__u32			rsvd14[2];
> };
> 
> +struct nvme_directive_cmd {
> +	__u8			opcode;
> +	__u8			flags;
> +	__u16			command_id;
> +	__le32			nsid;
> +	__u64			rsvd2[2];
> +	union nvme_data_ptr	dptr;
> +	__le32			numd;
> +	__u8			doper;
> +	__u8			dtype;
> +	__le16			dspec;
> +	__u8			endir;
> +	__u8			tdtype;
> +	__u16			rsvd15;
> +
> +	__u32			rsvd16[3];
> +};
> +
> /*
>  * Fabrics subcommands.
>  */
> @@ -886,6 +921,18 @@ struct nvme_dbbuf {
> 	__u32			rsvd12[6];
> };
> 
> +struct streams_directive_params {
> +	__u16	msl;
> +	__u16	nssa;
> +	__u16	nsso;
> +	__u8	rsvd[10];
> +	__u32	sws;
> +	__u16	sgs;
> +	__u16	nsa;
> +	__u16	nso;
> +	__u8	rsvd2[6];
> +};
> +
> struct nvme_command {
> 	union {
> 		struct nvme_common_command common;
> @@ -906,6 +953,7 @@ struct nvme_command {
> 		struct nvmf_property_set_command prop_set;
> 		struct nvmf_property_get_command prop_get;
> 		struct nvme_dbbuf dbbuf;
> +		struct nvme_directive_cmd directive;
> 	};
> };
> 
> --
> 2.7.4
> 


Cheers, Andreas






[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 11/11] nvme: add support for streams and directives
  2017-06-13 21:12   ` Andreas Dilger
@ 2017-06-13 21:18     ` Jens Axboe
  0 siblings, 0 replies; 40+ messages in thread
From: Jens Axboe @ 2017-06-13 21:18 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: linux-fsdevel, linux-block

On 06/13/2017 03:12 PM, Andreas Dilger wrote:
>> +static int nvme_streams_params(struct nvme_ns *ns)
>> +{
>> +	struct nvme_ctrl *ctrl = ns->ctrl;
>> +	struct streams_directive_params s;
>> +	struct nvme_command c;
>> +	int ret;
>> +
>> +	memset(&c, 0, sizeof(c));
>> +	memset(&s, 0, sizeof(s));
> 
> Minor nit in a few places in this code - is it more efficient to initialize
> these structs at declaration time instead of calling memset(), like:
> 
> 	struct streams_directive_params s = {};
> 	struct nvme_command c = {};
> 
> or alternately initializing them fully at declaration time to avoid zeroing
> fields that are immediately being set:
> 
> 	struct nvme_command c =
> 	{
> 		.directive.opcode = nvme_admin_directive_recv;
> 		.directive.nsid = cpu_to_le32(ns->ns_id);
> 		.directive.numd = sizeof(s);
> 		.directive.doper = NVME_DIR_RCV_ST_OP_PARAM;
> 		.directive.dtype = NVME_DIR_STREAMS;
> 	};

I'm just following the style of the driver for that part.
 
-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/11] Add support for write life time hints
  2017-06-13 20:56           ` Jens Axboe
@ 2017-06-13 21:53             ` Andreas Dilger
  2017-06-13 22:12               ` Jens Axboe
  0 siblings, 1 reply; 40+ messages in thread
From: Andreas Dilger @ 2017-06-13 21:53 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-fsdevel, linux-block, Michael Kerrisk-manpages

[-- Attachment #1: Type: text/plain, Size: 6770 bytes --]

On Jun 13, 2017, at 2:56 PM, Jens Axboe <axboe@kernel.dk> wrote:
> 
> On 06/13/2017 02:45 PM, Andreas Dilger wrote:
>>>>>> I thought that one of the major attractions of numeric stream IDs was
>>>>>> that they had no semantic meanings, just "N is similar to N" and "M is
>>>>>> similar to M", and it is up to userspace to define what these mean?
>>>>>> 
>>>>>> That allows userspace to use the IDs for lifetimes (as above), but
>>>>>> also/instead use them for allocation sizes, different applications,
>>>>>> different users, etc.
>>>>> 
>>>>> Right, that is indeed the intent. But we have to attach some naming
>>>>> to them. Userspace could in theory use these totally randomly, and
>>>>> things like NVMe would not care. But the semantic meaning of "short"
>>>>> vs "long" is important on caching infrastructure where you might
>>>>> want to use the hint for data placement.
>>>>> 
>>>>> I think the important part here is that no absolute meaning is
>>>>> attached to them, only relative.
>>>> 
>>>> In both IOCB_WRITE_LIFE_* and RWF_WRITE_LIFE_* this is consuming 4 bits of
>>>> space (which is itself fine) for only 4 different stream IDs.  Why not just
>>>> shift a 4-bit arbitrary stream ID to the appropriate offset in those fields,
>>>> rather than treating them as 4 individual bits and allowing only one of
>>>> them to be passed down the stack at a time?
>>> 
>>> I did think about that, and I'm a bit split on it. It turns a bit mask
>>> into a hybrid beast, with bits and sets of bits for values.
>> 
>> I don't think that is too confusing for anyone.
> 
> Agree, especially because it's a win on multiple fronts. I've made that
> change (see the write-stream.2 branch).
> 
>>> For utilization of the space, yes, we could just use 2 bits instead of
>>> the 4. Or use the 4 bits and potentially have the app pass in up to 16
>>> values. For the latter, I'm still very much in favor of keeping the app
>>> interface super simple and just retaining the 4 life time types.
>> 
>> I don't think anyone cares too much about 4 bits.  Strictly speaking,
>> the current implementation can't fit into 2 bits because it has 5 values
>> if one includes "no ID".
> 
> Right, I ended up taking 3 bits. That allows 7 valid stream IDs, and one
> for unknown.

I wouldn't mind if that was 4 bits, especially for the userspace interface.
The internal bits are easily changed, but it won't be possible to get more
contiguous bits in the RWF_* flags if a later one is used for something else.


I guess that implies (before Michael Kerrisk asks :-) that this also needs
a pwritev2(2) man page update to describe these flags.  I'd add something like:

    The RWF_WRITE_LIFE_{SHORT..EXTREME} flags are recommended for
    identifying stream IDs of different write classes, so that multiple
    users or applications running on a single storage back-end can
    aggregate their IO patterns in a consistent manner.  However, there
    are no functional semantics implied by these flags, and different
    IO classes can use the stream IDs in arbitrary ways so long as they
    are used consistently.

>> It is a bit of overhead to use 32 bits in most of the structs where this
>> field is actually stored.  I suspect that using only 8 or 16 bits in the
>> structs is better, and even if it doesn't find an existing hole ("pahole"
>> is your friend here) it will allow something else to use the remaining
>> space in the future.
> 
> I packed the i_stream into a hole, and the bio part is filling a hole
> too. So while that's not completely free, it's better than taking more
> space.
> 
>> For userspace, I think it is fine to define the WRITE_LIFE values as they
>> are today, and most users will use them, but IMHO it makes sense to
>> allow arbitrary IDs as they see fit, especially if the underlying hardware
>> doesn't care much about the actual values.
> 
> I change the WRITE_LIFT values to just be 1, 2, 3, 4, but shifted up.
> Might as well keep them as values, now we've changed the space in the
> flags field.
> 
> I'll throw this out for more commenting soon. Meanwhile, I'd appreciate
> if you could just take a look at the write-stream.2 branch:
> 
> http://git.kernel.dk/cgit/linux-block/log/?h=write-stream.2
> 
> as it should address all your concerns so far. Thanks! Until I post it,
> I may rebase it... I'm going to throw the basic testing at it to ensure
> it still works as designed before posting again.

>  #define RWF_DSYNC	0x00000002 /* per-IO O_DSYNC */
>  #define RWF_SYNC	0x00000004 /* per-IO O_SYNC */
> 
>  /*
> + * Data life time write flags, steal 3 bits for that
> + */
> +#define RWF_WRITE_LIFE_SHIFT	3
> +#define RWF_WRITE_LIFE_MASK	((1 << 3) | (1 << 4) | (1 << 5))
> +#define RWF_WRITE_LIFE_SHORT	(1 << RWF_WRITE_LIFT_SHIFT)
> +#define RWF_WRITE_LIFE_MEDIUM	(2 << RWF_WRITE_LIFT_SHIFT)
> +#define RWF_WRITE_LIFE_LONG	(3 << RWF_WRITE_LIFT_SHIFT)
> +#define RWF_WRITE_LIFE_EXTREME	(4 << RWF_WRITE_LIFT_SHIFT)

I'd recommend to use SHIFT=4 and then define the mask as:

 #define RWF_SYNC		0x00000004 /* per-IO O_SYNC */
+#define RWF_WRITE_LIFE_MASK	0x000000f0 /* 4 bits stream IDs */


so that it is in the same format as the other RWF_* masks here.  The
actual SHORT, ..., EXTREME definitions can stay the same if you want.
Otherwise it is error-prone to map ((1 << 3) | (1 << 4) | (1 << 5))
into the RWF_ space to determine which is the next flag available.

Also, leaving one bit free before RWF_WRITE_LIFE_MASK gives us room
for another flag to be added before the mask until we get cut off on
the high end if others think that 4 bits isn't enough stream IDs.
It might even be prudent to move this up to 0x000f0000 (SHIFT=16) to
give us a bit more breathing room in that regard, given that this is
essentially going to become a permanent userspace API?

The rest of the patches look fine.  You might consider to merge patches
#2 and #3, since the WRITE_LIFE_* constants are only used by the blk_mq
patch (inode_init_always() could just use "0" for initialization).  You
may also want to move the NVMe patch before the ext4/xfs/btrfs patches,
since it isn't strictly dependent on them, and allows some testing on
real hardware once the O_DIRECT patch is available.

The other thing that might be interesting to include in the NVMe patch
is what kind of hardware support is available for the Streams Directive
(i.e. in existing hardware, draft spec only, how many stream IDs are
available, etc.) and what kind of performance benefit this provides
(even if not quantitative).  As it stands now, unless you already know
what this patch series does, it doesn't explain very much.

Cheers, Andreas






[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/11] Add support for write life time hints
  2017-06-13 21:53             ` Andreas Dilger
@ 2017-06-13 22:12               ` Jens Axboe
  0 siblings, 0 replies; 40+ messages in thread
From: Jens Axboe @ 2017-06-13 22:12 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: linux-fsdevel, linux-block, Michael Kerrisk-manpages

On 06/13/2017 03:53 PM, Andreas Dilger wrote:
>>>> For utilization of the space, yes, we could just use 2 bits instead of
>>>> the 4. Or use the 4 bits and potentially have the app pass in up to 16
>>>> values. For the latter, I'm still very much in favor of keeping the app
>>>> interface super simple and just retaining the 4 life time types.
>>>
>>> I don't think anyone cares too much about 4 bits.  Strictly speaking,
>>> the current implementation can't fit into 2 bits because it has 5 values
>>> if one includes "no ID".
>>
>> Right, I ended up taking 3 bits. That allows 7 valid stream IDs, and one
>> for unknown.
> 
> I wouldn't mind if that was 4 bits, especially for the userspace
> interface.  The internal bits are easily changed, but it won't be
> possible to get more contiguous bits in the RWF_* flags if a later one
> is used for something else.

Sure, find with me, simple change too.

> I guess that implies (before Michael Kerrisk asks :-) that this also
> needs a pwritev2(2) man page update to describe these flags.  I'd add
> something like:
> 
>     The RWF_WRITE_LIFE_{SHORT..EXTREME} flags are recommended for
>     identifying stream IDs of different write classes, so that multiple
>     users or applications running on a single storage back-end can
>     aggregate their IO patterns in a consistent manner.  However, there
>     are no functional semantics implied by these flags, and different
>     IO classes can use the stream IDs in arbitrary ways so long as they
>     are used consistently.

Looks good to me. Maybe add something about the kernel being free to
ignore them as well, if the underlying storage backend doesn't support
it.

>>> It is a bit of overhead to use 32 bits in most of the structs where this
>>> field is actually stored.  I suspect that using only 8 or 16 bits in the
>>> structs is better, and even if it doesn't find an existing hole ("pahole"
>>> is your friend here) it will allow something else to use the remaining
>>> space in the future.
>>
>> I packed the i_stream into a hole, and the bio part is filling a hole
>> too. So while that's not completely free, it's better than taking more
>> space.
>>
>>> For userspace, I think it is fine to define the WRITE_LIFE values as they
>>> are today, and most users will use them, but IMHO it makes sense to
>>> allow arbitrary IDs as they see fit, especially if the underlying hardware
>>> doesn't care much about the actual values.
>>
>> I change the WRITE_LIFT values to just be 1, 2, 3, 4, but shifted up.
>> Might as well keep them as values, now we've changed the space in the
>> flags field.
>>
>> I'll throw this out for more commenting soon. Meanwhile, I'd appreciate
>> if you could just take a look at the write-stream.2 branch:
>>
>> http://git.kernel.dk/cgit/linux-block/log/?h=write-stream.2
>>
>> as it should address all your concerns so far. Thanks! Until I post it,
>> I may rebase it... I'm going to throw the basic testing at it to ensure
>> it still works as designed before posting again.
> 
>>  #define RWF_DSYNC	0x00000002 /* per-IO O_DSYNC */
>>  #define RWF_SYNC	0x00000004 /* per-IO O_SYNC */
>>
>>  /*
>> + * Data life time write flags, steal 3 bits for that
>> + */
>> +#define RWF_WRITE_LIFE_SHIFT	3
>> +#define RWF_WRITE_LIFE_MASK	((1 << 3) | (1 << 4) | (1 << 5))
>> +#define RWF_WRITE_LIFE_SHORT	(1 << RWF_WRITE_LIFT_SHIFT)
>> +#define RWF_WRITE_LIFE_MEDIUM	(2 << RWF_WRITE_LIFT_SHIFT)
>> +#define RWF_WRITE_LIFE_LONG	(3 << RWF_WRITE_LIFT_SHIFT)
>> +#define RWF_WRITE_LIFE_EXTREME	(4 << RWF_WRITE_LIFT_SHIFT)
> 
> I'd recommend to use SHIFT=4 and then define the mask as:
> 
>  #define RWF_SYNC		0x00000004 /* per-IO O_SYNC */
> +#define RWF_WRITE_LIFE_MASK	0x000000f0 /* 4 bits stream IDs */
> 
> 
> so that it is in the same format as the other RWF_* masks here.  The
> actual SHORT, ..., EXTREME definitions can stay the same if you want.
> Otherwise it is error-prone to map ((1 << 3) | (1 << 4) | (1 << 5))
> into the RWF_ space to determine which is the next flag available.

OK, fine with me.

> Also, leaving one bit free before RWF_WRITE_LIFE_MASK gives us room
> for another flag to be added before the mask until we get cut off on
> the high end if others think that 4 bits isn't enough stream IDs.
> It might even be prudent to move this up to 0x000f0000 (SHIFT=16) to
> give us a bit more breathing room in that regard, given that this is
> essentially going to become a permanent userspace API?

Not so sure on that one. The whole point was not to make this overly
complicated, or add a ton of hints that nobody would then know how to
use properly. I guess it's not a big concern the way it's setup now,
where we don't tie any particular meaning to them. Users are free to use
the different bits anyway they see fit and it will work, as long as they
are consistent in how they use it.

But for now, I'd prefer to keep it at 4 bits.

> The rest of the patches look fine.  You might consider to merge patches
> #2 and #3, since the WRITE_LIFE_* constants are only used by the blk_mq
> patch (inode_init_always() could just use "0" for initialization).  You
> may also want to move the NVMe patch before the ext4/xfs/btrfs patches,
> since it isn't strictly dependent on them, and allows some testing on
> real hardware once the O_DIRECT patch is available.

OK, I can fold those two and reorder a bit. There's no real dependency
between the later half of the series, really.

> The other thing that might be interesting to include in the NVMe patch
> is what kind of hardware support is available for the Streams Directive
> (i.e. in existing hardware, draft spec only, how many stream IDs are
> available, etc.) and what kind of performance benefit this provides
> (even if not quantitative).  As it stands now, unless you already know
> what this patch series does, it doesn't explain very much.

Yes, that's useful info. On device support, I've got samples from
various vendors, but I don't know what is public yet. So I can't really
speak to number of streams available. This is ratified in the NVMe 1.3
spec, however, so that part is public.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH 06/11] fs: add O_DIRECT support for sending down bio stream information
  2017-06-14 19:05 [PATCHSET v3] " Jens Axboe
@ 2017-06-14 19:05 ` Jens Axboe
  0 siblings, 0 replies; 40+ messages in thread
From: Jens Axboe @ 2017-06-14 19:05 UTC (permalink / raw)
  To: linux-fsdevel, linux-block; +Cc: adilger, hch, martin.petersen, Jens Axboe

Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/block_dev.c | 2 ++
 fs/direct-io.c | 2 ++
 fs/iomap.c     | 1 +
 3 files changed, 5 insertions(+)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 519599dddd36..31ba4a8f0a28 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -239,6 +239,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
 			should_dirty = true;
 	} else {
 		bio.bi_opf = dio_bio_write_op(iocb);
+		bio_set_streamid(&bio, iocb_streamid(iocb));
 		task_io_account_write(ret);
 	}
 
@@ -374,6 +375,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
 				bio_set_pages_dirty(bio);
 		} else {
 			bio->bi_opf = dio_bio_write_op(iocb);
+			bio_set_streamid(bio, iocb_streamid(iocb));
 			task_io_account_write(bio->bi_iter.bi_size);
 		}
 
diff --git a/fs/direct-io.c b/fs/direct-io.c
index a04ebea77de8..a770e82bb9f8 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -386,6 +386,8 @@ dio_bio_alloc(struct dio *dio, struct dio_submit *sdio,
 	else
 		bio->bi_end_io = dio_bio_end_io;
 
+	bio_set_streamid(bio, iocb_streamid(dio->iocb));
+
 	sdio->bio = bio;
 	sdio->logical_offset_in_bio = sdio->cur_page_fs_offset;
 }
diff --git a/fs/iomap.c b/fs/iomap.c
index 4b10892967a5..fa7d29632fbc 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -804,6 +804,7 @@ iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
 
 		if (dio->flags & IOMAP_DIO_WRITE) {
 			bio_set_op_attrs(bio, REQ_OP_WRITE, REQ_SYNC | REQ_IDLE);
+			bio_set_streamid(bio, inode_streamid(inode));
 			task_io_account_write(bio->bi_iter.bi_size);
 		} else {
 			bio_set_op_attrs(bio, REQ_OP_READ, 0);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 06/11] fs: add O_DIRECT support for sending down bio stream information
  2017-06-15  3:45 [PATCHSET v4] Add support for write life time hints Jens Axboe
@ 2017-06-15  3:45 ` Jens Axboe
  0 siblings, 0 replies; 40+ messages in thread
From: Jens Axboe @ 2017-06-15  3:45 UTC (permalink / raw)
  To: linux-fsdevel, linux-block; +Cc: adilger, hch, martin.petersen, Jens Axboe

Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/block_dev.c | 2 ++
 fs/direct-io.c | 2 ++
 fs/iomap.c     | 1 +
 3 files changed, 5 insertions(+)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 519599dddd36..de4301168710 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -239,6 +239,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
 			should_dirty = true;
 	} else {
 		bio.bi_opf = dio_bio_write_op(iocb);
+		bio.bi_opf |= bio_op_write_hint(iocb_write_hint(iocb));
 		task_io_account_write(ret);
 	}
 
@@ -374,6 +375,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
 				bio_set_pages_dirty(bio);
 		} else {
 			bio->bi_opf = dio_bio_write_op(iocb);
+			bio->bi_opf |= bio_op_write_hint(iocb_write_hint(iocb));
 			task_io_account_write(bio->bi_iter.bi_size);
 		}
 
diff --git a/fs/direct-io.c b/fs/direct-io.c
index a04ebea77de8..98874478ec8a 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -386,6 +386,8 @@ dio_bio_alloc(struct dio *dio, struct dio_submit *sdio,
 	else
 		bio->bi_end_io = dio_bio_end_io;
 
+	bio->bi_opf |= bio_op_write_hint(iocb_write_hint(dio->iocb));
+
 	sdio->bio = bio;
 	sdio->logical_offset_in_bio = sdio->cur_page_fs_offset;
 }
diff --git a/fs/iomap.c b/fs/iomap.c
index 4b10892967a5..7e18e760e421 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -804,6 +804,7 @@ iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
 
 		if (dio->flags & IOMAP_DIO_WRITE) {
 			bio_set_op_attrs(bio, REQ_OP_WRITE, REQ_SYNC | REQ_IDLE);
+			bio->bi_opf |= bio_op_write_hint(inode_write_hint(inode));
 			task_io_account_write(bio->bi_iter.bi_size);
 		} else {
 			bio_set_op_attrs(bio, REQ_OP_READ, 0);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2017-06-15  3:45 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2017-06-13 17:15 [PATCH 0/11] Add support for write life time hints Jens Axboe
2017-06-13 17:15 ` [PATCH 01/11] block: add support for carrying stream information in a bio Jens Axboe
2017-06-13 19:01   ` Andreas Dilger
2017-06-13 17:15 ` [PATCH 02/11] block: add definition for support data write life times Jens Axboe
2017-06-13 19:07   ` Andreas Dilger
2017-06-13 20:10     ` Jens Axboe
2017-06-13 17:15 ` [PATCH 03/11] blk-mq: expose stream write stats through debugfs Jens Axboe
2017-06-13 19:21   ` Andreas Dilger
2017-06-13 20:11     ` Jens Axboe
2017-06-13 17:15 ` [PATCH 04/11] fs: add support for an inode to carry stream related data Jens Axboe
2017-06-13 19:24   ` [PATCH 04/11] " Andreas Dilger
2017-06-13 20:14     ` Jens Axboe
2017-06-13 17:15 ` [PATCH 05/11] fs: add support for allowing applications to pass in write life time hints Jens Axboe
2017-06-13 19:36   ` [PATCH 05/11] " Andreas Dilger
2017-06-13 20:21     ` Jens Axboe
2017-06-13 17:15 ` [PATCH 06/11] fs: add O_DIRECT support for sending down bio stream information Jens Axboe
2017-06-13 19:38   ` Andreas Dilger
2017-06-13 17:15 ` [PATCH 07/11] fs: add support for buffered writeback to pass down " Jens Axboe
2017-06-13 19:39   ` [PATCH 07/11] " Andreas Dilger
2017-06-13 17:15 ` [PATCH 08/11] ext4: add support for passing in stream information for buffered writes Jens Axboe
2017-06-13 19:40   ` Andreas Dilger
2017-06-13 17:15 ` [PATCH 09/11] xfs: " Jens Axboe
2017-06-13 19:40   ` Andreas Dilger
2017-06-13 17:15 ` [PATCH 10/11] btrfs: " Jens Axboe
2017-06-13 19:41   ` Andreas Dilger
2017-06-13 17:15 ` [PATCH 11/11] nvme: add support for streams and directives Jens Axboe
2017-06-13 19:47   ` Andreas Dilger
2017-06-13 20:25     ` Jens Axboe
2017-06-13 21:12   ` Andreas Dilger
2017-06-13 21:18     ` Jens Axboe
2017-06-13 18:04 ` [PATCH 0/11] Add support for write life time hints Andreas Dilger
2017-06-13 18:26   ` Jens Axboe
2017-06-13 19:21     ` Andreas Dilger
2017-06-13 20:13       ` Jens Axboe
2017-06-13 20:45         ` Andreas Dilger
2017-06-13 20:56           ` Jens Axboe
2017-06-13 21:53             ` Andreas Dilger
2017-06-13 22:12               ` Jens Axboe
  -- strict thread matches above, loose matches on Subject: below --
2017-06-14 19:05 [PATCHSET v3] " Jens Axboe
2017-06-14 19:05 ` [PATCH 06/11] fs: add O_DIRECT support for sending down bio stream information Jens Axboe
2017-06-15  3:45 [PATCHSET v4] Add support for write life time hints Jens Axboe
2017-06-15  3:45 ` [PATCH 06/11] fs: add O_DIRECT support for sending down bio stream information Jens Axboe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).