* [PATCH 01/11] fs: add support for an inode to carry write hint related data
From: Jens Axboe @ 2017-06-16 17:24 UTC
To: linux-fsdevel, linux-block; +Cc: adilger, hch, martin.petersen, Jens Axboe
No functional changes in this patch, just in preparation for
allowing applications to pass in hints about data life times
for writes. Set aside 3 bits for carrying hint information
in the inode flags.
Adds the public hints as well, which are:
WRITE_LIFE_NONE No hints about write life time
WRITE_LIFE_SHORT Data written has a short life time
WRITE_LIFE_MEDIUM Data written has a medium life time
WRITE_LIFE_LONG Data written has a long life time
WRITE_LIFE_EXTREME Data written has an extremely long life time
Helpers are defined to store these values in flags, by passing in
the shift that's appropriate for the given use case.
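As a quick illustration (not part of the patch), here is how a hint
round-trips through those helpers with the inode shift defined below;
the three bits land in i_flags at bits 14..16:

        /* illustrative only: encode a hint, then decode it again */
        unsigned int flags = write_hint_to_mask(WRITE_LIFE_MEDIUM,
                                                S_WRITE_LIFE_SHIFT);
        /* flags == 2 << 14 == 0x8000, inside S_WRITE_LIFE_MASK (0x1c000) */
        enum rw_hint hint = mask_to_write_hint(flags, S_WRITE_LIFE_SHIFT);
        /* hint == WRITE_LIFE_MEDIUM */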
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
fs/inode.c | 11 +++++++++++
include/linux/fs.h | 29 +++++++++++++++++++++++++++++
include/uapi/linux/fs.h | 13 +++++++++++++
3 files changed, 53 insertions(+)
diff --git a/fs/inode.c b/fs/inode.c
index db5914783a71..defb015a2c6d 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -2120,3 +2120,14 @@ struct timespec current_time(struct inode *inode)
return timespec_trunc(now, inode->i_sb->s_time_gran);
}
EXPORT_SYMBOL(current_time);
+
+void inode_set_write_hint(struct inode *inode, enum rw_hint hint)
+{
+ unsigned int flags = write_hint_to_mask(hint, S_WRITE_LIFE_SHIFT);
+
+ if (flags != mask_to_write_hint(inode->i_flags, S_WRITE_LIFE_SHIFT)) {
+ inode_lock(inode);
+ inode_set_flags(inode, flags, S_WRITE_LIFE_MASK);
+ inode_unlock(inode);
+ }
+}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 803e5a9b2654..472c83156606 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1828,6 +1828,14 @@ struct super_operations {
#endif
/*
+ * Expected life time hint of a write for this inode. This uses the
* WRITE_LIFE_* encoding; we just need to define the shift. We need
+ * 3 bits for this. Next S_* value is 131072, bit 17.
+ */
+#define S_WRITE_LIFE_MASK 0x1c000 /* bits 14..16 */
+#define S_WRITE_LIFE_SHIFT 14 /* 16384, next bit */
+
+/*
* Note that nosuid etc flags are inode-specific: setting some file-system
* flags just means all the inodes inherit those flags by default. It might be
* possible to override it selectively if you really wanted to with some
@@ -1873,6 +1881,26 @@ static inline bool HAS_UNMAPPED_ID(struct inode *inode)
return !uid_valid(inode->i_uid) || !gid_valid(inode->i_gid);
}
+static inline unsigned int write_hint_to_mask(enum rw_hint hint,
+ unsigned int shift)
+{
+ return hint << shift;
+}
+
+static inline enum rw_hint mask_to_write_hint(unsigned int mask,
+ unsigned int shift)
+{
+ return (mask >> shift) & 0x7;
+}
+
+static inline unsigned int inode_write_hint(struct inode *inode)
+{
+ if (inode)
+ return mask_to_write_hint(inode->i_flags, S_WRITE_LIFE_SHIFT);
+
+ return 0;
+}
+
/*
* Inode state bits. Protected by inode->i_lock
*
@@ -2757,6 +2785,7 @@ extern struct inode *new_inode(struct super_block *sb);
extern void free_inode_nonrcu(struct inode *inode);
extern int should_remove_suid(struct dentry *);
extern int file_remove_privs(struct file *);
+extern void inode_set_write_hint(struct inode *inode, enum rw_hint hint);
extern void __insert_inode_hash(struct inode *, unsigned long hashval);
static inline void insert_inode_hash(struct inode *inode)
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 24e61a54feaa..8fb3b5a6e1ec 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -356,6 +356,19 @@ struct fscrypt_key {
#define SYNC_FILE_RANGE_WRITE 2
#define SYNC_FILE_RANGE_WAIT_AFTER 4
+/*
+ * Write life time hint values.
+ */
+enum rw_hint {
+ WRITE_LIFE_NONE = 0,
+ WRITE_LIFE_SHORT,
+ WRITE_LIFE_MEDIUM,
+ WRITE_LIFE_LONG,
+ WRITE_LIFE_EXTREME,
+};
+
+#define RW_HINT_MASK 0x7 /* 3 bits */
+
/* flags for preadv2/pwritev2: */
#define RWF_HIPRI 0x00000001 /* high priority request, poll if possible */
#define RWF_DSYNC 0x00000002 /* per-IO O_DSYNC */
--
2.7.4
* [PATCH 02/11] block: add support for write hints in a bio
From: Jens Axboe @ 2017-06-16 17:24 UTC
To: linux-fsdevel, linux-block; +Cc: adilger, hch, martin.petersen, Jens Axboe
No functional changes in this patch, we just set aside 3 bits
in the bio/request flags, which can be used to hold a WRITE_LIFE_*
life time hint.
Ensure that we don't merge requests that have different life time
hints assigned to them.
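As a quick sketch of the encoding (illustrative, not part of the
patch), a hint stored in bi_opf can be recovered with the mask:

        static void tag_bio(struct bio *bio)
        {
                unsigned int hint;

                /* store the hint in the 3 flag bits set aside below */
                bio->bi_opf |= write_hint_to_opf(WRITE_LIFE_MEDIUM);
                hint = (bio->bi_opf & REQ_WRITE_LIFE_MASK) >>
                                __REQ_WRITE_HINT_SHIFT;
                /* hint == WRITE_LIFE_MEDIUM */
        }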
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
block/blk-merge.c | 16 ++++++++++++++++
include/linux/blk_types.h | 15 +++++++++++++++
2 files changed, 31 insertions(+)
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 3990ae406341..7d299df3b12b 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -693,6 +693,14 @@ static struct request *attempt_merge(struct request_queue *q,
return NULL;
/*
+ * Don't allow merge of different streams, or for a stream with
+ * non-stream IO.
+ */
+ if ((req->cmd_flags & REQ_WRITE_LIFE_MASK) !=
+ (next->cmd_flags & REQ_WRITE_LIFE_MASK))
+ return NULL;
+
+ /*
* If we are allowed to merge, then append bio list
* from next to rq and release next. merge_requests_fn
* will have updated segment counts, update sector
@@ -811,6 +819,14 @@ bool blk_rq_merge_ok(struct request *rq, struct bio *bio)
!blk_write_same_mergeable(rq->bio, bio))
return false;
+ /*
+ * Don't allow merge of different streams, or for a stream with
+ * non-stream IO.
+ */
+ if ((rq->cmd_flags & REQ_WRITE_LIFE_MASK) !=
+ (bio->bi_opf & REQ_WRITE_LIFE_MASK))
+ return false;
+
return true;
}
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 61339bc44400..0dd8d569d52f 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -7,6 +7,7 @@
#include <linux/types.h>
#include <linux/bvec.h>
+#include <linux/fs.h>
struct bio_set;
struct bio;
@@ -201,6 +202,9 @@ enum req_flag_bits {
__REQ_PREFLUSH, /* request for cache flush */
__REQ_RAHEAD, /* read ahead, can fail anytime */
__REQ_BACKGROUND, /* background IO */
+ __REQ_WRITE_HINT_SHIFT, /* 3 bits for life time hint */
+ __REQ_WRITE_HINT_PAD1,
+ __REQ_WRITE_HINT_PAD2,
/* command specific flags for REQ_OP_WRITE_ZEROES: */
__REQ_NOUNMAP, /* do not free blocks when zeroing */
@@ -221,6 +225,12 @@ enum req_flag_bits {
#define REQ_PREFLUSH (1ULL << __REQ_PREFLUSH)
#define REQ_RAHEAD (1ULL << __REQ_RAHEAD)
#define REQ_BACKGROUND (1ULL << __REQ_BACKGROUND)
+#define REQ_WRITE_SHORT (WRITE_LIFE_SHORT << __REQ_WRITE_HINT_SHIFT)
+#define REQ_WRITE_MEDIUM (WRITE_LIFE_MEDIUM << __REQ_WRITE_HINT_SHIFT)
+#define REQ_WRITE_LONG (WRITE_LIFE_LONG << __REQ_WRITE_HINT_SHIFT)
+#define REQ_WRITE_EXTREME (WRITE_LIFE_EXTREME << __REQ_WRITE_HINT_SHIFT)
+
+#define REQ_WRITE_LIFE_MASK (0x7 << __REQ_WRITE_HINT_SHIFT)
#define REQ_NOUNMAP (1ULL << __REQ_NOUNMAP)
@@ -312,4 +322,9 @@ struct blk_rq_stat {
u64 batch;
};
+static inline unsigned int write_hint_to_opf(enum rw_hint hint)
+{
+ return hint << __REQ_WRITE_HINT_SHIFT;
+}
+
#endif /* __LINUX_BLK_TYPES_H */
--
2.7.4
* [PATCH 03/11] blk-mq: expose stream write hints through debugfs
From: Jens Axboe @ 2017-06-16 17:24 UTC
To: linux-fsdevel, linux-block; +Cc: adilger, hch, martin.petersen, Jens Axboe
Useful to verify that things are working the way they should.
Reading the file will return the number of 512-byte sectors written with each
write hint. Writing the file will reset the statistics. No care
is taken to ensure that we don't race on updates.
Drivers will write to q->write_hints[] if they handle a given
write hint.
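As an illustrative example (made-up values; the file sits with the
other blk-mq debugfs attributes, typically under
/sys/kernel/debug/block/<disk>/), a read could return:

        hint0: 13024
        hint1: 7842
        hint2: 5120
        hint3: 0
        hint4: 0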
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
block/blk-mq-debugfs.c | 24 ++++++++++++++++++++++++
include/linux/blkdev.h | 3 +++
2 files changed, 27 insertions(+)
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 803aed4d7221..b069a76f7fc7 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -133,6 +133,29 @@ static void print_stat(struct seq_file *m, struct blk_rq_stat *stat)
}
}
+static int queue_write_hint_show(void *data, struct seq_file *m)
+{
+ struct request_queue *q = data;
+ int i;
+
+ for (i = 0; i < BLK_MAX_WRITE_HINTS; i++)
+ seq_printf(m, "hint%d: %llu\n", i, q->write_hints[i]);
+
+ return 0;
+}
+
+static ssize_t queue_write_hint_store(void *data, const char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ struct request_queue *q = data;
+ int i;
+
+ for (i = 0; i < BLK_MAX_WRITE_HINTS; i++)
+ q->write_hints[i] = 0;
+
+ return count;
+}
+
static int queue_poll_stat_show(void *data, struct seq_file *m)
{
struct request_queue *q = data;
@@ -656,6 +679,7 @@ const struct file_operations blk_mq_debugfs_fops = {
static const struct blk_mq_debugfs_attr blk_mq_debugfs_queue_attrs[] = {
{"poll_stat", 0400, queue_poll_stat_show},
{"state", 0600, queue_state_show, queue_state_write},
+ {"write_hints", 0600, queue_write_hint_show, queue_write_hint_store},
{},
};
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index b74a3edcb3da..ab36068f6129 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -588,6 +588,9 @@ struct request_queue {
void *rq_alloc_data;
struct work_struct release_work;
+
+#define BLK_MAX_WRITE_HINTS 5
+ u64 write_hints[BLK_MAX_WRITE_HINTS];
};
#define QUEUE_FLAG_QUEUED 1 /* uses generic tag queueing */
--
2.7.4
* [PATCH 04/11] fs: add support for allowing applications to pass in write life time hints
From: Jens Axboe @ 2017-06-16 17:24 UTC
To: linux-fsdevel, linux-block; +Cc: adilger, hch, martin.petersen, Jens Axboe
Add four flags for the pwritev2(2) system call, allowing an application
to give the kernel a hint about what on-media life times can be
expected from a given write.
The intent is for these values to be relative to each other; no
absolute meaning should be attached to these flag names.
Set aside 3 bits in the iocb flags structure to carry this information
over from the pwritev2 RWF_WRITE_LIFE_* flags.
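For reference, a minimal userspace sketch of passing one of the new
flags; it assumes a libc that exposes pwritev2() and the
RWF_WRITE_LIFE_* definitions from the patched uapi header:

        #define _GNU_SOURCE
        #include <sys/uio.h>

        /* hint that everything written here is expected to be short-lived */
        static ssize_t write_short_lived(int fd, const void *buf, size_t len,
                                         off_t off)
        {
                struct iovec iov = {
                        .iov_base = (void *) buf,
                        .iov_len = len,
                };

                /* RWF_WRITE_LIFE_SHORT == WRITE_LIFE_SHORT << 4 == 0x10 */
                return pwritev2(fd, &iov, 1, off, RWF_WRITE_LIFE_SHORT);
        }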
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
fs/read_write.c | 12 +++++++++++-
include/linux/fs.h | 12 ++++++++++++
include/uapi/linux/fs.h | 10 ++++++++++
3 files changed, 33 insertions(+), 1 deletion(-)
diff --git a/fs/read_write.c b/fs/read_write.c
index 19d4d88fa285..975fe1d46a59 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -675,10 +675,11 @@ EXPORT_SYMBOL(iov_shorten);
static ssize_t do_iter_readv_writev(struct file *filp, struct iov_iter *iter,
loff_t *ppos, int type, int flags)
{
+ struct inode *inode = file_inode(filp);
struct kiocb kiocb;
ssize_t ret;
- if (flags & ~(RWF_HIPRI | RWF_DSYNC | RWF_SYNC))
+ if (flags & ~(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_WRITE_LIFE_MASK))
return -EOPNOTSUPP;
init_sync_kiocb(&kiocb, filp);
@@ -688,6 +689,15 @@ static ssize_t do_iter_readv_writev(struct file *filp, struct iov_iter *iter,
kiocb.ki_flags |= IOCB_DSYNC;
if (flags & RWF_SYNC)
kiocb.ki_flags |= (IOCB_DSYNC | IOCB_SYNC);
+ if ((flags & RWF_WRITE_LIFE_MASK) ||
+ mask_to_write_hint(inode->i_flags, S_WRITE_LIFE_SHIFT)) {
+ enum rw_hint hint;
+
+ hint = mask_to_write_hint(flags, RWF_WRITE_LIFE_SHIFT);
+
+ inode_set_write_hint(inode, hint);
+ kiocb.ki_flags |= write_hint_to_mask(hint, IOCB_WRITE_LIFE_SHIFT);
+ }
kiocb.ki_pos = *ppos;
if (type == READ)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 472c83156606..a024b32259bf 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -269,6 +269,12 @@ struct writeback_control;
#define IOCB_SYNC (1 << 5)
#define IOCB_WRITE (1 << 6)
+/*
+ * Steal 3 bits for stream information; this allows 8 valid streams
+ */
+#define IOCB_WRITE_LIFE_SHIFT 7
+#define IOCB_WRITE_LIFE_MASK (BIT(7) | BIT(8) | BIT(9))
+
struct kiocb {
struct file *ki_filp;
loff_t ki_pos;
@@ -292,6 +298,12 @@ static inline void init_sync_kiocb(struct kiocb *kiocb, struct file *filp)
};
}
+static inline int iocb_write_hint(const struct kiocb *iocb)
+{
+ return (iocb->ki_flags & IOCB_WRITE_LIFE_MASK) >>
+ IOCB_WRITE_LIFE_SHIFT;
+}
+
/*
* "descriptor" for what we're up to with a read.
* This allows us to use the same read code yet
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 8fb3b5a6e1ec..0d9d331d3b61 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -374,4 +374,14 @@ enum rw_hint {
#define RWF_DSYNC 0x00000002 /* per-IO O_DSYNC */
#define RWF_SYNC 0x00000004 /* per-IO O_SYNC */
+/*
+ * Data life time write flags, steal 3 bits for that
+ */
+#define RWF_WRITE_LIFE_SHIFT 4
+#define RWF_WRITE_LIFE_MASK 0x00000070 /* 3 bits of write hints */
+#define RWF_WRITE_LIFE_SHORT (WRITE_LIFE_SHORT << RWF_WRITE_LIFE_SHIFT)
+#define RWF_WRITE_LIFE_MEDIUM (WRITE_LIFE_MEDIUM << RWF_WRITE_LIFE_SHIFT)
+#define RWF_WRITE_LIFE_LONG (WRITE_LIFE_LONG << RWF_WRITE_LIFE_SHIFT)
+#define RWF_WRITE_LIFE_EXTREME (WRITE_LIFE_EXTREME << RWF_WRITE_LIFE_SHIFT)
+
#endif /* _UAPI_LINUX_FS_H */
--
2.7.4
* [PATCH 05/11] fs: add fcntl() interface for setting/getting write life time hints
From: Jens Axboe @ 2017-06-16 17:24 UTC
To: linux-fsdevel, linux-block; +Cc: adilger, hch, martin.petersen, Jens Axboe
We have a pwritev2(2) interface based on passing in flags. Add an
fcntl interface for querying these flags, and for setting them as well:
F_GET_RW_HINT Returns the read/write hint set. Right now it
will be one of the WRITE_LIFE_* values.
F_SET_RW_HINT Pass in rw_hint type to set the read/write hint.
Only WRITE_LIFE_* hints are currently supported.
Returns 0 on success, -1 otherwise.
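A hypothetical usage sketch; note that with this patch the hint is
passed by value in arg, and F_SET_RW_HINT/F_GET_RW_HINT come from the
patched <linux/fcntl.h>:

        #include <fcntl.h>

        /* tag all writes through this fd as short-lived, then read it back */
        static long set_and_check_hint(int fd)
        {
                if (fcntl(fd, F_SET_RW_HINT, WRITE_LIFE_SHORT) < 0)
                        return -1;

                /* returns the current WRITE_LIFE_* value */
                return fcntl(fd, F_GET_RW_HINT);
        }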
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
fs/fcntl.c | 38 ++++++++++++++++++++++++++++++++++++++
include/uapi/linux/fcntl.h | 6 ++++++
2 files changed, 44 insertions(+)
diff --git a/fs/fcntl.c b/fs/fcntl.c
index f4e7267d117f..417ce336c875 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -243,6 +243,40 @@ static int f_getowner_uids(struct file *filp, unsigned long arg)
}
#endif
+long fcntl_rw_hint(struct file *file, unsigned int cmd, unsigned long arg)
+{
+ struct inode *inode = file_inode(file);
+ long ret;
+
+ switch (cmd) {
+ case F_GET_RW_HINT:
+ ret = mask_to_write_hint(inode->i_flags, S_WRITE_LIFE_SHIFT);
+ break;
+ case F_SET_RW_HINT: {
+ enum rw_hint hint = arg;
+
+ switch (hint) {
+ case WRITE_LIFE_NONE:
+ case WRITE_LIFE_SHORT:
+ case WRITE_LIFE_MEDIUM:
+ case WRITE_LIFE_LONG:
+ case WRITE_LIFE_EXTREME:
+ inode_set_write_hint(inode, hint);
+ ret = 0;
+ break;
+ default:
+ ret = -EINVAL;
+ }
+ break;
+ }
+ default:
+ ret = -EINVAL;
+ break;
+ }
+
+ return ret;
+}
+
static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
struct file *filp)
{
@@ -337,6 +371,10 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
case F_GET_SEALS:
err = shmem_fcntl(filp, cmd, arg);
break;
+ case F_GET_RW_HINT:
+ case F_SET_RW_HINT:
+ err = fcntl_rw_hint(filp, cmd, arg);
+ break;
default:
break;
}
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index 813afd6eee71..bd24c9b48bd0 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -43,6 +43,12 @@
/* (1U << 31) is reserved for signed error codes */
/*
+ * Set/Get write life time hints
+ */
+#define F_GET_RW_HINT (F_LINUX_SPECIFIC_BASE + 11)
+#define F_SET_RW_HINT (F_LINUX_SPECIFIC_BASE + 20)
+
+/*
* Types of directory notifications that may be requested.
*/
#define DN_ACCESS 0x00000001 /* File accessed */
--
2.7.4
* [PATCH 06/11] fs: add O_DIRECT support for sending down write life time hints
From: Jens Axboe @ 2017-06-16 17:24 UTC
To: linux-fsdevel, linux-block; +Cc: adilger, hch, martin.petersen, Jens Axboe
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
fs/block_dev.c | 2 ++
fs/direct-io.c | 2 ++
fs/iomap.c | 1 +
3 files changed, 5 insertions(+)
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 519599dddd36..b9222a9d285e 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -239,6 +239,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
should_dirty = true;
} else {
bio.bi_opf = dio_bio_write_op(iocb);
+ bio.bi_opf |= write_hint_to_opf(iocb_write_hint(iocb));
task_io_account_write(ret);
}
@@ -374,6 +375,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
bio_set_pages_dirty(bio);
} else {
bio->bi_opf = dio_bio_write_op(iocb);
+ bio->bi_opf |= write_hint_to_opf(iocb_write_hint(iocb));
task_io_account_write(bio->bi_iter.bi_size);
}
diff --git a/fs/direct-io.c b/fs/direct-io.c
index a04ebea77de8..1253b059eae6 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -386,6 +386,8 @@ dio_bio_alloc(struct dio *dio, struct dio_submit *sdio,
else
bio->bi_end_io = dio_bio_end_io;
+ bio->bi_opf |= write_hint_to_opf(iocb_write_hint(dio->iocb));
+
sdio->bio = bio;
sdio->logical_offset_in_bio = sdio->cur_page_fs_offset;
}
diff --git a/fs/iomap.c b/fs/iomap.c
index 4b10892967a5..e8639861bd80 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -804,6 +804,7 @@ iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
if (dio->flags & IOMAP_DIO_WRITE) {
bio_set_op_attrs(bio, REQ_OP_WRITE, REQ_SYNC | REQ_IDLE);
+ bio->bi_opf |= write_hint_to_opf(inode_write_hint(inode));
task_io_account_write(bio->bi_iter.bi_size);
} else {
bio_set_op_attrs(bio, REQ_OP_READ, 0);
--
2.7.4
* [PATCH 07/11] fs: add support for buffered writeback to pass down write hints
From: Jens Axboe @ 2017-06-16 17:24 UTC
To: linux-fsdevel, linux-block; +Cc: adilger, hch, martin.petersen, Jens Axboe
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
fs/buffer.c | 14 +++++++++-----
fs/mpage.c | 1 +
2 files changed, 10 insertions(+), 5 deletions(-)
diff --git a/fs/buffer.c b/fs/buffer.c
index 161be58c5cb0..187e54fb0382 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -49,7 +49,7 @@
static int fsync_buffers_list(spinlock_t *lock, struct list_head *list);
static int submit_bh_wbc(int op, int op_flags, struct buffer_head *bh,
- struct writeback_control *wbc);
+ unsigned int stream, struct writeback_control *wbc);
#define BH_ENTRY(list) list_entry((list), struct buffer_head, b_assoc_buffers)
@@ -1829,7 +1829,8 @@ int __block_write_full_page(struct inode *inode, struct page *page,
do {
struct buffer_head *next = bh->b_this_page;
if (buffer_async_write(bh)) {
- submit_bh_wbc(REQ_OP_WRITE, write_flags, bh, wbc);
+ submit_bh_wbc(REQ_OP_WRITE, write_flags, bh,
+ inode_write_hint(inode), wbc);
nr_underway++;
}
bh = next;
@@ -1883,7 +1884,8 @@ int __block_write_full_page(struct inode *inode, struct page *page,
struct buffer_head *next = bh->b_this_page;
if (buffer_async_write(bh)) {
clear_buffer_dirty(bh);
- submit_bh_wbc(REQ_OP_WRITE, write_flags, bh, wbc);
+ submit_bh_wbc(REQ_OP_WRITE, write_flags, bh,
+ inode_write_hint(inode), wbc);
nr_underway++;
}
bh = next;
@@ -3091,7 +3093,7 @@ void guard_bio_eod(int op, struct bio *bio)
}
static int submit_bh_wbc(int op, int op_flags, struct buffer_head *bh,
- struct writeback_control *wbc)
+ unsigned int write_hint, struct writeback_control *wbc)
{
struct bio *bio;
@@ -3134,6 +3136,8 @@ static int submit_bh_wbc(int op, int op_flags, struct buffer_head *bh,
op_flags |= REQ_META;
if (buffer_prio(bh))
op_flags |= REQ_PRIO;
+
+ op_flags |= write_hint_to_opf(write_hint);
bio_set_op_attrs(bio, op, op_flags);
submit_bio(bio);
@@ -3142,7 +3146,7 @@ static int submit_bh_wbc(int op, int op_flags, struct buffer_head *bh,
int submit_bh(int op, int op_flags, struct buffer_head *bh)
{
- return submit_bh_wbc(op, op_flags, bh, NULL);
+ return submit_bh_wbc(op, op_flags, bh, 0, NULL);
}
EXPORT_SYMBOL(submit_bh);
diff --git a/fs/mpage.c b/fs/mpage.c
index baff8f820c29..e84dcaee6ec7 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -614,6 +614,7 @@ static int __mpage_writepage(struct page *page, struct writeback_control *wbc,
goto confused;
wbc_init_bio(wbc, bio);
+ bio->bi_opf |= write_hint_to_opf(inode_write_hint(inode));
}
/*
--
2.7.4
* [PATCH 08/11] ext4: add support for passing in write hints for buffered writes
From: Jens Axboe @ 2017-06-16 17:24 UTC
To: linux-fsdevel, linux-block; +Cc: adilger, hch, martin.petersen, Jens Axboe
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
fs/ext4/page-io.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index 1a82138ba739..fbc04a8cb35a 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -349,6 +349,7 @@ void ext4_io_submit(struct ext4_io_submit *io)
if (bio) {
int io_op_flags = io->io_wbc->sync_mode == WB_SYNC_ALL ?
REQ_SYNC : 0;
+ io_op_flags |= write_hint_to_opf(inode_write_hint(io->io_end->inode));
bio_set_op_attrs(io->io_bio, REQ_OP_WRITE, io_op_flags);
submit_bio(io->io_bio);
}
@@ -396,6 +397,7 @@ static int io_submit_add_bh(struct ext4_io_submit *io,
ret = io_submit_init_bio(io, bh);
if (ret)
return ret;
+ io->io_bio->bi_opf |= write_hint_to_opf(inode_write_hint(inode));
}
ret = bio_add_page(io->io_bio, page, bh->b_size, bh_offset(bh));
if (ret != bh->b_size)
--
2.7.4
* [PATCH 09/11] xfs: add support for passing in write hints for buffered writes
From: Jens Axboe @ 2017-06-16 17:24 UTC
To: linux-fsdevel, linux-block; +Cc: adilger, hch, martin.petersen, Jens Axboe
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
fs/xfs/xfs_aops.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 09af0f7cd55e..040789c51930 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -505,6 +505,7 @@ xfs_submit_ioend(
return status;
}
+ ioend->io_bio->bi_opf |= write_hint_to_opf(inode_write_hint(ioend->io_inode));
submit_bio(ioend->io_bio);
return 0;
}
@@ -564,6 +565,7 @@ xfs_chain_bio(
bio_chain(ioend->io_bio, new);
bio_get(ioend->io_bio); /* for xfs_destroy_ioend */
ioend->io_bio->bi_opf = REQ_OP_WRITE | wbc_to_write_flags(wbc);
+ ioend->io_bio->bi_opf |= write_hint_to_opf(inode_write_hint(ioend->io_inode));
submit_bio(ioend->io_bio);
ioend->io_bio = new;
}
--
2.7.4
* [PATCH 10/11] btrfs: add support for passing in write hints for buffered writes
From: Jens Axboe @ 2017-06-16 17:24 UTC
To: linux-fsdevel, linux-block; +Cc: adilger, hch, martin.petersen, Jens Axboe
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
fs/btrfs/extent_io.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index d3619e010005..92e75e4014ea 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2826,6 +2826,7 @@ static int submit_extent_page(int op, int op_flags, struct extent_io_tree *tree,
bio_add_page(bio, page, page_size, offset);
bio->bi_end_io = end_io_func;
bio->bi_private = tree;
+ op_flags |= write_hint_to_opf(inode_write_hint(page->mapping->host));
bio_set_op_attrs(bio, op, op_flags);
if (wbc) {
wbc_init_bio(wbc, bio);
--
2.7.4
* [PATCH 11/11] nvme: add support for streams and directives
From: Jens Axboe @ 2017-06-16 17:24 UTC
To: linux-fsdevel, linux-block; +Cc: adilger, hch, martin.petersen, Jens Axboe
This adds support for Directives in NVMe, in particular for the Streams
directive. Support for Directives is a new feature in NVMe 1.3. It
allows a user to pass in information about where to store the data, so
that the device can do so most efficiently. If an application is
managing and writing data with different life times, mixing data with
different retention times onto the same locations on flash can cause
write amplification to grow. This, in turn, will reduce performance and
the life time of the device.
We default to allocating 4 streams per name space, but it is
configurable with the 'streams_per_ns' module option. If a write stream
is set in a write, flag it as such before sending it to the device. The
streams are allocated lazily - if we get a write request with a life
time hint, we allocate streams in the background and use them once that
is done.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
drivers/nvme/host/core.c | 168 +++++++++++++++++++++++++++++++++++++++++++++++
drivers/nvme/host/nvme.h | 5 ++
include/linux/nvme.h | 48 ++++++++++++++
3 files changed, 221 insertions(+)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 903d5813023a..48a5acea5105 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -65,6 +65,10 @@ static bool force_apst;
module_param(force_apst, bool, 0644);
MODULE_PARM_DESC(force_apst, "allow APST for newly enumerated devices even if quirked off");
+static char streams_per_ns = 4;
+module_param(streams_per_ns, byte, 0644);
+MODULE_PARM_DESC(streams_per_ns, "if available, allocate this many streams per NS");
+
static LIST_HEAD(nvme_ctrl_list);
static DEFINE_SPINLOCK(dev_list_lock);
@@ -331,9 +335,152 @@ static inline int nvme_setup_discard(struct nvme_ns *ns, struct request *req,
return BLK_MQ_RQ_QUEUE_OK;
}
+static int nvme_enable_streams(struct nvme_ctrl *ctrl)
+{
+ struct nvme_command c;
+
+ memset(&c, 0, sizeof(c));
+
+ c.directive.opcode = nvme_admin_directive_send;
+ c.directive.nsid = cpu_to_le32(0xffffffff);
+ c.directive.doper = NVME_DIR_SND_ID_OP_ENABLE;
+ c.directive.dtype = NVME_DIR_IDENTIFY;
+ c.directive.tdtype = NVME_DIR_STREAMS;
+ c.directive.endir = NVME_DIR_ENDIR;
+
+ return nvme_submit_sync_cmd(ctrl->admin_q, &c, NULL, 0);
+}
+
+static int nvme_probe_directives(struct nvme_ctrl *ctrl)
+{
+ struct streams_directive_params s;
+ struct nvme_command c;
+ int ret;
+
+ if (!(ctrl->oacs & NVME_CTRL_OACS_DIRECTIVES))
+ return 0;
+
+ ret = nvme_enable_streams(ctrl);
+ if (ret)
+ return ret;
+
+ memset(&c, 0, sizeof(c));
+ memset(&s, 0, sizeof(s));
+
+ c.directive.opcode = nvme_admin_directive_recv;
+ c.directive.nsid = cpu_to_le32(0xffffffff);
+ c.directive.numd = sizeof(s);
+ c.directive.doper = NVME_DIR_RCV_ST_OP_PARAM;
+ c.directive.dtype = NVME_DIR_STREAMS;
+
+ ret = nvme_submit_sync_cmd(ctrl->admin_q, &c, &s, sizeof(s));
+ if (ret)
+ return ret;
+
+ ctrl->nssa = le16_to_cpu(s.nssa);
+ return 0;
+}
+
+/*
+ * Returns number of streams allocated for use by this ns, or -1 on error.
+ */
+static int nvme_streams_allocate(struct nvme_ns *ns, unsigned int streams)
+{
+ struct nvme_command c;
+ union nvme_result res;
+ int ret;
+
+ memset(&c, 0, sizeof(c));
+
+ c.directive.opcode = nvme_admin_directive_recv;
+ c.directive.nsid = cpu_to_le32(ns->ns_id);
+ c.directive.doper = NVME_DIR_RCV_ST_OP_RESOURCE;
+ c.directive.dtype = NVME_DIR_STREAMS;
+ c.directive.endir = streams;
+
+ ret = __nvme_submit_sync_cmd(ns->ctrl->admin_q, &c, &res, NULL, 0, 0,
+ NVME_QID_ANY, 0, 0);
+ if (ret)
+ return -1;
+
+ return le32_to_cpu(res.u32) & 0xffff;
+}
+
+static int nvme_streams_deallocate(struct nvme_ns *ns)
+{
+ struct nvme_command c;
+
+ memset(&c, 0, sizeof(c));
+
+ c.directive.opcode = nvme_admin_directive_send;
+ c.directive.nsid = cpu_to_le32(ns->ns_id);
+ c.directive.doper = NVME_DIR_SND_ST_OP_REL_RSC;
+ c.directive.dtype = NVME_DIR_STREAMS;
+
+ return nvme_submit_sync_cmd(ns->ctrl->admin_q, &c, NULL, 0);
+}
+
+static void nvme_write_hint_work(struct work_struct *work)
+{
+ struct nvme_ns *ns = container_of(work, struct nvme_ns, write_hint_work);
+ int ret, nr_streams;
+
+ if (ns->nr_streams)
+ return;
+
+ nr_streams = streams_per_ns;
+ if (nr_streams > ns->ctrl->nssa)
+ nr_streams = ns->ctrl->nssa;
+
+ ret = nvme_streams_allocate(ns, nr_streams);
+ if (ret <= 0)
+ goto err;
+
+ ns->nr_streams = ret;
+ dev_info(ns->ctrl->device, "successfully enabled %d streams\n", ret);
+ return;
+err:
+ dev_info(ns->ctrl->device, "failed enabling streams\n");
+ ns->ctrl->failed_streams = true;
+}
+
+static void nvme_configure_streams(struct nvme_ns *ns)
+{
+ /*
+ * If we already called this function, we've either marked it
+ * as a failure or set the number of streams.
+ */
+ if (ns->ctrl->failed_streams)
+ return;
+ if (ns->nr_streams)
+ return;
+ schedule_work(&ns->write_hint_work);
+}
+
+static enum rw_hint nvme_get_write_stream(struct nvme_ns *ns,
+ struct request *req)
+{
+ enum rw_hint streamid = req->cmd_flags & REQ_WRITE_LIFE_MASK;
+
+ if (req_op(req) != REQ_OP_WRITE ||
+ !(ns->ctrl->oacs & NVME_CTRL_OACS_DIRECTIVES))
+ return WRITE_LIFE_NONE;
+
+ streamid = req->cmd_flags & REQ_WRITE_LIFE_MASK;
+ if (streamid < ARRAY_SIZE(req->q->write_hints))
+ req->q->write_hints[streamid] += blk_rq_bytes(req) >> 9;
+
+ if (streamid <= ns->nr_streams)
+ return streamid;
+
+ /* for now just round-robin, do something more clever later */
+ return (streamid % ns->nr_streams) + 1;
+}
+
static inline void nvme_setup_rw(struct nvme_ns *ns, struct request *req,
struct nvme_command *cmnd)
{
+ enum rw_hint stream;
u16 control = 0;
u32 dsmgmt = 0;
@@ -351,6 +498,20 @@ static inline void nvme_setup_rw(struct nvme_ns *ns, struct request *req,
cmnd->rw.slba = cpu_to_le64(nvme_block_nr(ns, blk_rq_pos(req)));
cmnd->rw.length = cpu_to_le16((blk_rq_bytes(req) >> ns->lba_shift) - 1);
+ /*
+ * If we support streams and this request is a write with a valid
+ * hint, then flag it as such. If we haven't allocated streams on
+ * this ns before, do so lazily.
+ */
+ stream = nvme_get_write_stream(ns, req);
+ if (stream != WRITE_LIFE_NONE) {
+ if (ns->nr_streams) {
+ control |= NVME_RW_DTYPE_STREAMS;
+ dsmgmt |= (stream << 16);
+ } else
+ nvme_configure_streams(ns);
+ }
+
if (ns->ms) {
switch (ns->pi_type) {
case NVME_NS_DPS_PI_TYPE3:
@@ -1650,6 +1811,7 @@ int nvme_init_identify(struct nvme_ctrl *ctrl)
dev_pm_qos_hide_latency_tolerance(ctrl->device);
nvme_configure_apst(ctrl);
+ nvme_probe_directives(ctrl);
ctrl->identified = true;
@@ -2049,6 +2211,8 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
blk_queue_logical_block_size(ns->queue, 1 << ns->lba_shift);
nvme_set_queue_limits(ctrl, ns->queue);
+ INIT_WORK(&ns->write_hint_work, nvme_write_hint_work);
+
sprintf(disk_name, "nvme%dn%d", ctrl->instance, ns->instance);
if (nvme_revalidate_ns(ns, &id))
@@ -2105,6 +2269,8 @@ static void nvme_ns_remove(struct nvme_ns *ns)
if (test_and_set_bit(NVME_NS_REMOVING, &ns->flags))
return;
+ flush_work(&ns->write_hint_work);
+
if (ns->disk && ns->disk->flags & GENHD_FL_UP) {
if (blk_get_integrity(ns->disk))
blk_integrity_unregister(ns->disk);
@@ -2112,6 +2278,8 @@ static void nvme_ns_remove(struct nvme_ns *ns)
&nvme_ns_attr_group);
if (ns->ndev)
nvme_nvm_unregister_sysfs(ns);
+ if (ns->nr_streams)
+ nvme_streams_deallocate(ns);
del_gendisk(ns->disk);
blk_cleanup_queue(ns->queue);
}
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 9d6a070d4391..918b6126d38b 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -118,6 +118,7 @@ enum nvme_ctrl_state {
struct nvme_ctrl {
enum nvme_ctrl_state state;
bool identified;
+ bool failed_streams;
spinlock_t lock;
const struct nvme_ctrl_ops *ops;
struct request_queue *admin_q;
@@ -147,6 +148,7 @@ struct nvme_ctrl {
u16 oncs;
u16 vid;
u16 oacs;
+ u16 nssa;
atomic_t abort_limit;
u8 event_limit;
u8 vwc;
@@ -192,6 +194,7 @@ struct nvme_ns {
u8 uuid[16];
unsigned ns_id;
+ unsigned nr_streams;
int lba_shift;
u16 ms;
bool ext;
@@ -203,6 +206,8 @@ struct nvme_ns {
u64 mode_select_num_blocks;
u32 mode_select_block_len;
+
+ struct work_struct write_hint_work;
};
struct nvme_ctrl_ops {
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index b625bacf37ef..8b2f5b140134 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -245,6 +245,7 @@ enum {
NVME_CTRL_ONCS_WRITE_ZEROES = 1 << 3,
NVME_CTRL_VWC_PRESENT = 1 << 0,
NVME_CTRL_OACS_SEC_SUPP = 1 << 0,
+ NVME_CTRL_OACS_DIRECTIVES = 1 << 5,
NVME_CTRL_OACS_DBBUF_SUPP = 1 << 7,
};
@@ -295,6 +296,19 @@ enum {
};
enum {
+ NVME_DIR_IDENTIFY = 0x00,
+ NVME_DIR_STREAMS = 0x01,
+ NVME_DIR_SND_ID_OP_ENABLE = 0x01,
+ NVME_DIR_SND_ST_OP_REL_ID = 0x01,
+ NVME_DIR_SND_ST_OP_REL_RSC = 0x02,
+ NVME_DIR_RCV_ID_OP_PARAM = 0x01,
+ NVME_DIR_RCV_ST_OP_PARAM = 0x01,
+ NVME_DIR_RCV_ST_OP_STATUS = 0x02,
+ NVME_DIR_RCV_ST_OP_RESOURCE = 0x03,
+ NVME_DIR_ENDIR = 0x01,
+};
+
+enum {
NVME_NS_FEAT_THIN = 1 << 0,
NVME_NS_FLBAS_LBA_MASK = 0xf,
NVME_NS_FLBAS_META_EXT = 0x10,
@@ -535,6 +549,7 @@ enum {
NVME_RW_PRINFO_PRCHK_APP = 1 << 11,
NVME_RW_PRINFO_PRCHK_GUARD = 1 << 12,
NVME_RW_PRINFO_PRACT = 1 << 13,
+ NVME_RW_DTYPE_STREAMS = 1 << 4,
};
struct nvme_dsm_cmd {
@@ -604,6 +619,8 @@ enum nvme_admin_opcode {
nvme_admin_download_fw = 0x11,
nvme_admin_ns_attach = 0x15,
nvme_admin_keep_alive = 0x18,
+ nvme_admin_directive_send = 0x19,
+ nvme_admin_directive_recv = 0x1a,
nvme_admin_dbbuf = 0x7C,
nvme_admin_format_nvm = 0x80,
nvme_admin_security_send = 0x81,
@@ -756,6 +773,24 @@ struct nvme_get_log_page_command {
__u32 rsvd14[2];
};
+struct nvme_directive_cmd {
+ __u8 opcode;
+ __u8 flags;
+ __u16 command_id;
+ __le32 nsid;
+ __u64 rsvd2[2];
+ union nvme_data_ptr dptr;
+ __le32 numd;
+ __u8 doper;
+ __u8 dtype;
+ __le16 dspec;
+ __u8 endir;
+ __u8 tdtype;
+ __u16 rsvd15;
+
+ __u32 rsvd16[3];
+};
+
/*
* Fabrics subcommands.
*/
@@ -886,6 +921,18 @@ struct nvme_dbbuf {
__u32 rsvd12[6];
};
+struct streams_directive_params {
+ __u16 msl;
+ __u16 nssa;
+ __u16 nsso;
+ __u8 rsvd[10];
+ __u32 sws;
+ __u16 sgs;
+ __u16 nsa;
+ __u16 nso;
+ __u8 rsvd2[6];
+};
+
struct nvme_command {
union {
struct nvme_common_command common;
@@ -906,6 +953,7 @@ struct nvme_command {
struct nvmf_property_set_command prop_set;
struct nvmf_property_get_command prop_get;
struct nvme_dbbuf dbbuf;
+ struct nvme_directive_cmd directive;
};
};
--
2.7.4
* Re: [PATCH 11/11] nvme: add support for streams and directives
From: Christoph Hellwig @ 2017-06-16 18:09 UTC
To: Jens Axboe; +Cc: linux-fsdevel, linux-block, adilger, hch, martin.petersen
> We default to allocating 4 streams per name space, but it is
> configurable with the 'streams_per_ns' module option.
What's your strategy for multi-namespace devices? You won't even get
your 4 streams for a handful of namespaces with devices today or in the near
future. Should we cut down the number of streams per namespace? Or
just not set aside streams for namespace exclusive use? Or so far
you simply don't care about the multi-ns case?
> +static void nvme_write_hint_work(struct work_struct *work)
> +{
struct nvme_ns *ns = container_of(work, struct nvme_ns, write_hint_work);
nasty 81 character line :)
> + nr_streams = streams_per_ns;
> + if (nr_streams > ns->ctrl->nssa)
> + nr_streams = ns->ctrl->nssa;
min() ?
> +static enum rw_hint nvme_get_write_stream(struct nvme_ns *ns,
> + struct request *req)
> +{
> + enum rw_hint streamid = req->cmd_flags & REQ_WRITE_LIFE_MASK;
> +
> + if (req_op(req) != REQ_OP_WRITE ||
> + !(ns->ctrl->oacs & NVME_CTRL_OACS_DIRECTIVES))
> + return WRITE_LIFE_NONE;
How about moving this to the caller?
> + /*
> + * If we support streams and this request is a write with a valid
> + * hint, then flag it as such. If we haven't allocated streams on
> + * this ns before, do so lazily.
> + */
> + stream = nvme_get_write_stream(ns, req);
> + if (stream != WRITE_LIFE_NONE) {
> + if (ns->nr_streams) {
> + control |= NVME_RW_DTYPE_STREAMS;
> + dsmgmt |= (stream << 16);
> + } else
> + nvme_configure_streams(ns);
> + }
.. and instead pass control and dsmgmt to nvme_get_write_stream by
reference to isolate the functionality there. And move the
nvme_configure_streams call into it as well.
Last but not least this will need some tweaks in the deallocate
code due to:
"If the host issues a Dataset Management command to deallocate logical
blocks that are associated with a stream, it should specify a starting
LBA and length that is aligned to and in multiples of the Stream
Granularity Size"
* Re: [PATCH 11/11] nvme: add support for streams and directives
From: Jens Axboe @ 2017-06-16 19:41 UTC
To: Christoph Hellwig; +Cc: linux-fsdevel, linux-block, adilger, martin.petersen
On 06/16/2017 12:09 PM, Christoph Hellwig wrote:
>> We default to allocating 4 streams per name space, but it is
>> configurable with the 'streams_per_ns' module option.
>
> What's your strategy for multi-namespace devices? You won't even get
> your 4 streams for a handful namespaces with devices today or the near
> future. Should we cut down the number of streams per namespace? Or
> just not set aside streams for namespace exclusive use? Or so far
> you simply don't care about the multi-ns case?
I'm assuming most devices will have 8 streams, or 16. So for 2 or 4
namespaces, we'd be fine.
But honestly, I simply don't care too much about it, for a few reasons:
1) I don't know of anyone that uses name spaces to split up a device,
except for places where they want integrity on one and not the other.
And I haven't even seen that.
2) If you do have multiple name spaces, you get some data separation
through that. Or not, depending on the device.
In any case, we'll just have to divide up the streams. Right now we just
allocate 4 on a first-come, first-served basis. If you have more name
spaces than streams/4, then some name spaces don't get streams. We could
make this more fair and do:
streams_per_ns = ctrl->nssa / num_namespaces;
I'm not sure what the best answer is here, or if there even is one known
right answer for this, as it will depend on the use case. That's
something we can change down the line without impacting anything, so I'm
not too worried about that.
>> +static void nvme_write_hint_work(struct work_struct *work)
>> +{
> struct nvme_ns *ns = container_of(work, struct nvme_ns, write_hint_work);
>
> nasty 81 character line :)
Dear lord, I'll fix that up.
>> + nr_streams = streams_per_ns;
>> + if (nr_streams > ns->ctrl->nssa)
>> + nr_streams = ns->ctrl->nssa;
>
> min() ?
Yep, let's use that.
>> +static enum rw_hint nvme_get_write_stream(struct nvme_ns *ns,
>> + struct request *req)
>> +{
>> + enum rw_hint streamid = req->cmd_flags & REQ_WRITE_LIFE_MASK;
>> +
>> + if (req_op(req) != REQ_OP_WRITE ||
>> + !(ns->ctrl->oacs & NVME_CTRL_OACS_DIRECTIVES))
>> + return WRITE_LIFE_NONE;
>
> How about moving this to the caller?
>
>> + /*
>> + * If we support streams and this request is a write with a valid
>> + * hint, then flag it as such. If we haven't allocated streams on
>> + * this ns before, do so lazily.
>> + */
>> + stream = nvme_get_write_stream(ns, req);
>> + if (stream != WRITE_LIFE_NONE) {
>> + if (ns->nr_streams) {
>> + control |= NVME_RW_DTYPE_STREAMS;
>> + dsmgmt |= (stream << 16);
>> + } else
>> + nvme_configure_streams(ns);
>> + }
>
> .. and instead pass control and dsmgmt to nvme_get_write_stream by
> reference to isolate the functionality there. And move the
> nvme_configure_streams call into it as well.
OK, I can make those two changes, fine with me.
> Last but not least this will need some tweaks in the deallocate
> code due to:
>
> "If the host issues a Dataset Management command to deallocate logical
> blocks that are associated with a stream, it should specify a starting
> LBA and length that is aligned to and in multiples of the Stream
> Granularity Size"
Do we use that internally?
--
Jens Axboe
* Re: [PATCH 11/11] nvme: add support for streams and directives
From: Jens Axboe @ 2017-06-16 19:56 UTC
To: Christoph Hellwig; +Cc: linux-fsdevel, linux-block, adilger, martin.petersen
On 06/16/2017 01:41 PM, Jens Axboe wrote:
>> .. and instead pass control and dsmgmt to nvme_get_write_stream by
>> reference to isolate the functionality there. And move the
>> nvme_configure_streams call into it as well.
>
> OK, I can make those two changes, fine with me.
See below, don't want to spam with a new version. Outside of this, are
we agreed that it looks OK now from an interface point of view?
Do we want the enum rw_hint more polished and split up? Right now it's
carried through untouched. I'd rather put that in the flags in the
bio/request for two reasons:
1) We have room for the 3 bits. We can always move it to separate
members if we have to for more hints. I'm always reluctant to add to
either struct, as adding is much easier than removing.
2) If it's in the flags, we get it automatically all the way through the
stack. If it's separate members, we have to touch a bunch of places
to propagate it from bio to request, or from bio to bio when we clone
and split.
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 48a5acea5105..d5105eb9dc49 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -422,16 +422,19 @@ static int nvme_streams_deallocate(struct nvme_ns *ns)
static void nvme_write_hint_work(struct work_struct *work)
{
- struct nvme_ns *ns = container_of(work, struct nvme_ns, write_hint_work);
int ret, nr_streams;
+ struct nvme_ns *ns;
+ ns = container_of(work, struct nvme_ns, write_hint_work);
if (ns->nr_streams)
return;
- nr_streams = streams_per_ns;
- if (nr_streams > ns->ctrl->nssa)
- nr_streams = ns->ctrl->nssa;
-
+ /*
+ * For now, always ask for 'streams_per_ns' number of streams. It's
+ * possible that we will then run out of streams if we have many
+ * name spaces. Different policies could be implemented.
+ */
+ nr_streams = min_t(unsigned int, streams_per_ns, ns->ctrl->nssa);
ret = nvme_streams_allocate(ns, nr_streams);
if (ret <= 0)
goto err;
@@ -457,30 +460,41 @@ static void nvme_configure_streams(struct nvme_ns *ns)
schedule_work(&ns->write_hint_work);
}
-static enum rw_hint nvme_get_write_stream(struct nvme_ns *ns,
- struct request *req)
+/*
+ * Check if 'req' has a write hint associated with it. If it does, assign
+ * a valid namespace stream to the write. If we haven't setup streams yet,
+ * kick off configuration and ignore the hints until that has completed.
+ */
+static void nvme_assign_write_stream(struct nvme_ns *ns, struct request *req,
+ u16 *control, u32 *dsmgmt)
{
- enum rw_hint streamid = req->cmd_flags & REQ_WRITE_LIFE_MASK;
+ enum rw_hint streamid;
+
+ streamid = (req->cmd_flags & REQ_WRITE_LIFE_MASK)
+ >> __REQ_WRITE_HINT_SHIFT;
+ if (streamid == WRITE_LIFE_NONE)
+ return;
+
+ if (!ns->nr_streams) {
+ nvme_configure_streams(ns);
+ return;
+ }
- if (req_op(req) != REQ_OP_WRITE ||
- !(ns->ctrl->oacs & NVME_CTRL_OACS_DIRECTIVES))
- return WRITE_LIFE_NONE;
+ /* for now just round-robin, do something more clever later */
+ if (streamid > ns->nr_streams)
+ streamid = (streamid % ns->nr_streams) + 1;
- streamid = req->cmd_flags & REQ_WRITE_LIFE_MASK;
if (streamid < ARRAY_SIZE(req->q->write_hints))
req->q->write_hints[streamid] += blk_rq_bytes(req) >> 9;
- if (streamid <= ns->nr_streams)
- return streamid;
-
- /* for now just round-robin, do something more clever later */
- return (streamid % ns->nr_streams) + 1;
+ *control |= NVME_RW_DTYPE_STREAMS;
+ *dsmgmt |= streamid << 16;
}
static inline void nvme_setup_rw(struct nvme_ns *ns, struct request *req,
struct nvme_command *cmnd)
{
- enum rw_hint stream;
+ struct nvme_ctrl *ctrl = ns->ctrl;
u16 control = 0;
u32 dsmgmt = 0;
@@ -498,19 +512,9 @@ static inline void nvme_setup_rw(struct nvme_ns *ns, struct request *req,
cmnd->rw.slba = cpu_to_le64(nvme_block_nr(ns, blk_rq_pos(req)));
cmnd->rw.length = cpu_to_le16((blk_rq_bytes(req) >> ns->lba_shift) - 1);
- /*
- * If we support streams and this request is a write with a valid
- * hint, then flag it as such. If we haven't allocated streams on
- * this ns before, do so lazily.
- */
- stream = nvme_get_write_stream(ns, req);
- if (stream != WRITE_LIFE_NONE) {
- if (ns->nr_streams) {
- control |= NVME_RW_DTYPE_STREAMS;
- dsmgmt |= (stream << 16);
- } else
- nvme_configure_streams(ns);
- }
+ if (req_op(req) == REQ_OP_WRITE &&
+ (ctrl->oacs & NVME_CTRL_OACS_DIRECTIVES))
+ nvme_assign_write_stream(ns, req, &control, &dsmgmt);
if (ns->ms) {
switch (ns->pi_type) {
--
Jens Axboe
* Re: [PATCH 11/11] nvme: add support for streams and directives
From: Christoph Hellwig @ 2017-06-17 12:21 UTC
To: Jens Axboe
Cc: Christoph Hellwig, linux-fsdevel, linux-block, adilger,
martin.petersen
On Fri, Jun 16, 2017 at 01:41:36PM -0600, Jens Axboe wrote:
> But honestly, I simply don't care too much about it, for a few reasons:
>
> 1) I don't know of anyone that uses name spaces to split up a device,
> except for places where the want integrity on one and not the other.
> And I haven't even seen that.
>
> 2) If you do have multiple name spaces, you get some data separation
> through that. Or not, depending on the device.
>
> In any case, we'll just have to divide up the streams. Right now we just
> allocate 4 in a first-come first-serve basis. If you have more name
> spaces than streams/4, then some name spaces don't get streams. We could
> make this more fair and do:
Or just not reserve streams for exclusive use of the namespace and
use them directly from the global pool. Especially if you ever want
to use streams with shared disk filesystems or databases you'd have
to do that anyway. And due to our hardcoded streams values that
will in fact work nicely. And it would get rid of the lazy setup
code as well.
> > "If the host issues a Dataset Management command to deallocate logical
> > blocks that are associated with a stream, it should specify a starting
> > LBA and length that is aligned to and in multiples of the Stream
> > Granularity Size"
>
> Do we use that internally?
I don't understand the comment. If we issue a deallocate (discard)
we should align it to the Stream Granularity Size as exposed by the
device, easy enough. While we're at it we should probably also
expose the Stream Write Size as io_min and the Stream Granularity Size
as io_opt in our I/O topology information.
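A rough sketch of that wiring, assuming hypothetical ns->sws and
ns->sgs fields filled in from the streams parameters (the spec defines
SGS in SWS units):

        if (ns->sws) {
                unsigned int bs = 1 << ns->lba_shift;

                blk_queue_io_min(ns->queue, bs * ns->sws);
                if (ns->sgs)
                        blk_queue_io_opt(ns->queue, bs * ns->sws * ns->sgs);
        }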
* Re: [PATCH 11/11] nvme: add support for streams and directives
From: Jens Axboe @ 2017-06-17 14:20 UTC
To: Christoph Hellwig; +Cc: linux-fsdevel, linux-block, adilger, martin.petersen
On 06/17/2017 06:21 AM, Christoph Hellwig wrote:
> On Fri, Jun 16, 2017 at 01:41:36PM -0600, Jens Axboe wrote:
>> But honestly, I simply don't care too much about it, for a few reasons:
>>
>> 1) I don't know of anyone that uses name spaces to split up a device,
>> except for places where the want integrity on one and not the other.
>> And I haven't even seen that.
>>
>> 2) If you do have multiple name spaces, you get some data separation
>> through that. Or not, depending on the device.
>>
>> In any case, we'll just have to divide up the streams. Right now we just
>> allocate 4 in a first-come first-serve basis. If you have more name
>> spaces than streams/4, then some name spaces don't get streams. We could
>> make this more fair and do:
>
> Or just not reserve streams for exclusive use of the namespace and
> use them directly from the global pool. Especially if you ever want
> to use streams with shared disk filesystems or databases you'd have
> to do that anyway. And due to our hardcoded streams values that
> will in fact work nicely. And it would get rid of the lazy setup
> code as well.
We can certainly go that route. So you'd be fine with allocating 4
streams controller-wide by default, and dumping the lazy alloc? We can make
this depend on the streams module parameter, so people could turn it
off, if needed.
>>> "If the host issues a Dataset Management command to deallocate logical
>>> blocks that are associated with a stream, it should specify a starting
>>> LBA and length that is aligned to and in multiples of the Stream
>>> Granularity Size"
>>
>> Do we use that internally?
>
> I don't understand the comment. If we issue a deallocate (discard)
> we should align it to the Stream Granularity Size as exposed by the
> device, easy enough. While we're at it we should probably also
> expose the Stream Write Size as io_min and the Stream Granularity Size
> as io_opt in our I/O topology information while at it.
OK, that's easy enough to do, if we enable streams on a controller basis
at probe time.
--
Jens Axboe
* Re: [PATCH 11/11] nvme: add support for streams and directives
From: Christoph Hellwig @ 2017-06-17 15:03 UTC
To: Jens Axboe
Cc: Christoph Hellwig, linux-fsdevel, linux-block, adilger,
martin.petersen
On Sat, Jun 17, 2017 at 08:20:01AM -0600, Jens Axboe wrote:
> We can certainly go that route. So you'd be fine with allocating 4
> streams controller wide by default, and dump the lazy alloc? We can make
> this depend on the streams module parameter, so people could turn it
> off, if needed.
We don't even need to allocate the streams - streams are implicitly
allocated on first use:
"Stream resources that are not allocated for the exclusive use of any
namespace are available NVM subsystem stream resources as reported in
NVM Subsystem Streams Available (NSSA) and may be used by any namespace
that has the Streams Directive enabled and has not been allocated
exclusive stream resources in response to an Allocate Resources
operation."
so the only thing you need to do is to enable streams on the namespace,
or just for the whole controller using nsid=0xffffffff
* Re: [PATCH 11/11] nvme: add support for streams and directives
From: Jens Axboe @ 2017-06-17 15:11 UTC
To: Christoph Hellwig; +Cc: linux-fsdevel, linux-block, adilger, martin.petersen
On 06/17/2017 09:03 AM, Christoph Hellwig wrote:
> On Sat, Jun 17, 2017 at 08:20:01AM -0600, Jens Axboe wrote:
>> We can certainly go that route. So you'd be fine with allocating 4
>> streams controller wide by default, and dump the lazy alloc? We can make
>> this depend on the streams module parameter, so people could turn it
>> off, if needed.
>
> We don't even need to allocate the streams - streams are implicitly
> allocated on first use:
>
> "Stream resources that are not allocated for the exclusive use of any
> namespace are available NVM subsystem stream resources as reported in
> NVM Subsystem Streams Available (NSSA) and may be used by any namespace
> that has the Streams Directive enabled and has not been allocated
> exclusive stream resources in response to an Allocate Resources
> operation."
>
> so the only thing you need to do is to enable streams on the namespace,
> or just for whole whole controller using nsid=0xffffffff
I have two samples here, and I just tested, and both of them want it
assigned with nsid=0xffffffff or they will fail the writes... So I'd say
we're better off ensuring we do allocate them globally.
For the IO limits, are you sure you want those exposed? SWS is a write
setting; exposing it as io_min may not make sense. For discard,
SWS * SGS * block_size ends up being huge, but I guess at least that
setting does make sense.
--
Jens Axboe
* Re: [PATCH 11/11] nvme: add support for streams and directives
2017-06-17 15:11 ` Jens Axboe
@ 2017-06-17 15:43 ` Jens Axboe
2017-06-19 6:25 ` Christoph Hellwig
1 sibling, 0 replies; 37+ messages in thread
From: Jens Axboe @ 2017-06-17 15:43 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linux-fsdevel, linux-block, adilger, martin.petersen
On 06/17/2017 09:11 AM, Jens Axboe wrote:
> On 06/17/2017 09:03 AM, Christoph Hellwig wrote:
>> On Sat, Jun 17, 2017 at 08:20:01AM -0600, Jens Axboe wrote:
>>> We can certainly go that route. So you'd be fine with allocating 4
>>> streams controller wide by default, and dump the lazy alloc? We can make
>>> this depend on the streams module parameter, so people could turn it
>>> off, if needed.
>>
>> We don't even need to allocate the streams - streams are implicitly
>> allocated on first use:
>>
>> "Stream resources that are not allocated for the exclusive use of any
>> namespace are available NVM subsystem stream resources as reported in
>> NVM Subsystem Streams Available (NSSA) and may be used by any namespace
>> that has the Streams Directive enabled and has not been allocated
>> exclusive stream resources in response to an Allocate Resources
>> operation."
>>
>> so the only thing you need to do is to enable streams on the namespace,
>> or just for the whole controller using nsid=0xffffffff
>
> I have two samples here, and I just tested, and both of them want it
> assigned with nsid=0xffffffff or they will fail the writes... So I'd say
> we're better off ensuring we do allocate them globally.
>
> For the IO limits, are you sure you want those exposed? SWS is a write
> setting, exposing it as io_min may not make sense. For discard,
> SWS * SGS * block_size ends up being huge, but I guess at least that
> setting does make sense.
http://git.kernel.dk/cgit/linux-block/commit/?h=write-stream&id=7c36ee9b7da9fdd26a9663870f49a30ca34d36af
I think that should cover all the comments.
--
Jens Axboe
* Re: [PATCH 11/11] nvme: add support for streams and directives
2017-06-17 15:11 ` Jens Axboe
2017-06-17 15:43 ` Jens Axboe
@ 2017-06-19 6:25 ` Christoph Hellwig
2017-06-19 14:31 ` Jens Axboe
1 sibling, 1 reply; 37+ messages in thread
From: Christoph Hellwig @ 2017-06-19 6:25 UTC (permalink / raw)
To: Jens Axboe
Cc: Christoph Hellwig, linux-fsdevel, linux-block, adilger,
martin.petersen
On Sat, Jun 17, 2017 at 09:11:30AM -0600, Jens Axboe wrote:
> I have two samples here, and I just tested, and both of them want it
> assigned with nsid=0xffffffff or they will fail the writes... So I'd say
> we're better off ensuring we do allocate them globally.
That's clearly against the spec. I'd say go to your vendor and get
a refund, as we Linux folks (Martin and I) fought for this so that
we would not have to do the explicit allocations.
Another quote from the spec:
"Streams are opened by the controller when the host issues a write
command that specifies a stream identifier that is not currently open.
While a stream is open the controller maintains context for that stream
(e.g., buffers for associated data). The host may determine the streams
that are open using the Get Status operation."
And I think this is very important - otherwise we either need to
allocate the streams beforehand as your earlier patches did (and take
away the resources from the 99% of the users not using write life
hints), or we need the lazy allocation scheme. And for that to be
efficient it probably needs to be lazy per-stream allocation. That's
why we got the implicit open in, after all.
* Re: [PATCH 11/11] nvme: add support for streams and directives
2017-06-19 6:25 ` Christoph Hellwig
@ 2017-06-19 14:31 ` Jens Axboe
2017-06-19 14:53 ` Jens Axboe
0 siblings, 1 reply; 37+ messages in thread
From: Jens Axboe @ 2017-06-19 14:31 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linux-fsdevel, linux-block, adilger, martin.petersen
On 06/19/2017 12:25 AM, Christoph Hellwig wrote:
> On Sat, Jun 17, 2017 at 09:11:30AM -0600, Jens Axboe wrote:
>> I have two samples here, and I just tested, and both of them want it
>> assigned with nsid=0xffffffff or they will fail the writes... So I'd say
>> we're better off ensuring we do allocate them globally.
>
> That's clearly against the spec. I'd say go to your vendor and get
> a refund, as we Linux folks (Martin and I) fought for this so that
> we would not have to do the explicit allocations.
>
> Another quote from the spec:
>
> "Streams are opened by the controller when the host issues a write
> command that specifies a stream identifier that is not currently open.
> While a stream is open the controller maintains context for that stream
> (e.g., buffers for associated data). The host may determine the streams
> that are open using the Get Status operation."
>
> And I think this is very important - otherwise we either need to
> allocate the streams beforehand as your earlier patches did (and take
> away the resources from the 99% of the users not using write life
> hints), or we need the lazy allocation scheme. And for that to be
> efficient it probably needs to be lazy per-stream allocation. That's
> why we got the implicit open in, after all.
These are just samples, so no refund possible! As you might remember,
I was pretty adamant on not wanting explicit open/close as well, back
in those days. I'll check with the vendor and see what's going on.
--
Jens Axboe
* Re: [PATCH 11/11] nvme: add support for streams and directives
2017-06-19 14:31 ` Jens Axboe
@ 2017-06-19 14:53 ` Jens Axboe
2017-06-19 18:53 ` Christoph Hellwig
0 siblings, 1 reply; 37+ messages in thread
From: Jens Axboe @ 2017-06-19 14:53 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linux-fsdevel, linux-block, adilger, martin.petersen
On 06/19/2017 08:31 AM, Jens Axboe wrote:
> On 06/19/2017 12:25 AM, Christoph Hellwig wrote:
>> On Sat, Jun 17, 2017 at 09:11:30AM -0600, Jens Axboe wrote:
>>> I have two samples here, and I just tested, and both of them want it
>>> assigned with nsid=0xffffffff or they will fail the writes... So I'd say
>>> we're better off ensuring we do allocate them globally.
>>
>> That's clearly against the spec. I'd say go to your vendor and get
>> a refund, as we Linux folks (Martin and I) fought for this so that
>> we would not have to do the explicit allocations.
>>
>> Another quote from the spec:
>>
>> "Streams are opened by the controller when the host issues a write
>> command that specifies a stream identifier that is not currently open.
>> While a stream is open the controller maintains context for that stream
>> (e.g., buffers for associated data). The host may determine the streams
>> that are open using the Get Status operation."
>>
>> And I think this is very important - otherwise we either need to
>> allocate the streams beforehand as your earlier patches did (and take
>> away the resources from the 99% of the users not using write life
>> hints), or we need the lazy allocation scheme. And for that to be
>> efficient it probably needs to be lazy per-stream allocation. That's
>> why we got the implicit open in, after all.
>
> These are just samples, so no refund possible! As you might remember,
> I was pretty adamant on not wanting explicit open/close as well, back
> in those days. I'll check with the vendor and see what's going on.
Looking at it a bit more closely - there's a difference between
assigning X number of streams (allocating) for use by the subsystem or
per-ns, and having to manually open them. So I don't necessarily think
there's a problem here, either for us or for the device.
If I load the driver as it currently stands, and get streams params
after load:
Number of Open Streams in NVM subsystem (NOSNS): 0
Allocated Stream Resources (ASR): 4
Number of Open Streams in Namespace (NOSN): 0
4 streams allocated, 0 currently open. Then I generate a single write to
some stream ID:
Number of Open Streams in NVM subsystem (NOSNS): 1
Allocated Stream Resources (ASR): 4
Number of Open Streams in Namespace (NOSN): 1
and we now see a single stream actually open, implicitly, by the
device. Fire off another single stream write, this time to another
stream, and we get:
Number of Open Streams in NVM subsystem (NOSNS): 2
Allocated Stream Resources (ASR): 4
Number of Open Streams in Namespace (NOSN): 2
and stream status correctly reflects that I did a WRITE_HINT_SHORT and
WRITE_HINT_LONG write:
Open Stream Count : 2
Stream Identifier 000001 : 1
Stream Identifier 000002 : 3
Awaiting clarification from the vendor on what their position/view of this
is. Reading the spec, I do agree that it leans towards only needing
allocation for a specific name space, but it doesn't explicitly say that
you can use any of the available streams, without allocation, if they
haven't been assigned to a specific name space. I would interpret it
that way, though.
--
Jens Axboe
* Re: [PATCH 11/11] nvme: add support for streams and directives
2017-06-19 14:53 ` Jens Axboe
@ 2017-06-19 18:53 ` Christoph Hellwig
2017-06-19 19:03 ` Jens Axboe
0 siblings, 1 reply; 37+ messages in thread
From: Christoph Hellwig @ 2017-06-19 18:53 UTC (permalink / raw)
To: Jens Axboe
Cc: Christoph Hellwig, linux-fsdevel, linux-block, adilger,
martin.petersen
On Mon, Jun 19, 2017 at 08:53:08AM -0600, Jens Axboe wrote:
> Looking at it a bit more closely - there's a difference between
> assigning X number of streams (allocating) for use by the subsystem or
> per-ns, and having to manually open them. So I don't necessarily think
> there's a problem here, either for us or for the device.
As far as I can tell, the Allocate Resources operation is only supposed
to be used for an individual namespace:
Section 9.3:
"Stream resources may be allocated for the exclusive use of a specified
namespace associated with a particular Host Identifier using the Allocate
Resources operation. Stream resources that are not allocated for the
exclusive use of any namespace are available NVM subsystem stream
resources as reported in NVM Subsystem Streams Available (NSSA) and may
be used by any namespace that has the Streams Directive enabled and has
not been allocated exclusive stream resources in response to an
Allocate Resources operation"
Section 9.3.1.3:
"The Allocate Resources operation indicates the number of streams that the host requests for the exclusive use of the specified namespace."
I think this is pretty clear, but if not it would be good if you could
bring it into the working group so that we can clarify it further.
> Awaiting clarification from the vendor on what their position/view of this
> is. Reading the spec, I do agree that it leans towards only needing
> allocation for a specific name space, but it doesn't explicitly say that
> you can use any of the available streams, without allocation, if they
> haven't been assigned to a specific name space. I would interpret it
> that way, though.
Let's get some text into an ECN either way to clarify it.
* Re: [PATCH 11/11] nvme: add support for streams and directives
2017-06-19 18:53 ` Christoph Hellwig
@ 2017-06-19 19:03 ` Jens Axboe
0 siblings, 0 replies; 37+ messages in thread
From: Jens Axboe @ 2017-06-19 19:03 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linux-fsdevel, linux-block, adilger, martin.petersen
On 06/19/2017 12:53 PM, Christoph Hellwig wrote:
> On Mon, Jun 19, 2017 at 08:53:08AM -0600, Jens Axboe wrote:
>> Looking at it a bit more closely - there's a difference between
>> assigning X number of streams (allocating) for use by the subsystem or
>> per-ns, and having to manually open them. So I don't necessarily think
>> there's a problem here, either for us or for the device.
>
> As far as I can tell, the Allocate Resources operation is only supposed
> to be used for an individual namespace:
>
> Section 9.3:
>
> "Stream resources may be allocated for the exclusive use of a specified
> namespace associated with a particular Host Identifier using the Allocate
> Resources operation. Stream resources that are not allocated for the
> exclusive use of any namespace are available NVM subsystem stream
> resources as reported in NVM Subsystem Streams Available (NSSA) and may
> be used by any namespace that has the Streams Directive enabled and has
> not been allocated exclusive stream resources in response to an
> Allocate Resources operation"
>
> Section 9.3.1.3:
>
> "The Allocate Resources operation indicates the number of streams that the host requests for the exclusive use of the specified namespace."
>
> I think this is pretty clear, but if not it would be good if you could
> bring it into the working group so that we can clarify it further.
I'll wait until I hear back from the manufacturer; it could just be an
oversight on their end. But yes, the language should be clarified.
>> Awaiting clarification from the vendor on what their position/view of this
>> is. Reading the spec, I do agree that it leans towards only needing
>> allocation for a specific name space, but it doesn't explicitly say that
>> you can use any of the available streams, without allocation, if they
>> haven't been assigned to a specific name space. I would interpret it
>> that way, though.
>
> Let's get some text into an ECN either way to clarify it.
Thanks
--
Jens Axboe
* [PATCH 11/11] nvme: add support for streams and directives
2017-06-17 19:59 [PATCHSET v7] Add support for write life time hints Jens Axboe
@ 2017-06-17 19:59 ` Jens Axboe
2017-06-19 6:35 ` Christoph Hellwig
0 siblings, 1 reply; 37+ messages in thread
From: Jens Axboe @ 2017-06-17 19:59 UTC (permalink / raw)
To: linux-fsdevel, linux-block; +Cc: adilger, hch, martin.petersen, Jens Axboe
This adds support for Directives in NVMe, in particular for the Streams
directive. Support for Directives is a new feature in NVMe 1.3. It
allows a user to pass in information about where to store the data, so
that the device can do so most efficiently. If an application is
managing and writing data with different life times, mixing data with
different retention times onto the same locations on flash can cause
write amplification to grow. This, in turn, will reduce performance and
life time of the device.
We default to allocating 4 streams, controller wide, so we can use them
on all name spaces. This is configurable with the 'streams' module
parameter. If a write stream is set on a write, flag it as such before
sending it to the device.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
drivers/nvme/host/core.c | 170 +++++++++++++++++++++++++++++++++++++++++++++--
drivers/nvme/host/nvme.h | 4 ++
include/linux/nvme.h | 48 +++++++++++++
3 files changed, 217 insertions(+), 5 deletions(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 903d5813023a..637e9514b406 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -65,6 +65,10 @@ static bool force_apst;
module_param(force_apst, bool, 0644);
MODULE_PARM_DESC(force_apst, "allow APST for newly enumerated devices even if quirked off");
+static char streams = 4;
+module_param(streams, byte, 0644);
+MODULE_PARM_DESC(streams, "if available, allocate this many streams");
+
static LIST_HEAD(nvme_ctrl_list);
static DEFINE_SPINLOCK(dev_list_lock);
@@ -330,10 +334,125 @@ static inline int nvme_setup_discard(struct nvme_ns *ns, struct request *req,
return BLK_MQ_RQ_QUEUE_OK;
}
+/*
+ * Returns the number of streams allocated for use by the controller,
+ * or -1 on error.
+ */
+static int nvme_streams_allocate(struct nvme_ctrl *ctrl, unsigned int nstreams)
+{
+ struct nvme_command c;
+ union nvme_result res;
+ int ret;
+
+ memset(&c, 0, sizeof(c));
+
+ c.directive.opcode = nvme_admin_directive_recv;
+ c.directive.nsid = cpu_to_le32(0xffffffff);
+ c.directive.doper = NVME_DIR_RCV_ST_OP_RESOURCE;
+ c.directive.dtype = NVME_DIR_STREAMS;
+ c.directive.endir = nstreams;
+
+ ret = __nvme_submit_sync_cmd(ctrl->admin_q, &c, &res, NULL, 0, 0,
+ NVME_QID_ANY, 0, 0);
+ if (ret)
+ return -1;
+
+ return le32_to_cpu(res.u32) & 0xffff;
+}
+
+static int nvme_enable_streams(struct nvme_ctrl *ctrl)
+{
+ struct nvme_command c;
+
+ memset(&c, 0, sizeof(c));
+
+ c.directive.opcode = nvme_admin_directive_send;
+ c.directive.nsid = cpu_to_le32(0xffffffff);
+ c.directive.doper = NVME_DIR_SND_ID_OP_ENABLE;
+ c.directive.dtype = NVME_DIR_IDENTIFY;
+ c.directive.tdtype = NVME_DIR_STREAMS;
+ c.directive.endir = NVME_DIR_ENDIR;
+
+ return nvme_submit_sync_cmd(ctrl->admin_q, &c, NULL, 0);
+}
+
+static int nvme_get_stream_params(struct nvme_ctrl *ctrl,
+ struct streams_directive_params *s, u32 nsid)
+{
+ struct nvme_command c;
+
+ memset(&c, 0, sizeof(c));
+ memset(s, 0, sizeof(*s));
+
+ c.directive.opcode = nvme_admin_directive_recv;
+ c.directive.nsid = cpu_to_le32(nsid);
+ c.directive.numd = sizeof(*s);
+ c.directive.doper = NVME_DIR_RCV_ST_OP_PARAM;
+ c.directive.dtype = NVME_DIR_STREAMS;
+
+ return nvme_submit_sync_cmd(ctrl->admin_q, &c, s, sizeof(*s));
+}
+
+static int nvme_setup_directives(struct nvme_ctrl *ctrl)
+{
+ struct streams_directive_params s;
+ int ret;
+
+ if (!(ctrl->oacs & NVME_CTRL_OACS_DIRECTIVES))
+ return 0;
+ if (!streams)
+ return 0;
+
+ ret = nvme_enable_streams(ctrl);
+ if (ret)
+ return ret;
+
+ ret = nvme_get_stream_params(ctrl, &s, 0xffffffff);
+ if (ret)
+ return ret;
+
+ ctrl->nssa = le16_to_cpu(s.nssa);
+
+ ret = nvme_streams_allocate(ctrl, min_t(unsigned, streams, ctrl->nssa));
+ if (ret < 0)
+ return ret;
+
+ ctrl->nr_streams = ret;
+ dev_info(ctrl->device, "successfully enabled %d streams\n", ret);
+ return 0;
+}
+
+/*
+ * Check if 'req' has a write hint associated with it. If it does, assign
+ * a valid stream to the write. Streams are set up at controller probe
+ * time, so by the time we see writes here the mapping is in place.
+ */
+static void nvme_assign_write_stream(struct nvme_ctrl *ctrl,
+ struct request *req, u16 *control,
+ u32 *dsmgmt)
+{
+ enum rw_hint streamid;
+
+ streamid = (req->cmd_flags & REQ_WRITE_LIFE_MASK)
+ >> __REQ_WRITE_HINT_SHIFT;
+ if (streamid == WRITE_LIFE_NONE)
+ return;
+
+ /* for now just round-robin, do something more clever later */
+ if (streamid > ctrl->nr_streams)
+ streamid = (streamid % ctrl->nr_streams) + 1;
+
+ if (streamid < ARRAY_SIZE(req->q->write_hints))
+ req->q->write_hints[streamid] += blk_rq_bytes(req) >> 9;
+
+ *control |= NVME_RW_DTYPE_STREAMS;
+ *dsmgmt |= streamid << 16;
+}
static inline void nvme_setup_rw(struct nvme_ns *ns, struct request *req,
struct nvme_command *cmnd)
{
+ struct nvme_ctrl *ctrl = ns->ctrl;
u16 control = 0;
u32 dsmgmt = 0;
@@ -351,6 +470,9 @@ static inline void nvme_setup_rw(struct nvme_ns *ns, struct request *req,
cmnd->rw.slba = cpu_to_le64(nvme_block_nr(ns, blk_rq_pos(req)));
cmnd->rw.length = cpu_to_le16((blk_rq_bytes(req) >> ns->lba_shift) - 1);
+ if (req_op(req) == REQ_OP_WRITE && ctrl->nr_streams)
+ nvme_assign_write_stream(ctrl, req, &control, &dsmgmt);
+
if (ns->ms) {
switch (ns->pi_type) {
case NVME_NS_DPS_PI_TYPE3:
@@ -985,14 +1107,23 @@ static void nvme_init_integrity(struct nvme_ns *ns)
static void nvme_config_discard(struct nvme_ns *ns)
{
- struct nvme_ctrl *ctrl = ns->ctrl;
u32 logical_block_size = queue_logical_block_size(ns->queue);
+ struct nvme_ctrl *ctrl = ns->ctrl;
BUILD_BUG_ON(PAGE_SIZE / sizeof(struct nvme_dsm_range) <
NVME_DSM_MAX_RANGES);
- ns->queue->limits.discard_alignment = logical_block_size;
- ns->queue->limits.discard_granularity = logical_block_size;
+ if (ctrl->nr_streams && ns->sws && ns->sgs) {
+ unsigned int sz = logical_block_size * ns->sws * ns->sgs;
+
+ ns->queue->limits.discard_alignment = sz;
+ ns->queue->limits.discard_granularity = sz;
+ } else {
+ u32 logical_block_size = queue_logical_block_size(ns->queue);
+
+ ns->queue->limits.discard_alignment = logical_block_size;
+ ns->queue->limits.discard_granularity = logical_block_size;
+ }
blk_queue_max_discard_sectors(ns->queue, UINT_MAX);
blk_queue_max_discard_segments(ns->queue, NVME_DSM_MAX_RANGES);
queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, ns->queue);
@@ -1024,6 +1155,7 @@ static int nvme_revalidate_ns(struct nvme_ns *ns, struct nvme_id_ns **id)
static void __nvme_revalidate_disk(struct gendisk *disk, struct nvme_id_ns *id)
{
struct nvme_ns *ns = disk->private_data;
+ struct nvme_ctrl *ctrl = ns->ctrl;
u16 bs;
/*
@@ -1037,7 +1169,7 @@ static void __nvme_revalidate_disk(struct gendisk *disk, struct nvme_id_ns *id)
blk_mq_freeze_queue(disk->queue);
- if (ns->ctrl->ops->flags & NVME_F_METADATA_SUPPORTED)
+ if (ctrl->ops->flags & NVME_F_METADATA_SUPPORTED)
nvme_prep_integrity(disk, id, bs);
blk_queue_logical_block_size(ns->queue, bs);
if (ns->ms && !blk_get_integrity(disk) && !ns->ext)
@@ -1047,7 +1179,7 @@ static void __nvme_revalidate_disk(struct gendisk *disk, struct nvme_id_ns *id)
else
set_capacity(disk, le64_to_cpup(&id->nsze) << (ns->lba_shift - 9));
- if (ns->ctrl->oncs & NVME_CTRL_ONCS_DSM)
+ if (ctrl->oncs & NVME_CTRL_ONCS_DSM)
nvme_config_discard(ns);
blk_mq_unfreeze_queue(disk->queue);
}
@@ -1650,6 +1782,7 @@ int nvme_init_identify(struct nvme_ctrl *ctrl)
dev_pm_qos_hide_latency_tolerance(ctrl->device);
nvme_configure_apst(ctrl);
+ nvme_setup_directives(ctrl);
ctrl->identified = true;
@@ -2019,6 +2152,32 @@ static struct nvme_ns *nvme_find_get_ns(struct nvme_ctrl *ctrl, unsigned nsid)
return ret;
}
+static int nvme_setup_streams_ns(struct nvme_ctrl *ctrl, struct nvme_ns *ns)
+{
+ struct streams_directive_params s;
+ int ret;
+
+ if (!ctrl->nr_streams)
+ return 0;
+
+ ret = nvme_get_stream_params(ctrl, &s, ns->ns_id);
+ if (ret)
+ return ret;
+
+ ns->sws = le32_to_cpu(s.sws);
+ ns->sgs = le16_to_cpu(s.sgs);
+
+ if (ns->sws) {
+ unsigned int bs = 1 << ns->lba_shift;
+
+ blk_queue_io_min(ns->queue, bs * ns->sws);
+ if (ns->sgs)
+ blk_queue_io_opt(ns->queue, bs * ns->sws * ns->sgs);
+ }
+
+ return 0;
+}
+
static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
{
struct nvme_ns *ns;
@@ -2048,6 +2207,7 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
blk_queue_logical_block_size(ns->queue, 1 << ns->lba_shift);
nvme_set_queue_limits(ctrl, ns->queue);
+ nvme_setup_streams_ns(ctrl, ns);
sprintf(disk_name, "nvme%dn%d", ctrl->instance, ns->instance);
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 9d6a070d4391..533f86acd961 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -147,6 +147,8 @@ struct nvme_ctrl {
u16 oncs;
u16 vid;
u16 oacs;
+ u16 nssa;
+ u16 nr_streams;
atomic_t abort_limit;
u8 event_limit;
u8 vwc;
@@ -194,6 +196,8 @@ struct nvme_ns {
unsigned ns_id;
int lba_shift;
u16 ms;
+ u16 sgs;
+ u32 sws;
bool ext;
u8 pi_type;
unsigned long flags;
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index b625bacf37ef..8b2f5b140134 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -245,6 +245,7 @@ enum {
NVME_CTRL_ONCS_WRITE_ZEROES = 1 << 3,
NVME_CTRL_VWC_PRESENT = 1 << 0,
NVME_CTRL_OACS_SEC_SUPP = 1 << 0,
+ NVME_CTRL_OACS_DIRECTIVES = 1 << 5,
NVME_CTRL_OACS_DBBUF_SUPP = 1 << 7,
};
@@ -295,6 +296,19 @@ enum {
};
enum {
+ NVME_DIR_IDENTIFY = 0x00,
+ NVME_DIR_STREAMS = 0x01,
+ NVME_DIR_SND_ID_OP_ENABLE = 0x01,
+ NVME_DIR_SND_ST_OP_REL_ID = 0x01,
+ NVME_DIR_SND_ST_OP_REL_RSC = 0x02,
+ NVME_DIR_RCV_ID_OP_PARAM = 0x01,
+ NVME_DIR_RCV_ST_OP_PARAM = 0x01,
+ NVME_DIR_RCV_ST_OP_STATUS = 0x02,
+ NVME_DIR_RCV_ST_OP_RESOURCE = 0x03,
+ NVME_DIR_ENDIR = 0x01,
+};
+
+enum {
NVME_NS_FEAT_THIN = 1 << 0,
NVME_NS_FLBAS_LBA_MASK = 0xf,
NVME_NS_FLBAS_META_EXT = 0x10,
@@ -535,6 +549,7 @@ enum {
NVME_RW_PRINFO_PRCHK_APP = 1 << 11,
NVME_RW_PRINFO_PRCHK_GUARD = 1 << 12,
NVME_RW_PRINFO_PRACT = 1 << 13,
+ NVME_RW_DTYPE_STREAMS = 1 << 4,
};
struct nvme_dsm_cmd {
@@ -604,6 +619,8 @@ enum nvme_admin_opcode {
nvme_admin_download_fw = 0x11,
nvme_admin_ns_attach = 0x15,
nvme_admin_keep_alive = 0x18,
+ nvme_admin_directive_send = 0x19,
+ nvme_admin_directive_recv = 0x1a,
nvme_admin_dbbuf = 0x7C,
nvme_admin_format_nvm = 0x80,
nvme_admin_security_send = 0x81,
@@ -756,6 +773,24 @@ struct nvme_get_log_page_command {
__u32 rsvd14[2];
};
+struct nvme_directive_cmd {
+ __u8 opcode;
+ __u8 flags;
+ __u16 command_id;
+ __le32 nsid;
+ __u64 rsvd2[2];
+ union nvme_data_ptr dptr;
+ __le32 numd;
+ __u8 doper;
+ __u8 dtype;
+ __le16 dspec;
+ __u8 endir;
+ __u8 tdtype;
+ __u16 rsvd15;
+
+ __u32 rsvd16[3];
+};
+
/*
* Fabrics subcommands.
*/
@@ -886,6 +921,18 @@ struct nvme_dbbuf {
__u32 rsvd12[6];
};
+struct streams_directive_params {
+ __u16 msl;
+ __u16 nssa;
+ __u16 nsso;
+ __u8 rsvd[10];
+ __u32 sws;
+ __u16 sgs;
+ __u16 nsa;
+ __u16 nso;
+ __u8 rsvd2[6];
+};
+
struct nvme_command {
union {
struct nvme_common_command common;
@@ -906,6 +953,7 @@ struct nvme_command {
struct nvmf_property_set_command prop_set;
struct nvmf_property_get_command prop_get;
struct nvme_dbbuf dbbuf;
+ struct nvme_directive_cmd directive;
};
};
--
2.7.4
* Re: [PATCH 11/11] nvme: add support for streams and directives
2017-06-17 19:59 ` [PATCH 11/11] nvme: add support for streams and directives Jens Axboe
@ 2017-06-19 6:35 ` Christoph Hellwig
2017-06-19 15:04 ` Jens Axboe
0 siblings, 1 reply; 37+ messages in thread
From: Christoph Hellwig @ 2017-06-19 6:35 UTC (permalink / raw)
To: Jens Axboe; +Cc: linux-fsdevel, linux-block, adilger, hch, martin.petersen
Can you add linux-nvme for the next repost?
As said before I think we should rely on implicit streams allocation,
as that will make the whole patch a lot simpler, and it solves the issue
that your current patch will take away your 4 streams from the general
pool on every controller that supports streams, which isn't optimal.
> + streamid = (req->cmd_flags & REQ_WRITE_LIFE_MASK)
> + >> __REQ_WRITE_HINT_SHIFT;
Can we add a helper to convert the REQ_* flags back to the hint next
to where we have the helper to do the reverse one?
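Something like the below, presumably (a sketch only; the helper name is
made up, the mask and shift are the ones this series already defines):

static inline enum rw_hint op_write_hint(unsigned int op_flags)
{
	return (op_flags & REQ_WRITE_LIFE_MASK) >> __REQ_WRITE_HINT_SHIFT;
}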
> + if (streamid == WRITE_LIFE_NONE)
> + return;
> +
> + /* for now just round-robin, do something more clever later */
> + if (streamid > ctrl->nr_streams)
> + streamid = (streamid % ctrl->nr_streams) + 1;
I thought you were going to fix that to do more intelligent collapsing?
> - ns->queue->limits.discard_granularity = logical_block_size;
> + if (ctrl->nr_streams && ns->sws && ns->sgs) {
> + unsigned int sz = logical_block_size * ns->sws * ns->sgs;
> +
> + ns->queue->limits.discard_alignment = sz;
I don't think this is correct:
"Data that is aligned to and in multiples of the Stream Write Size (SWS)
provides optimal performance of the write commands to the controller.
The Stream Granularity Size indicates the size of the media that is prepared
as a unit for future allocation for write commands and is a multiple of the
Stream Write Size. The controller may allocate and group together a stream
in Stream Granularity Size (SGS) units. Refer to Figure 293."
So I think the ns->sws should go away.
> + ns->queue->limits.discard_granularity = sz;
> + } else {
> + u32 logical_block_size = queue_logical_block_size(ns->queue);
I think we already have a logical_block_size with the same value in
scope here.
> +
> + ns->sws = le32_to_cpu(s.sws);
> + ns->sgs = le16_to_cpu(s.sgs);
> +
> + if (ns->sws) {
> + unsigned int bs = 1 << ns->lba_shift;
> +
> + blk_queue_io_min(ns->queue, bs * ns->sws);
> + if (ns->sgs)
> + blk_queue_io_opt(ns->queue, bs * ns->sws * ns->sgs);
drop the multiplication with ns->sws here as well.
* Re: [PATCH 11/11] nvme: add support for streams and directives
2017-06-19 6:35 ` Christoph Hellwig
@ 2017-06-19 15:04 ` Jens Axboe
0 siblings, 0 replies; 37+ messages in thread
From: Jens Axboe @ 2017-06-19 15:04 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linux-fsdevel, linux-block, adilger, martin.petersen
On 06/19/2017 12:35 AM, Christoph Hellwig wrote:
> Can you add linux-nvme for the next repost?
>
> As said before I think we should rely on implicit streams allocation,
> as that will make the whole patch a lot simpler, and it solves the issue
> that your current patch will take away your 4 streams from the general
> pool on every controller that supports streams, which isn't optimal.
The only thing it'll change in the patch is the removal of
nvme_streams_allocate(), which is 20 lines of code. So I don't
think it'll simplify things a lot. The patch is already very simple.
But if we don't need to allocate the streams, then of course it should
just go.
>> + streamid = (req->cmd_flags & REQ_WRITE_LIFE_MASK)
>> + >> __REQ_WRITE_HINT_SHIFT;
>
> Can we add a helper to convert the REQ_* flags back to the hint next
> to where we have the helper to do the reverse one?
OK, will do.
>> + if (streamid == WRITE_LIFE_NONE)
>> + return;
>> +
>> + /* for now just round-robin, do something more clever later */
>> + if (streamid > ctrl->nr_streams)
>> + streamid = (streamid % ctrl->nr_streams) + 1;
>
> I thought you were going to fix that to do more intelligent collapsing?
Sure, if you want me to do that now, I can. I propose:
- With 4 streams, direct mapping.
- With 3 streams, collapse two longest life times.
- With 2 streams, collapse short+medium, and long+extreme.
- With 1 stream, don't use streams.
We could potentially still use streams with just 1 stream, since we
have the default of no stream as well. But I don't think it makes
sense at that point.
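As a sketch of that mapping (nvme_collapse_stream() is a made-up name;
it assumes WRITE_LIFE_SHORT..WRITE_LIFE_EXTREME map to hints 1..4, and
returning 0 means no stream):

static unsigned int nvme_collapse_stream(unsigned int hint,
					 unsigned int nr_streams)
{
	switch (nr_streams) {
	case 4: /* direct mapping */
		return hint;
	case 3: /* collapse the two longest life times into stream 3 */
		return hint < 3 ? hint : 3;
	case 2: /* short+medium -> 1, long+extreme -> 2 */
		return (hint + 1) / 2;
	default: /* with a single stream, don't use streams at all */
		return 0;
	}
}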
>
>> - ns->queue->limits.discard_granularity = logical_block_size;
>> + if (ctrl->nr_streams && ns->sws && ns->sgs) {
>> + unsigned int sz = logical_block_size * ns->sws * ns->sgs;
>> +
>> + ns->queue->limits.discard_alignment = sz;
>
> I don't think this is correct:
>
> "Data that is aligned to and in multiples of the Stream Write Size (SWS)
> provides optimal performance of the write commands to the controller.
> The Stream Granularity Size indicates the size of the media that is prepared
> as a unit for future allocation for write commands and is a multiple of the
> Stream Write Size. The controller may allocate and group together a stream
> in Stream Granularity Size (SGS) units. Refer to Figure 293."
>
> So I think the ns->sws should go away.
The SGS value is in units of SWS, however.
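(Illustrative numbers: with a 4KiB logical block, SWS = 8 and SGS = 2,
the granularity unit is 4KiB * 8 * 2 = 64KiB, i.e. bs * sws * sgs -
precisely because SGS counts SWS units rather than logical blocks.)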
>> + ns->queue->limits.discard_granularity = sz;
>> + } else {
>> + u32 logical_block_size = queue_logical_block_size(ns->queue);
>
> I think we already have a logical_block_size with the same value in
> scope here.
Oops yes.
>> + ns->sws = le32_to_cpu(s.sws);
>> + ns->sgs = le16_to_cpu(s.sgs);
>> +
>> + if (ns->sws) {
>> + unsigned int bs = 1 << ns->lba_shift;
>> +
>> + blk_queue_io_min(ns->queue, bs * ns->sws);
>> + if (ns->sgs)
>> + blk_queue_io_opt(ns->queue, bs * ns->sws * ns->sgs);
>
> drop the multiplication with ns->sws here as well.
See above.
--
Jens Axboe
* [PATCH 11/11] nvme: add support for streams and directives
2017-06-15 3:45 [PATCHSET v4] Add support for write life time hints Jens Axboe
@ 2017-06-15 3:45 ` Jens Axboe
0 siblings, 0 replies; 37+ messages in thread
From: Jens Axboe @ 2017-06-15 3:45 UTC (permalink / raw)
To: linux-fsdevel, linux-block; +Cc: adilger, hch, martin.petersen, Jens Axboe
This adds support for Directives in NVMe, in particular for the Streams
directive. Support for Directives is a new feature in NVMe 1.3. It
allows a user to pass in information about where to store the data,
so that the device can do so most efficiently. If an application is
managing and writing data with different life times, mixing data with
different retention times onto the same locations on flash can cause
write amplification to grow. This, in turn, will reduce performance and
life time of the device.
We default to allocating 4 streams per name space, but it is
configurable with the 'streams_per_ns' module option. If a write stream
is set on a write, flag it as such before sending it to the device. The
streams are allocated lazily - if we get a write request with a life
time hint, we allocate streams in the background and use them once that
is done.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
drivers/nvme/host/core.c | 175 +++++++++++++++++++++++++++++++++++++++++++++++
drivers/nvme/host/nvme.h | 5 ++
include/linux/nvme.h | 48 +++++++++++++
3 files changed, 228 insertions(+)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 903d5813023a..30a6473b68cc 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -65,6 +65,10 @@ static bool force_apst;
module_param(force_apst, bool, 0644);
MODULE_PARM_DESC(force_apst, "allow APST for newly enumerated devices even if quirked off");
+static char streams_per_ns = 4;
+module_param(streams_per_ns, byte, 0644);
+MODULE_PARM_DESC(streams_per_ns, "if available, allocate this many streams per NS");
+
static LIST_HEAD(nvme_ctrl_list);
static DEFINE_SPINLOCK(dev_list_lock);
@@ -331,6 +335,151 @@ static inline int nvme_setup_discard(struct nvme_ns *ns, struct request *req,
return BLK_MQ_RQ_QUEUE_OK;
}
+static int nvme_enable_streams(struct nvme_ctrl *ctrl)
+{
+ struct nvme_command c;
+
+ memset(&c, 0, sizeof(c));
+
+ c.directive.opcode = nvme_admin_directive_send;
+ c.directive.nsid = cpu_to_le32(0xffffffff);
+ c.directive.doper = NVME_DIR_SND_ID_OP_ENABLE;
+ c.directive.dtype = NVME_DIR_IDENTIFY;
+ c.directive.tdtype = NVME_DIR_STREAMS;
+ c.directive.endir = NVME_DIR_ENDIR;
+
+ return nvme_submit_sync_cmd(ctrl->admin_q, &c, NULL, 0);
+}
+
+static int nvme_probe_directives(struct nvme_ctrl *ctrl)
+{
+ struct streams_directive_params s;
+ struct nvme_command c;
+ int ret;
+
+ if (!(ctrl->oacs & NVME_CTRL_OACS_DIRECTIVES))
+ return 0;
+
+ ret = nvme_enable_streams(ctrl);
+ if (ret)
+ return ret;
+
+ memset(&c, 0, sizeof(c));
+ memset(&s, 0, sizeof(s));
+
+ c.directive.opcode = nvme_admin_directive_recv;
+ c.directive.nsid = cpu_to_le32(0xffffffff);
+ c.directive.numd = sizeof(s);
+ c.directive.doper = NVME_DIR_RCV_ST_OP_PARAM;
+ c.directive.dtype = NVME_DIR_STREAMS;
+
+ ret = nvme_submit_sync_cmd(ctrl->admin_q, &c, &s, sizeof(s));
+ if (ret)
+ return ret;
+
+ ctrl->nssa = le16_to_cpu(s.nssa);
+ return 0;
+}
+
+/*
+ * Returns number of streams allocated for use by this ns, or -1 on error.
+ */
+static int nvme_streams_allocate(struct nvme_ns *ns, unsigned int streams)
+{
+ struct nvme_command c;
+ union nvme_result res;
+ int ret;
+
+ memset(&c, 0, sizeof(c));
+
+ c.directive.opcode = nvme_admin_directive_recv;
+ c.directive.nsid = cpu_to_le32(ns->ns_id);
+ c.directive.doper = NVME_DIR_RCV_ST_OP_RESOURCE;
+ c.directive.dtype = NVME_DIR_STREAMS;
+ c.directive.endir = streams;
+
+ ret = __nvme_submit_sync_cmd(ns->ctrl->admin_q, &c, &res, NULL, 0, 0,
+ NVME_QID_ANY, 0, 0);
+ if (ret)
+ return -1;
+
+ return le32_to_cpu(res.u32) & 0xffff;
+}
+
+static int nvme_streams_deallocate(struct nvme_ns *ns)
+{
+ struct nvme_command c;
+
+ memset(&c, 0, sizeof(c));
+
+ c.directive.opcode = nvme_admin_directive_send;
+ c.directive.nsid = cpu_to_le32(ns->ns_id);
+ c.directive.doper = NVME_DIR_SND_ST_OP_REL_RSC;
+ c.directive.dtype = NVME_DIR_STREAMS;
+
+ return nvme_submit_sync_cmd(ns->ctrl->admin_q, &c, NULL, 0);
+}
+
+static void nvme_write_hint_work(struct work_struct *work)
+{
+ struct nvme_ns *ns = container_of(work, struct nvme_ns, write_hint_work);
+ int ret, nr_streams;
+
+ if (ns->nr_streams)
+ return;
+
+ nr_streams = streams_per_ns;
+ if (nr_streams > ns->ctrl->nssa)
+ nr_streams = ns->ctrl->nssa;
+
+ ret = nvme_streams_allocate(ns, nr_streams);
+ if (ret <= 0)
+ goto err;
+
+ ns->nr_streams = ret;
+ dev_info(ns->ctrl->device, "successfully enabled %d streams\n", ret);
+ return;
+err:
+ dev_info(ns->ctrl->device, "failed enabling streams\n");
+ ns->ctrl->failed_streams = true;
+}
+
+static void nvme_configure_streams(struct nvme_ns *ns)
+{
+ /*
+ * If we already called this function, we've either marked it
+ * as a failure or set the number of streams.
+ */
+ if (ns->ctrl->failed_streams)
+ return;
+ if (ns->nr_streams)
+ return;
+ schedule_work(&ns->write_hint_work);
+}
+
+static unsigned int nvme_get_write_stream(struct nvme_ns *ns,
+ struct request *req)
+{
+ unsigned int streamid = 0;
+
+ if (req->cmd_flags & REQ_WRITE_SHORT)
+ streamid = 1;
+ else if (req->cmd_flags & REQ_WRITE_MEDIUM)
+ streamid = 2;
+ else if (req->cmd_flags & REQ_WRITE_LONG)
+ streamid = 3;
+ else if (req->cmd_flags & REQ_WRITE_EXTREME)
+ streamid = 4;
+
+ req->q->stream_writes[streamid] += blk_rq_bytes(req) >> 9;
+
+ if (streamid <= ns->nr_streams)
+ return streamid;
+
+ /* for now just round-robin, do something more clever later */
+ return (streamid % (ns->nr_streams + 1));
+}
+
static inline void nvme_setup_rw(struct nvme_ns *ns, struct request *req,
struct nvme_command *cmnd)
{
@@ -351,6 +500,25 @@ static inline void nvme_setup_rw(struct nvme_ns *ns, struct request *req,
cmnd->rw.slba = cpu_to_le64(nvme_block_nr(ns, blk_rq_pos(req)));
cmnd->rw.length = cpu_to_le16((blk_rq_bytes(req) >> ns->lba_shift) - 1);
+ /*
+ * If we support streams and they aren't enabled, do so now. Until
+ * they're enabled, we won't flag the write with a stream. If we don't
+ * support streams, just ignore the life time hint.
+ */
+ if (req_op(req) == REQ_OP_WRITE && op_write_hint_valid(req->cmd_flags)) {
+ struct nvme_ctrl *ctrl = ns->ctrl;
+
+ if (ns->nr_streams) {
+ unsigned int stream = nvme_get_write_stream(ns, req);
+
+ if (stream) {
+ control |= NVME_RW_DTYPE_STREAMS;
+ dsmgmt |= (stream << 16);
+ }
+ } else if (ctrl->oacs & NVME_CTRL_OACS_DIRECTIVES)
+ nvme_configure_streams(ns);
+ }
+
if (ns->ms) {
switch (ns->pi_type) {
case NVME_NS_DPS_PI_TYPE3:
@@ -1650,6 +1818,7 @@ int nvme_init_identify(struct nvme_ctrl *ctrl)
dev_pm_qos_hide_latency_tolerance(ctrl->device);
nvme_configure_apst(ctrl);
+ nvme_probe_directives(ctrl);
ctrl->identified = true;
@@ -2049,6 +2218,8 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
blk_queue_logical_block_size(ns->queue, 1 << ns->lba_shift);
nvme_set_queue_limits(ctrl, ns->queue);
+ INIT_WORK(&ns->write_hint_work, nvme_write_hint_work);
+
sprintf(disk_name, "nvme%dn%d", ctrl->instance, ns->instance);
if (nvme_revalidate_ns(ns, &id))
@@ -2105,6 +2276,8 @@ static void nvme_ns_remove(struct nvme_ns *ns)
if (test_and_set_bit(NVME_NS_REMOVING, &ns->flags))
return;
+ flush_work(&ns->write_hint_work);
+
if (ns->disk && ns->disk->flags & GENHD_FL_UP) {
if (blk_get_integrity(ns->disk))
blk_integrity_unregister(ns->disk);
@@ -2112,6 +2285,8 @@ static void nvme_ns_remove(struct nvme_ns *ns)
&nvme_ns_attr_group);
if (ns->ndev)
nvme_nvm_unregister_sysfs(ns);
+ if (ns->nr_streams)
+ nvme_streams_deallocate(ns);
del_gendisk(ns->disk);
blk_cleanup_queue(ns->queue);
}
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 9d6a070d4391..918b6126d38b 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -118,6 +118,7 @@ enum nvme_ctrl_state {
struct nvme_ctrl {
enum nvme_ctrl_state state;
bool identified;
+ bool failed_streams;
spinlock_t lock;
const struct nvme_ctrl_ops *ops;
struct request_queue *admin_q;
@@ -147,6 +148,7 @@ struct nvme_ctrl {
u16 oncs;
u16 vid;
u16 oacs;
+ u16 nssa;
atomic_t abort_limit;
u8 event_limit;
u8 vwc;
@@ -192,6 +194,7 @@ struct nvme_ns {
u8 uuid[16];
unsigned ns_id;
+ unsigned nr_streams;
int lba_shift;
u16 ms;
bool ext;
@@ -203,6 +206,8 @@ struct nvme_ns {
u64 mode_select_num_blocks;
u32 mode_select_block_len;
+
+ struct work_struct write_hint_work;
};
struct nvme_ctrl_ops {
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index b625bacf37ef..8b2f5b140134 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -245,6 +245,7 @@ enum {
NVME_CTRL_ONCS_WRITE_ZEROES = 1 << 3,
NVME_CTRL_VWC_PRESENT = 1 << 0,
NVME_CTRL_OACS_SEC_SUPP = 1 << 0,
+ NVME_CTRL_OACS_DIRECTIVES = 1 << 5,
NVME_CTRL_OACS_DBBUF_SUPP = 1 << 7,
};
@@ -295,6 +296,19 @@ enum {
};
enum {
+ NVME_DIR_IDENTIFY = 0x00,
+ NVME_DIR_STREAMS = 0x01,
+ NVME_DIR_SND_ID_OP_ENABLE = 0x01,
+ NVME_DIR_SND_ST_OP_REL_ID = 0x01,
+ NVME_DIR_SND_ST_OP_REL_RSC = 0x02,
+ NVME_DIR_RCV_ID_OP_PARAM = 0x01,
+ NVME_DIR_RCV_ST_OP_PARAM = 0x01,
+ NVME_DIR_RCV_ST_OP_STATUS = 0x02,
+ NVME_DIR_RCV_ST_OP_RESOURCE = 0x03,
+ NVME_DIR_ENDIR = 0x01,
+};
+
+enum {
NVME_NS_FEAT_THIN = 1 << 0,
NVME_NS_FLBAS_LBA_MASK = 0xf,
NVME_NS_FLBAS_META_EXT = 0x10,
@@ -535,6 +549,7 @@ enum {
NVME_RW_PRINFO_PRCHK_APP = 1 << 11,
NVME_RW_PRINFO_PRCHK_GUARD = 1 << 12,
NVME_RW_PRINFO_PRACT = 1 << 13,
+ NVME_RW_DTYPE_STREAMS = 1 << 4,
};
struct nvme_dsm_cmd {
@@ -604,6 +619,8 @@ enum nvme_admin_opcode {
nvme_admin_download_fw = 0x11,
nvme_admin_ns_attach = 0x15,
nvme_admin_keep_alive = 0x18,
+ nvme_admin_directive_send = 0x19,
+ nvme_admin_directive_recv = 0x1a,
nvme_admin_dbbuf = 0x7C,
nvme_admin_format_nvm = 0x80,
nvme_admin_security_send = 0x81,
@@ -756,6 +773,24 @@ struct nvme_get_log_page_command {
__u32 rsvd14[2];
};
+struct nvme_directive_cmd {
+ __u8 opcode;
+ __u8 flags;
+ __u16 command_id;
+ __le32 nsid;
+ __u64 rsvd2[2];
+ union nvme_data_ptr dptr;
+ __le32 numd;
+ __u8 doper;
+ __u8 dtype;
+ __le16 dspec;
+ __u8 endir;
+ __u8 tdtype;
+ __u16 rsvd15;
+
+ __u32 rsvd16[3];
+};
+
/*
* Fabrics subcommands.
*/
@@ -886,6 +921,18 @@ struct nvme_dbbuf {
__u32 rsvd12[6];
};
+struct streams_directive_params {
+ __u16 msl;
+ __u16 nssa;
+ __u16 nsso;
+ __u8 rsvd[10];
+ __u32 sws;
+ __u16 sgs;
+ __u16 nsa;
+ __u16 nso;
+ __u8 rsvd2[6];
+};
+
struct nvme_command {
union {
struct nvme_common_command common;
@@ -906,6 +953,7 @@ struct nvme_command {
struct nvmf_property_set_command prop_set;
struct nvmf_property_get_command prop_get;
struct nvme_dbbuf dbbuf;
+ struct nvme_directive_cmd directive;
};
};
--
2.7.4
* [PATCH 11/11] nvme: add support for streams and directives
2017-06-14 19:05 [PATCHSET v3] Add support for write life time hints Jens Axboe
@ 2017-06-14 19:05 ` Jens Axboe
2017-06-14 20:32 ` Christoph Hellwig
0 siblings, 1 reply; 37+ messages in thread
From: Jens Axboe @ 2017-06-14 19:05 UTC (permalink / raw)
To: linux-fsdevel, linux-block; +Cc: adilger, hch, martin.petersen, Jens Axboe
This adds support for Directives in NVMe, in particular for the Streams
directive. Support for Directives is a new feature in NVMe 1.3. It
allows a user to pass in information about where to store the data,
so that the device can do so most efficiently. If an application is
managing and writing data with different life times, mixing data with
different retention times onto the same locations on flash can cause
write amplification to grow. This, in turn, will reduce performance and
life time of the device.
We default to allocating 4 streams per name space, but it is
configurable with the 'streams_per_ns' module option. If a write stream
is set on a write, flag it as such before sending it to the device.
Some debug stuff in this patch, dumping streams ID params when
we load nvme.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
drivers/nvme/host/core.c | 146 +++++++++++++++++++++++++++++++++++++++++++++++
drivers/nvme/host/nvme.h | 1 +
include/linux/nvme.h | 48 ++++++++++++++++
3 files changed, 195 insertions(+)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 903d5813023a..a389d62a528b 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -65,6 +65,10 @@ static bool force_apst;
module_param(force_apst, bool, 0644);
MODULE_PARM_DESC(force_apst, "allow APST for newly enumerated devices even if quirked off");
+static char streams_per_ns = 4;
+module_param(streams_per_ns, byte, 0644);
+MODULE_PARM_DESC(streams_per_ns, "if available, allocate this many streams per NS");
+
static LIST_HEAD(nvme_ctrl_list);
static DEFINE_SPINLOCK(dev_list_lock);
@@ -331,9 +335,34 @@ static inline int nvme_setup_discard(struct nvme_ns *ns, struct request *req,
return BLK_MQ_RQ_QUEUE_OK;
}
+static unsigned int nvme_get_write_stream(struct nvme_ns *ns,
+ struct request *req)
+{
+ unsigned int streamid = 0;
+
+ if (req_op(req) != REQ_OP_WRITE || !blk_stream_valid(req->cmd_flags) ||
+ !ns->nr_streams)
+ return 0;
+
+ if (req->cmd_flags & REQ_WRITE_SHORT)
+ streamid = 1;
+ else if (req->cmd_flags & REQ_WRITE_MEDIUM)
+ streamid = 2;
+ else if (req->cmd_flags & REQ_WRITE_LONG)
+ streamid = 3;
+ else if (req->cmd_flags & REQ_WRITE_EXTREME)
+ streamid = 4;
+
+ if (streamid < BLK_MAX_STREAM)
+ req->q->stream_writes[streamid] += blk_rq_bytes(req) >> 9;
+
+ return (streamid % (ns->nr_streams + 1));
+}
+
static inline void nvme_setup_rw(struct nvme_ns *ns, struct request *req,
struct nvme_command *cmnd)
{
+ unsigned int stream;
u16 control = 0;
u32 dsmgmt = 0;
@@ -351,6 +380,12 @@ static inline void nvme_setup_rw(struct nvme_ns *ns, struct request *req,
cmnd->rw.slba = cpu_to_le64(nvme_block_nr(ns, blk_rq_pos(req)));
cmnd->rw.length = cpu_to_le16((blk_rq_bytes(req) >> ns->lba_shift) - 1);
+ stream = nvme_get_write_stream(ns, req);
+ if (stream) {
+ control |= NVME_RW_DTYPE_STREAMS;
+ dsmgmt |= (stream << 16);
+ }
+
if (ns->ms) {
switch (ns->pi_type) {
case NVME_NS_DPS_PI_TYPE3:
@@ -1073,6 +1108,109 @@ static int nvme_revalidate_disk(struct gendisk *disk)
return 0;
}
+static int nvme_enable_streams(struct nvme_ns *ns)
+{
+ struct nvme_command c;
+
+ memset(&c, 0, sizeof(c));
+
+ c.directive.opcode = nvme_admin_directive_send;
+ c.directive.nsid = cpu_to_le32(ns->ns_id);
+ c.directive.doper = NVME_DIR_SND_ID_OP_ENABLE;
+ c.directive.dtype = NVME_DIR_IDENTIFY;
+ c.directive.tdtype = NVME_DIR_STREAMS;
+ c.directive.endir = NVME_DIR_ENDIR;
+
+ return nvme_submit_sync_cmd(ns->ctrl->admin_q, &c, NULL, 0);
+}
+
+static int nvme_streams_params(struct nvme_ns *ns)
+{
+ struct nvme_ctrl *ctrl = ns->ctrl;
+ struct streams_directive_params s;
+ struct nvme_command c;
+ int ret;
+
+ memset(&c, 0, sizeof(c));
+ memset(&s, 0, sizeof(s));
+
+ c.directive.opcode = nvme_admin_directive_recv;
+ c.directive.nsid = cpu_to_le32(ns->ns_id);
+ c.directive.numd = sizeof(s);
+ c.directive.doper = NVME_DIR_RCV_ST_OP_PARAM;
+ c.directive.dtype = NVME_DIR_STREAMS;
+
+ ret = nvme_submit_sync_cmd(ctrl->admin_q, &c, &s, sizeof(s));
+ if (ret)
+ return ret;
+
+ s.msl = le16_to_cpu(s.msl);
+ s.nssa = le16_to_cpu(s.nssa);
+ s.nsso = le16_to_cpu(s.nsso);
+ s.sws = le32_to_cpu(s.sws);
+ s.sgs = le16_to_cpu(s.sgs);
+ s.nsa = le16_to_cpu(s.nsa);
+ s.nso = le16_to_cpu(s.nso);
+
+ dev_info(ctrl->device, "streams: msl=%u, nssa=%u, nsso=%u, sws=%u "
+ "sgs=%u, nsa=%u, nso=%u\n", s.msl, s.nssa,
+ s.nsso, s.sws, s.sgs, s.nsa, s.nso);
+ return 0;
+}
+
+static int nvme_streams_allocate(struct nvme_ns *ns, unsigned int streams)
+{
+ struct nvme_command c;
+
+ memset(&c, 0, sizeof(c));
+
+ c.directive.opcode = nvme_admin_directive_recv;
+ c.directive.nsid = cpu_to_le32(ns->ns_id);
+ c.directive.doper = NVME_DIR_RCV_ST_OP_RESOURCE;
+ c.directive.dtype = NVME_DIR_STREAMS;
+ c.directive.endir = streams;
+
+ return nvme_submit_sync_cmd(ns->ctrl->admin_q, &c, NULL, 0);
+}
+
+static int nvme_streams_deallocate(struct nvme_ns *ns)
+{
+ struct nvme_command c;
+
+ memset(&c, 0, sizeof(c));
+
+ c.directive.opcode = nvme_admin_directive_send;
+ c.directive.nsid = cpu_to_le32(ns->ns_id);
+ c.directive.doper = NVME_DIR_SND_ST_OP_REL_RSC;
+ c.directive.dtype = NVME_DIR_STREAMS;
+
+ return nvme_submit_sync_cmd(ns->ctrl->admin_q, &c, NULL, 0);
+}
+
+static void nvme_config_streams(struct nvme_ns *ns)
+{
+ int ret;
+
+ ret = nvme_enable_streams(ns);
+ if (ret)
+ return;
+
+ ret = nvme_streams_params(ns);
+ if (ret)
+ return;
+
+ ret = nvme_streams_allocate(ns, streams_per_ns);
+ if (ret)
+ return;
+
+ ret = nvme_streams_params(ns);
+ if (ret)
+ return;
+
+ ns->nr_streams = streams_per_ns;
+ dev_info(ns->ctrl->device, "successfully enabled streams\n");
+}
+
static char nvme_pr_type(enum pr_type type)
{
switch (type) {
@@ -1606,6 +1744,9 @@ int nvme_init_identify(struct nvme_ctrl *ctrl)
ctrl->sgls = le32_to_cpu(id->sgls);
ctrl->kas = le16_to_cpu(id->kas);
+ if (ctrl->oacs & NVME_CTRL_OACS_DIRECTIVES)
+ dev_info(ctrl->dev, "supports directives\n");
+
ctrl->npss = id->npss;
prev_apsta = ctrl->apsta;
if (ctrl->quirks & NVME_QUIRK_NO_APST) {
@@ -2060,6 +2201,9 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
goto out_free_id;
}
+ if (ctrl->oacs & NVME_CTRL_OACS_DIRECTIVES)
+ nvme_config_streams(ns);
+
disk = alloc_disk_node(0, node);
if (!disk)
goto out_free_id;
@@ -2112,6 +2256,8 @@ static void nvme_ns_remove(struct nvme_ns *ns)
&nvme_ns_attr_group);
if (ns->ndev)
nvme_nvm_unregister_sysfs(ns);
+ if (ns->nr_streams)
+ nvme_streams_deallocate(ns);
del_gendisk(ns->disk);
blk_cleanup_queue(ns->queue);
}
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 9d6a070d4391..dc87c8284259 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -192,6 +192,7 @@ struct nvme_ns {
u8 uuid[16];
unsigned ns_id;
+ unsigned nr_streams;
int lba_shift;
u16 ms;
bool ext;
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index b625bacf37ef..8b2f5b140134 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -245,6 +245,7 @@ enum {
NVME_CTRL_ONCS_WRITE_ZEROES = 1 << 3,
NVME_CTRL_VWC_PRESENT = 1 << 0,
NVME_CTRL_OACS_SEC_SUPP = 1 << 0,
+ NVME_CTRL_OACS_DIRECTIVES = 1 << 5,
NVME_CTRL_OACS_DBBUF_SUPP = 1 << 7,
};
@@ -295,6 +296,19 @@ enum {
};
enum {
+ NVME_DIR_IDENTIFY = 0x00,
+ NVME_DIR_STREAMS = 0x01,
+ NVME_DIR_SND_ID_OP_ENABLE = 0x01,
+ NVME_DIR_SND_ST_OP_REL_ID = 0x01,
+ NVME_DIR_SND_ST_OP_REL_RSC = 0x02,
+ NVME_DIR_RCV_ID_OP_PARAM = 0x01,
+ NVME_DIR_RCV_ST_OP_PARAM = 0x01,
+ NVME_DIR_RCV_ST_OP_STATUS = 0x02,
+ NVME_DIR_RCV_ST_OP_RESOURCE = 0x03,
+ NVME_DIR_ENDIR = 0x01,
+};
+
+enum {
NVME_NS_FEAT_THIN = 1 << 0,
NVME_NS_FLBAS_LBA_MASK = 0xf,
NVME_NS_FLBAS_META_EXT = 0x10,
@@ -535,6 +549,7 @@ enum {
NVME_RW_PRINFO_PRCHK_APP = 1 << 11,
NVME_RW_PRINFO_PRCHK_GUARD = 1 << 12,
NVME_RW_PRINFO_PRACT = 1 << 13,
+ NVME_RW_DTYPE_STREAMS = 1 << 4,
};
struct nvme_dsm_cmd {
@@ -604,6 +619,8 @@ enum nvme_admin_opcode {
nvme_admin_download_fw = 0x11,
nvme_admin_ns_attach = 0x15,
nvme_admin_keep_alive = 0x18,
+ nvme_admin_directive_send = 0x19,
+ nvme_admin_directive_recv = 0x1a,
nvme_admin_dbbuf = 0x7C,
nvme_admin_format_nvm = 0x80,
nvme_admin_security_send = 0x81,
@@ -756,6 +773,24 @@ struct nvme_get_log_page_command {
__u32 rsvd14[2];
};
+struct nvme_directive_cmd {
+ __u8 opcode;
+ __u8 flags;
+ __u16 command_id;
+ __le32 nsid;
+ __u64 rsvd2[2];
+ union nvme_data_ptr dptr;
+ __le32 numd;
+ __u8 doper;
+ __u8 dtype;
+ __le16 dspec;
+ __u8 endir;
+ __u8 tdtype;
+ __u16 rsvd15;
+
+ __u32 rsvd16[3];
+};
+
/*
* Fabrics subcommands.
*/
@@ -886,6 +921,18 @@ struct nvme_dbbuf {
__u32 rsvd12[6];
};
+struct streams_directive_params {
+ __u16 msl;
+ __u16 nssa;
+ __u16 nsso;
+ __u8 rsvd[10];
+ __u32 sws;
+ __u16 sgs;
+ __u16 nsa;
+ __u16 nso;
+ __u8 rsvd2[6];
+};
+
struct nvme_command {
union {
struct nvme_common_command common;
@@ -906,6 +953,7 @@ struct nvme_command {
struct nvmf_property_set_command prop_set;
struct nvmf_property_get_command prop_get;
struct nvme_dbbuf dbbuf;
+ struct nvme_directive_cmd directive;
};
};
--
2.7.4
* Re: [PATCH 11/11] nvme: add support for streams and directives
2017-06-14 19:05 ` [PATCH 11/11] nvme: add support for streams and directives Jens Axboe
@ 2017-06-14 20:32 ` Christoph Hellwig
2017-06-14 20:43 ` Jens Axboe
0 siblings, 1 reply; 37+ messages in thread
From: Christoph Hellwig @ 2017-06-14 20:32 UTC (permalink / raw)
To: Jens Axboe; +Cc: linux-fsdevel, linux-block, adilger, hch, martin.petersen
> +static unsigned int nvme_get_write_stream(struct nvme_ns *ns,
> + struct request *req)
> +{
> + unsigned int streamid = 0;
> +
> + if (req_op(req) != REQ_OP_WRITE || !blk_stream_valid(req->cmd_flags) ||
> + !ns->nr_streams)
> + return 0;
Might make more sense to do this check in the caller?
> +
> + if (req->cmd_flags & REQ_WRITE_SHORT)
> + streamid = 1;
> + else if (req->cmd_flags & REQ_WRITE_MEDIUM)
> + streamid = 2;
> + else if (req->cmd_flags & REQ_WRITE_LONG)
> + streamid = 3;
> + else if (req->cmd_flags & REQ_WRITE_EXTREME)
> + streamid = 4;
> +
> + if (streamid < BLK_MAX_STREAM)
Can't happen per the index above.
> + req->q->stream_writes[streamid] += blk_rq_bytes(req) >> 9;
> +
> + return (streamid % (ns->nr_streams + 1));
Should we do smarter collapsing? e.g. short + medium and long + extreme
for two? What for three? Does one extra stream make sense in this
scheme?
> + dev_info(ctrl->device, "streams: msl=%u, nssa=%u, nsso=%u, sws=%u "
> + "sgs=%u, nsa=%u, nso=%u\n", s.msl, s.nssa,
> + s.nsso, s.sws, s.sgs, s.nsa, s.nso);
Way too chatty.
> + if (ctrl->oacs & NVME_CTRL_OACS_DIRECTIVES)
> + dev_info(ctrl->dev, "supports directives\n");
Same. Use nvme-cli for that sort of info.
> ctrl->npss = id->npss;
> prev_apsta = ctrl->apsta;
> if (ctrl->quirks & NVME_QUIRK_NO_APST) {
> @@ -2060,6 +2201,9 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
> goto out_free_id;
> }
>
> + if (ctrl->oacs & NVME_CTRL_OACS_DIRECTIVES)
> + nvme_config_streams(ns);
This sets aside four streams on any device that supports them, and
will probably kill performance on them unless you have a workload
that actually uses those streams. I think they need to be allocated
lazily.
* Re: [PATCH 11/11] nvme: add support for streams and directives
2017-06-14 20:32 ` Christoph Hellwig
@ 2017-06-14 20:43 ` Jens Axboe
0 siblings, 0 replies; 37+ messages in thread
From: Jens Axboe @ 2017-06-14 20:43 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linux-fsdevel, linux-block, adilger, martin.petersen
On 06/14/2017 02:32 PM, Christoph Hellwig wrote:
>> +static unsigned int nvme_get_write_stream(struct nvme_ns *ns,
>> + struct request *req)
>> +{
>> + unsigned int streamid = 0;
>> +
>> + if (req_op(req) != REQ_OP_WRITE || !blk_stream_valid(req->cmd_flags) ||
>> + !ns->nr_streams)
>> + return 0;
>
> Might make more sense to do this check in the caller?
OK, will fix up.
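Something like this in nvme_setup_rw(), then (a sketch reusing the
existing names; nvme_get_write_stream() would only do the mapping):

	if (req_op(req) == REQ_OP_WRITE && blk_stream_valid(req->cmd_flags) &&
	    ns->nr_streams) {
		unsigned int stream = nvme_get_write_stream(ns, req);

		if (stream) {
			control |= NVME_RW_DTYPE_STREAMS;
			dsmgmt |= (stream << 16);
		}
	}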
>> + if (req->cmd_flags & REQ_WRITE_SHORT)
>> + streamid = 1;
>> + else if (req->cmd_flags & REQ_WRITE_MEDIUM)
>> + streamid = 2;
>> + else if (req->cmd_flags & REQ_WRITE_LONG)
>> + streamid = 3;
>> + else if (req->cmd_flags & REQ_WRITE_EXTREME)
>> + streamid = 4;
>> +
>> + if (streamid < BLK_MAX_STREAM)
>
> Can't happen per the index above.
True, that's a leftover from the previous version.
>> + req->q->stream_writes[streamid] += blk_rq_bytes(req) >> 9;
>> +
>> + return (streamid % (ns->nr_streams + 1));
>
> Should we do smarter collapsing? e.g. short + medium and long + extreme
> for two? What for three? Does one extra stream make sense in this
> scheme?
Collapsing is probably saner than round-robin. I'd tend to collapse on
the longer life time end of things; logically that would make the most
sense.
>> + dev_info(ctrl->device, "streams: msl=%u, nssa=%u, nsso=%u, sws=%u "
>> + "sgs=%u, nsa=%u, nso=%u\n", s.msl, s.nssa,
>> + s.nsso, s.sws, s.sgs, s.nsa, s.nso);
>
> Way too chatty.
Sure, that's mentioned in the changelog; that stuff will go.
>> + if (ctrl->oacs & NVME_CTRL_OACS_DIRECTIVES)
>> + dev_info(ctrl->dev, "supports directives\n");
>
> Same. Use nvme-cli for that sort of info.
Ditto
>> ctrl->npss = id->npss;
>> prev_apsta = ctrl->apsta;
>> if (ctrl->quirks & NVME_QUIRK_NO_APST) {
>> @@ -2060,6 +2201,9 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
>> goto out_free_id;
>> }
>>
>> + if (ctrl->oacs & NVME_CTRL_OACS_DIRECTIVES)
>> + nvme_config_streams(ns);
>
> This sets aside four streams on any device that supports them, and
> will probably kill performance on them unless you have a workload
> that actually uses those streams. I think they need to be allocated
> lazily.
That's a good point; I have been thinking about how best to handle this.
I don't want an API for this, but doing it lazily would be fine. When we
see a write with a life time attached, kick off background setup of
streams. Until that's done, don't use any streams. Once set up, we'll
mark writes as we currently do now.
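Roughly, the trigger could look like this (a sketch; nr_streams,
failed_streams and write_hint_work are the names the next revision of
the patch ends up using):

static void nvme_configure_streams(struct nvme_ns *ns)
{
	/* nothing to do if a previous attempt failed or we're set up */
	if (ns->ctrl->failed_streams || ns->nr_streams)
		return;

	/* allocate in the background; writes get flagged once this is done */
	schedule_work(&ns->write_hint_work);
}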
How's that?
--
Jens Axboe
* [PATCH 11/11] nvme: add support for streams and directives
2017-06-13 17:15 [PATCH 0/11] Add support for write life time hints Jens Axboe
@ 2017-06-13 17:15 ` Jens Axboe
2017-06-13 19:47 ` Andreas Dilger
2017-06-13 21:12 ` Andreas Dilger
0 siblings, 2 replies; 37+ messages in thread
From: Jens Axboe @ 2017-06-13 17:15 UTC (permalink / raw)
To: linux-fsdevel, linux-block; +Cc: Jens Axboe
This adds support for Directives in NVMe, in particular for the Streams
directive. We default to allocating 4 streams per name space, but
it is configurable with the 'streams_per_ns' module option.
If a write stream is set on a write, flag it as such before
sending it to the device.
Some debug stuff in this patch, dumping streams ID params when
we load nvme.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
drivers/nvme/host/core.c | 124 +++++++++++++++++++++++++++++++++++++++++++++++
drivers/nvme/host/nvme.h | 1 +
include/linux/nvme.h | 48 ++++++++++++++++++
3 files changed, 173 insertions(+)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 903d5813023a..81225e7d4176 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -65,6 +65,10 @@ static bool force_apst;
module_param(force_apst, bool, 0644);
MODULE_PARM_DESC(force_apst, "allow APST for newly enumerated devices even if quirked off");
+static char streams_per_ns = 4;
+module_param(streams_per_ns, byte, 0644);
+MODULE_PARM_DESC(streams_per_ns, "if available, allocate this many streams per NS");
+
static LIST_HEAD(nvme_ctrl_list);
static DEFINE_SPINLOCK(dev_list_lock);
@@ -351,6 +355,15 @@ static inline void nvme_setup_rw(struct nvme_ns *ns, struct request *req,
cmnd->rw.slba = cpu_to_le64(nvme_block_nr(ns, blk_rq_pos(req)));
cmnd->rw.length = cpu_to_le16((blk_rq_bytes(req) >> ns->lba_shift) - 1);
+ if (req_op(req) == REQ_OP_WRITE) {
+ if (bio_stream_valid(req->bio) && ns->streams) {
+ unsigned stream = bio_stream(req->bio) & 0xffff;
+
+ control |= NVME_RW_DTYPE_STREAMS;
+ dsmgmt |= (stream << 16);
+ }
+ }
+
if (ns->ms) {
switch (ns->pi_type) {
case NVME_NS_DPS_PI_TYPE3:
@@ -1073,6 +1086,109 @@ static int nvme_revalidate_disk(struct gendisk *disk)
return 0;
}
+static int nvme_enable_streams(struct nvme_ns *ns)
+{
+ struct nvme_command c;
+
+ memset(&c, 0, sizeof(c));
+
+ c.directive.opcode = nvme_admin_directive_send;
+ c.directive.nsid = cpu_to_le32(ns->ns_id);
+ c.directive.doper = NVME_DIR_SND_ID_OP_ENABLE;
+ c.directive.dtype = NVME_DIR_IDENTIFY;
+ c.directive.tdtype = NVME_DIR_STREAMS;
+ c.directive.endir = NVME_DIR_ENDIR;
+
+ return nvme_submit_sync_cmd(ns->ctrl->admin_q, &c, NULL, 0);
+}
+
+static int nvme_streams_params(struct nvme_ns *ns)
+{
+ struct nvme_ctrl *ctrl = ns->ctrl;
+ struct streams_directive_params s;
+ struct nvme_command c;
+ int ret;
+
+ memset(&c, 0, sizeof(c));
+ memset(&s, 0, sizeof(s));
+
+ c.directive.opcode = nvme_admin_directive_recv;
+ c.directive.nsid = cpu_to_le32(ns->ns_id);
+ c.directive.numd = sizeof(s);
+ c.directive.doper = NVME_DIR_RCV_ST_OP_PARAM;
+ c.directive.dtype = NVME_DIR_STREAMS;
+
+ ret = nvme_submit_sync_cmd(ctrl->admin_q, &c, &s, sizeof(s));
+ if (ret)
+ return ret;
+
+ s.msl = le16_to_cpu(s.msl);
+ s.nssa = le16_to_cpu(s.nssa);
+ s.nsso = le16_to_cpu(s.nsso);
+ s.sws = le32_to_cpu(s.sws);
+ s.sgs = le16_to_cpu(s.sgs);
+ s.nsa = le16_to_cpu(s.nsa);
+ s.nso = le16_to_cpu(s.nso);
+
+ dev_info(ctrl->device, "streams: msl=%u, nssa=%u, nsso=%u, sws=%u "
+ "sgs=%u, nsa=%u, nso=%u\n", s.msl, s.nssa,
+ s.nsso, s.sws, s.sgs, s.nsa, s.nso);
+ return 0;
+}
+
+static int nvme_streams_allocate(struct nvme_ns *ns, unsigned int streams)
+{
+ struct nvme_command c;
+
+ memset(&c, 0, sizeof(c));
+
+ c.directive.opcode = nvme_admin_directive_recv;
+ c.directive.nsid = cpu_to_le32(ns->ns_id);
+ c.directive.doper = NVME_DIR_RCV_ST_OP_RESOURCE;
+ c.directive.dtype = NVME_DIR_STREAMS;
+ c.directive.endir = streams;
+
+ return nvme_submit_sync_cmd(ns->ctrl->admin_q, &c, NULL, 0);
+}
+
+static int nvme_streams_deallocate(struct nvme_ns *ns)
+{
+ struct nvme_command c;
+
+ memset(&c, 0, sizeof(c));
+
+ c.directive.opcode = nvme_admin_directive_send;
+ c.directive.nsid = cpu_to_le32(ns->ns_id);
+ c.directive.doper = NVME_DIR_SND_ST_OP_REL_RSC;
+ c.directive.dtype = NVME_DIR_STREAMS;
+
+ return nvme_submit_sync_cmd(ns->ctrl->admin_q, &c, NULL, 0);
+}
+
+static void nvme_config_streams(struct nvme_ns *ns)
+{
+ int ret;
+
+ ret = nvme_enable_streams(ns);
+ if (ret)
+ return;
+
+ ret = nvme_streams_params(ns);
+ if (ret)
+ return;
+
+ ret = nvme_streams_allocate(ns, streams_per_ns);
+ if (ret)
+ return;
+
+ ret = nvme_streams_params(ns);
+ if (ret)
+ return;
+
+ ns->streams = true;
+ dev_info(ns->ctrl->device, "successfully enabled streams\n");
+}
+
static char nvme_pr_type(enum pr_type type)
{
switch (type) {
@@ -1606,6 +1722,9 @@ int nvme_init_identify(struct nvme_ctrl *ctrl)
ctrl->sgls = le32_to_cpu(id->sgls);
ctrl->kas = le16_to_cpu(id->kas);
+ if (ctrl->oacs & NVME_CTRL_OACS_DIRECTIVES)
+ dev_info(ctrl->dev, "supports directives\n");
+
ctrl->npss = id->npss;
prev_apsta = ctrl->apsta;
if (ctrl->quirks & NVME_QUIRK_NO_APST) {
@@ -2060,6 +2179,9 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
goto out_free_id;
}
+ if (ctrl->oacs & NVME_CTRL_OACS_DIRECTIVES)
+ nvme_config_streams(ns);
+
disk = alloc_disk_node(0, node);
if (!disk)
goto out_free_id;
@@ -2112,6 +2234,8 @@ static void nvme_ns_remove(struct nvme_ns *ns)
&nvme_ns_attr_group);
if (ns->ndev)
nvme_nvm_unregister_sysfs(ns);
+ if (ns->streams)
+ nvme_streams_deallocate(ns);
del_gendisk(ns->disk);
blk_cleanup_queue(ns->queue);
}
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 9d6a070d4391..c2d8d23c90de 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -195,6 +195,7 @@ struct nvme_ns {
int lba_shift;
u16 ms;
bool ext;
+ bool streams;
u8 pi_type;
unsigned long flags;
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index b625bacf37ef..8b2f5b140134 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -245,6 +245,7 @@ enum {
NVME_CTRL_ONCS_WRITE_ZEROES = 1 << 3,
NVME_CTRL_VWC_PRESENT = 1 << 0,
NVME_CTRL_OACS_SEC_SUPP = 1 << 0,
+ NVME_CTRL_OACS_DIRECTIVES = 1 << 5,
NVME_CTRL_OACS_DBBUF_SUPP = 1 << 7,
};
@@ -295,6 +296,19 @@ enum {
};
enum {
+ NVME_DIR_IDENTIFY = 0x00,
+ NVME_DIR_STREAMS = 0x01,
+ NVME_DIR_SND_ID_OP_ENABLE = 0x01,
+ NVME_DIR_SND_ST_OP_REL_ID = 0x01,
+ NVME_DIR_SND_ST_OP_REL_RSC = 0x02,
+ NVME_DIR_RCV_ID_OP_PARAM = 0x01,
+ NVME_DIR_RCV_ST_OP_PARAM = 0x01,
+ NVME_DIR_RCV_ST_OP_STATUS = 0x02,
+ NVME_DIR_RCV_ST_OP_RESOURCE = 0x03,
+ NVME_DIR_ENDIR = 0x01,
+};
+
+enum {
NVME_NS_FEAT_THIN = 1 << 0,
NVME_NS_FLBAS_LBA_MASK = 0xf,
NVME_NS_FLBAS_META_EXT = 0x10,
@@ -535,6 +549,7 @@ enum {
NVME_RW_PRINFO_PRCHK_APP = 1 << 11,
NVME_RW_PRINFO_PRCHK_GUARD = 1 << 12,
NVME_RW_PRINFO_PRACT = 1 << 13,
+ NVME_RW_DTYPE_STREAMS = 1 << 4,
};
struct nvme_dsm_cmd {
@@ -604,6 +619,8 @@ enum nvme_admin_opcode {
nvme_admin_download_fw = 0x11,
nvme_admin_ns_attach = 0x15,
nvme_admin_keep_alive = 0x18,
+ nvme_admin_directive_send = 0x19,
+ nvme_admin_directive_recv = 0x1a,
nvme_admin_dbbuf = 0x7C,
nvme_admin_format_nvm = 0x80,
nvme_admin_security_send = 0x81,
@@ -756,6 +773,24 @@ struct nvme_get_log_page_command {
__u32 rsvd14[2];
};
+struct nvme_directive_cmd {
+ __u8 opcode;
+ __u8 flags;
+ __u16 command_id;
+ __le32 nsid;
+ __u64 rsvd2[2];
+ union nvme_data_ptr dptr;
+ __le32 numd;
+ __u8 doper;
+ __u8 dtype;
+ __le16 dspec;
+ __u8 endir;
+ __u8 tdtype;
+ __u16 rsvd15;
+
+ __u32 rsvd16[3];
+};
+
/*
* Fabrics subcommands.
*/
@@ -886,6 +921,18 @@ struct nvme_dbbuf {
__u32 rsvd12[6];
};
+struct streams_directive_params {
+ __u16 msl;
+ __u16 nssa;
+ __u16 nsso;
+ __u8 rsvd[10];
+ __u32 sws;
+ __u16 sgs;
+ __u16 nsa;
+ __u16 nso;
+ __u8 rsvd2[6];
+};
+
struct nvme_command {
union {
struct nvme_common_command common;
@@ -906,6 +953,7 @@ struct nvme_command {
struct nvmf_property_set_command prop_set;
struct nvmf_property_get_command prop_get;
struct nvme_dbbuf dbbuf;
+ struct nvme_directive_cmd directive;
};
};
--
2.7.4
* Re: [PATCH 11/11] nvme: add support for streams and directives
2017-06-13 17:15 ` [PATCH 11/11] nvme: add support for streams and directives Jens Axboe
@ 2017-06-13 19:47 ` Andreas Dilger
2017-06-13 20:25 ` Jens Axboe
2017-06-13 21:12 ` Andreas Dilger
1 sibling, 1 reply; 37+ messages in thread
From: Andreas Dilger @ 2017-06-13 19:47 UTC (permalink / raw)
To: Jens Axboe; +Cc: linux-fsdevel, linux-block
On Jun 13, 2017, at 11:15 AM, Jens Axboe <axboe@kernel.dk> wrote:
>
> This adds support for Directives in NVMe, in particular for the Streams
> directive. We default to allocating 4 streams per namespace, but
> it is configurable with the 'streams_per_ns' module option.
>
> If a write stream is set in a write, flag it as such before
> sending it to the device.
>
> This patch includes some debug output, dumping the streams directive
> parameters when nvme loads.
>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
> drivers/nvme/host/core.c | 124 +++++++++++++++++++++++++++++++++++++++++++++++
> drivers/nvme/host/nvme.h | 1 +
> include/linux/nvme.h | 48 ++++++++++++++++++
> 3 files changed, 173 insertions(+)
>
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 903d5813023a..81225e7d4176 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -65,6 +65,10 @@ static bool force_apst;
> module_param(force_apst, bool, 0644);
> MODULE_PARM_DESC(force_apst, "allow APST for newly enumerated devices even if quirked off");
>
> +static char streams_per_ns = 4;
> +module_param(streams_per_ns, byte, 0644);
> +MODULE_PARM_DESC(streams_per_ns, "if available, allocate this many streams per NS");
Are there any limits here? For example, does it have to be a power-of-two
value, or is there a maximum number per namespace, per device, etc.?
> static LIST_HEAD(nvme_ctrl_list);
> static DEFINE_SPINLOCK(dev_list_lock);
>
> @@ -351,6 +355,15 @@ static inline void nvme_setup_rw(struct nvme_ns *ns, struct request *req,
> cmnd->rw.slba = cpu_to_le64(nvme_block_nr(ns, blk_rq_pos(req)));
> cmnd->rw.length = cpu_to_le16((blk_rq_bytes(req) >> ns->lba_shift) - 1);
>
> + if (req_op(req) == REQ_OP_WRITE) {
> + if (bio_stream_valid(req->bio) && ns->streams) {
> + unsigned stream = bio_stream(req->bio) & 0xffff;
> +
> + control |= NVME_RW_DTYPE_STREAMS;
> + dsmgmt |= (stream << 16);
It isn't really clear how this is converted from the 2^16 stream IDs masked
out of 2^32 possible streams in the bio into streams_per_ns = 4 streams per
NVMe device namespace. This appears to be implicitly dependent on upper
layers submitting only 4 distinct WRITE_LIFE_* stream IDs, but IMHO that
should be handled transparently at this level. There isn't really any value
in masking the returned bio_stream with 0xffff, since either it will internally
be truncated by streams_per_ns (in which case we don't need to do anything),
or it should be explicitly folded into the accepted range, like:

	dsmgmt |= (bio_stream(req->bio) % streams_per_ns) << 16;

or, if we want to avoid a 32-bit modulus here, we could pre-compute a mask from
streams_per_ns and enforce that this be a power-of-two value?
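To make the power-of-two variant concrete, a small stand-alone sketch
(fold_stream() is illustrative only; stream 0 stays reserved for "no
hint", so the fold lands on 1..nstreams):

	#include <stdio.h>

	/* Fold an arbitrary stream ID into 1..nstreams without a modulus;
	 * assumes nstreams was validated as a power of two at module load. */
	static unsigned int fold_stream(unsigned int stream, unsigned int nstreams)
	{
		if (!stream)
			return 0;		/* no hint stays no hint */
		return ((stream - 1) & (nstreams - 1)) + 1;
	}

	int main(void)
	{
		unsigned int nstreams = 4;	/* power of two by construction */
		unsigned int id;

		for (id = 0; id <= 9; id++)
			printf("bio stream %u -> %u\n", id, fold_stream(id, nstreams));
		return 0;
	}

The decrement/increment around the mask keeps IDs 1..4 mapping to
themselves while 5..8 wrap back to 1..4, instead of 4 aliasing to 0.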
Cheers, Andreas
> + }
> + }
> +
> if (ns->ms) {
> switch (ns->pi_type) {
> case NVME_NS_DPS_PI_TYPE3:
> @@ -1073,6 +1086,109 @@ static int nvme_revalidate_disk(struct gendisk *disk)
> return 0;
> }
>
> +static int nvme_enable_streams(struct nvme_ns *ns)
> +{
> + struct nvme_command c;
> +
> + memset(&c, 0, sizeof(c));
> +
> + c.directive.opcode = nvme_admin_directive_send;
> + c.directive.nsid = cpu_to_le32(ns->ns_id);
> + c.directive.doper = NVME_DIR_SND_ID_OP_ENABLE;
> + c.directive.dtype = NVME_DIR_IDENTIFY;
> + c.directive.tdtype = NVME_DIR_STREAMS;
> + c.directive.endir = NVME_DIR_ENDIR;
> +
> + return nvme_submit_sync_cmd(ns->ctrl->admin_q, &c, NULL, 0);
> +}
> +
> +static int nvme_streams_params(struct nvme_ns *ns)
> +{
> + struct nvme_ctrl *ctrl = ns->ctrl;
> + struct streams_directive_params s;
> + struct nvme_command c;
> + int ret;
> +
> + memset(&c, 0, sizeof(c));
> + memset(&s, 0, sizeof(s));
> +
> + c.directive.opcode = nvme_admin_directive_recv;
> + c.directive.nsid = cpu_to_le32(ns->ns_id);
> + c.directive.numd = sizeof(s);
> + c.directive.doper = NVME_DIR_RCV_ST_OP_PARAM;
> + c.directive.dtype = NVME_DIR_STREAMS;
> +
> + ret = nvme_submit_sync_cmd(ctrl->admin_q, &c, &s, sizeof(s));
> + if (ret)
> + return ret;
> +
> + s.msl = le16_to_cpu(s.msl);
> + s.nssa = le16_to_cpu(s.nssa);
> + s.nsso = le16_to_cpu(s.nsso);
> + s.sws = le32_to_cpu(s.sws);
> + s.sgs = le16_to_cpu(s.sgs);
> + s.nsa = le16_to_cpu(s.nsa);
> + s.nso = le16_to_cpu(s.nso);
> +
> + dev_info(ctrl->device, "streams: msl=%u, nssa=%u, nsso=%u, sws=%u "
> + "sgs=%u, nsa=%u, nso=%u\n", s.msl, s.nssa,
> + s.nsso, s.sws, s.sgs, s.nsa, s.nso);
> + return 0;
> +}
> +
> +static int nvme_streams_allocate(struct nvme_ns *ns, unsigned int streams)
> +{
> + struct nvme_command c;
> +
> + memset(&c, 0, sizeof(c));
> +
> + c.directive.opcode = nvme_admin_directive_recv;
> + c.directive.nsid = cpu_to_le32(ns->ns_id);
> + c.directive.doper = NVME_DIR_RCV_ST_OP_RESOURCE;
> + c.directive.dtype = NVME_DIR_STREAMS;
> + c.directive.endir = streams;
> +
> + return nvme_submit_sync_cmd(ns->ctrl->admin_q, &c, NULL, 0);
> +}
> +
> +static int nvme_streams_deallocate(struct nvme_ns *ns)
> +{
> + struct nvme_command c;
> +
> + memset(&c, 0, sizeof(c));
> +
> + c.directive.opcode = nvme_admin_directive_send;
> + c.directive.nsid = cpu_to_le32(ns->ns_id);
> + c.directive.doper = NVME_DIR_SND_ST_OP_REL_RSC;
> + c.directive.dtype = NVME_DIR_STREAMS;
> +
> + return nvme_submit_sync_cmd(ns->ctrl->admin_q, &c, NULL, 0);
> +}
> +
> +static void nvme_config_streams(struct nvme_ns *ns)
> +{
> + int ret;
> +
> + ret = nvme_enable_streams(ns);
> + if (ret)
> + return;
> +
> + ret = nvme_streams_params(ns);
> + if (ret)
> + return;
> +
> + ret = nvme_streams_allocate(ns, streams_per_ns);
> + if (ret)
> + return;
> +
> + ret = nvme_streams_params(ns);
> + if (ret)
> + return;
> +
> + ns->streams = true;
> + dev_info(ns->ctrl->device, "successfully enabled streams\n");
> +}
> +
> static char nvme_pr_type(enum pr_type type)
> {
> switch (type) {
> @@ -1606,6 +1722,9 @@ int nvme_init_identify(struct nvme_ctrl *ctrl)
> ctrl->sgls = le32_to_cpu(id->sgls);
> ctrl->kas = le16_to_cpu(id->kas);
>
> + if (ctrl->oacs & NVME_CTRL_OACS_DIRECTIVES)
> + dev_info(ctrl->dev, "supports directives\n");
> +
> ctrl->npss = id->npss;
> prev_apsta = ctrl->apsta;
> if (ctrl->quirks & NVME_QUIRK_NO_APST) {
> @@ -2060,6 +2179,9 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
> goto out_free_id;
> }
>
> + if (ctrl->oacs & NVME_CTRL_OACS_DIRECTIVES)
> + nvme_config_streams(ns);
> +
> disk = alloc_disk_node(0, node);
> if (!disk)
> goto out_free_id;
> @@ -2112,6 +2234,8 @@ static void nvme_ns_remove(struct nvme_ns *ns)
> &nvme_ns_attr_group);
> if (ns->ndev)
> nvme_nvm_unregister_sysfs(ns);
> + if (ns->streams)
> + nvme_streams_deallocate(ns);
> del_gendisk(ns->disk);
> blk_cleanup_queue(ns->queue);
> }
> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> index 9d6a070d4391..c2d8d23c90de 100644
> --- a/drivers/nvme/host/nvme.h
> +++ b/drivers/nvme/host/nvme.h
> @@ -195,6 +195,7 @@ struct nvme_ns {
> int lba_shift;
> u16 ms;
> bool ext;
> + bool streams;
> u8 pi_type;
> unsigned long flags;
>
> diff --git a/include/linux/nvme.h b/include/linux/nvme.h
> index b625bacf37ef..8b2f5b140134 100644
> --- a/include/linux/nvme.h
> +++ b/include/linux/nvme.h
> @@ -245,6 +245,7 @@ enum {
> NVME_CTRL_ONCS_WRITE_ZEROES = 1 << 3,
> NVME_CTRL_VWC_PRESENT = 1 << 0,
> NVME_CTRL_OACS_SEC_SUPP = 1 << 0,
> + NVME_CTRL_OACS_DIRECTIVES = 1 << 5,
> NVME_CTRL_OACS_DBBUF_SUPP = 1 << 7,
> };
>
> @@ -295,6 +296,19 @@ enum {
> };
>
> enum {
> + NVME_DIR_IDENTIFY = 0x00,
> + NVME_DIR_STREAMS = 0x01,
> + NVME_DIR_SND_ID_OP_ENABLE = 0x01,
> + NVME_DIR_SND_ST_OP_REL_ID = 0x01,
> + NVME_DIR_SND_ST_OP_REL_RSC = 0x02,
> + NVME_DIR_RCV_ID_OP_PARAM = 0x01,
> + NVME_DIR_RCV_ST_OP_PARAM = 0x01,
> + NVME_DIR_RCV_ST_OP_STATUS = 0x02,
> + NVME_DIR_RCV_ST_OP_RESOURCE = 0x03,
> + NVME_DIR_ENDIR = 0x01,
> +};
> +
> +enum {
> NVME_NS_FEAT_THIN = 1 << 0,
> NVME_NS_FLBAS_LBA_MASK = 0xf,
> NVME_NS_FLBAS_META_EXT = 0x10,
> @@ -535,6 +549,7 @@ enum {
> NVME_RW_PRINFO_PRCHK_APP = 1 << 11,
> NVME_RW_PRINFO_PRCHK_GUARD = 1 << 12,
> NVME_RW_PRINFO_PRACT = 1 << 13,
> + NVME_RW_DTYPE_STREAMS = 1 << 4,
> };
>
> struct nvme_dsm_cmd {
> @@ -604,6 +619,8 @@ enum nvme_admin_opcode {
> nvme_admin_download_fw = 0x11,
> nvme_admin_ns_attach = 0x15,
> nvme_admin_keep_alive = 0x18,
> + nvme_admin_directive_send = 0x19,
> + nvme_admin_directive_recv = 0x1a,
> nvme_admin_dbbuf = 0x7C,
> nvme_admin_format_nvm = 0x80,
> nvme_admin_security_send = 0x81,
> @@ -756,6 +773,24 @@ struct nvme_get_log_page_command {
> __u32 rsvd14[2];
> };
>
> +struct nvme_directive_cmd {
> + __u8 opcode;
> + __u8 flags;
> + __u16 command_id;
> + __le32 nsid;
> + __u64 rsvd2[2];
> + union nvme_data_ptr dptr;
> + __le32 numd;
> + __u8 doper;
> + __u8 dtype;
> + __le16 dspec;
> + __u8 endir;
> + __u8 tdtype;
> + __u16 rsvd15;
> +
> + __u32 rsvd16[3];
> +};
> +
> /*
> * Fabrics subcommands.
> */
> @@ -886,6 +921,18 @@ struct nvme_dbbuf {
> __u32 rsvd12[6];
> };
>
> +struct streams_directive_params {
> + __u16 msl;
> + __u16 nssa;
> + __u16 nsso;
> + __u8 rsvd[10];
> + __u32 sws;
> + __u16 sgs;
> + __u16 nsa;
> + __u16 nso;
> + __u8 rsvd2[6];
> +};
> +
> struct nvme_command {
> union {
> struct nvme_common_command common;
> @@ -906,6 +953,7 @@ struct nvme_command {
> struct nvmf_property_set_command prop_set;
> struct nvmf_property_get_command prop_get;
> struct nvme_dbbuf dbbuf;
> + struct nvme_directive_cmd directive;
> };
> };
>
> --
> 2.7.4
>
Cheers, Andreas
* Re: [PATCH 11/11] nvme: add support for streams and directives
2017-06-13 19:47 ` Andreas Dilger
@ 2017-06-13 20:25 ` Jens Axboe
0 siblings, 0 replies; 37+ messages in thread
From: Jens Axboe @ 2017-06-13 20:25 UTC (permalink / raw)
To: Andreas Dilger; +Cc: linux-fsdevel, linux-block
On 06/13/2017 01:47 PM, Andreas Dilger wrote:
> On Jun 13, 2017, at 11:15 AM, Jens Axboe <axboe@kernel.dk> wrote:
>>
>> This adds support for Directives in NVMe, in particular for the Streams
>> directive. We default to allocating 4 streams per namespace, but
>> it is configurable with the 'streams_per_ns' module option.
>>
>> If a write stream is set in a write, flag it as such before
>> sending it to the device.
>>
>> This patch includes some debug output, dumping the streams directive
>> parameters when nvme loads.
>>
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>> ---
>> drivers/nvme/host/core.c | 124 +++++++++++++++++++++++++++++++++++++++++++++++
>> drivers/nvme/host/nvme.h | 1 +
>> include/linux/nvme.h | 48 ++++++++++++++++++
>> 3 files changed, 173 insertions(+)
>>
>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>> index 903d5813023a..81225e7d4176 100644
>> --- a/drivers/nvme/host/core.c
>> +++ b/drivers/nvme/host/core.c
>> @@ -65,6 +65,10 @@ static bool force_apst;
>> module_param(force_apst, bool, 0644);
>> MODULE_PARM_DESC(force_apst, "allow APST for newly enumerated devices even if quirked off");
>>
>> +static char streams_per_ns = 4;
>> +module_param(streams_per_ns, byte, 0644);
>> +MODULE_PARM_DESC(streams_per_ns, "if available, allocate this many streams per NS");
>
> Are there any limits here? For example, has to be a power-of-two value, or maximum
> number per namespace, per device, etc?
It doesn't have to be a power-of-two. Basically the device has X streams
available, and you can assign 1..X of them to a namespace. Since we used
four for the API, I just open four here. That's mostly for testing; in
reality we should probably just split the available streams between
namespaces and be done with it. Either we want to use streams, in which
case we should use all of them, or we don't use them at all.
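The split could be as dumb as this stand-alone sketch (streams_for_ns()
is illustrative only; msl is the controller-wide max streams limit from
the streams parameters):

	#include <stdio.h>

	/* Divide the controller's stream budget evenly across namespaces,
	 * handing the remainder out one-per-namespace to the first few. */
	static unsigned int streams_for_ns(unsigned int msl, unsigned int nr_ns,
					   unsigned int ns_index)
	{
		unsigned int base = msl / nr_ns;
		unsigned int extra = msl % nr_ns;

		return base + (ns_index < extra ? 1 : 0);
	}

	int main(void)
	{
		unsigned int i;

		/* e.g. msl=6 shared by 4 namespaces -> 2, 2, 1, 1 */
		for (i = 0; i < 4; i++)
			printf("ns %u gets %u streams\n", i,
			       streams_for_ns(6, 4, i));
		return 0;
	}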
>> @@ -351,6 +355,15 @@ static inline void nvme_setup_rw(struct nvme_ns *ns, struct request *req,
>> cmnd->rw.slba = cpu_to_le64(nvme_block_nr(ns, blk_rq_pos(req)));
>> cmnd->rw.length = cpu_to_le16((blk_rq_bytes(req) >> ns->lba_shift) - 1);
>>
>> + if (req_op(req) == REQ_OP_WRITE) {
>> + if (bio_stream_valid(req->bio) && ns->streams) {
>> + unsigned stream = bio_stream(req->bio) & 0xffff;
>> +
>> + control |= NVME_RW_DTYPE_STREAMS;
>> + dsmgmt |= (stream << 16);
>
> It isn't really clear how this is converted from the 2^16 stream IDs masked
> out of 2^32 possible streams in the bio into streams_per_ns = 4 streams per
> NVMe device namespace. This appears to be implicitly dependent on upper
> layers submitting only 4 distinct WRITE_LIFE_* stream IDs, but IMHO that
> should be handled transparently at this level. There isn't really any value
> in masking the returned bio_stream with 0xffff, since either it will internally
> be truncated by streams_per_ns (in which case we don't need to do anything),
> or it should be explicitly folded into the accepted range, like:
>
>	dsmgmt |= (bio_stream(req->bio) % streams_per_ns) << 16;
>
> or, if we want to avoid a 32-bit modulus here, we could pre-compute a mask from
> streams_per_ns and enforce that this be a power-of-two value?
Right, and I think we should handle this in the block layer. Have the device
flag how many streams it supports, and just multiplex if we have to. Then the
device can rely on the fact that it never receives a stream ID that's larger
than it supports. Or just MOD it with streams_per_ns like you suggest. I'll
do the latter for now.
The 0xffff mask is typically done in drivers to inform the reader that the
spec has 16 bits set aside for this.
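In block layer terms it'd look roughly like this (the
queue_stream_limits struct and blk_fold_stream() are made up for the
sketch; nothing here is a final interface):

	#include <stdio.h>

	/* Hypothetical stand-in for a per-queue advertised stream count */
	struct queue_stream_limits {
		unsigned int nr_streams;	/* 0 if the device has none */
	};

	/* Done once in the block layer, so drivers never see a stream ID
	 * larger than what they advertised. */
	static unsigned int blk_fold_stream(const struct queue_stream_limits *lim,
					    unsigned int stream)
	{
		if (!lim->nr_streams || !stream)
			return 0;
		return ((stream - 1) % lim->nr_streams) + 1;
	}

	int main(void)
	{
		struct queue_stream_limits lim = { .nr_streams = 4 };
		unsigned int id;

		for (id = 0; id <= 6; id++)
			printf("bio stream %u -> device stream %u\n",
			       id, blk_fold_stream(&lim, id));
		return 0;
	}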
--
Jens Axboe
* Re: [PATCH 11/11] nvme: add support for streams and directives
2017-06-13 17:15 ` [PATCH 11/11] nvme: add support for streams and directives Jens Axboe
2017-06-13 19:47 ` Andreas Dilger
@ 2017-06-13 21:12 ` Andreas Dilger
2017-06-13 21:18 ` Jens Axboe
1 sibling, 1 reply; 37+ messages in thread
From: Andreas Dilger @ 2017-06-13 21:12 UTC (permalink / raw)
To: Jens Axboe; +Cc: linux-fsdevel, linux-block
On Jun 13, 2017, at 11:15 AM, Jens Axboe <axboe@kernel.dk> wrote:
> This adds support for Directives in NVMe, in particular for the Streams
> directive. We default to allocating 4 streams per namespace, but
> it is configurable with the 'streams_per_ns' module option.
>
> If a write stream is set in a write, flag it as such before
> sending it to the device.
>
> This patch includes some debug output, dumping the streams directive
> parameters when nvme loads.
>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
> drivers/nvme/host/core.c | 124 +++++++++++++++++++++++++++++++++++++++++++++++
> drivers/nvme/host/nvme.h | 1 +
> include/linux/nvme.h | 48 ++++++++++++++++++
> 3 files changed, 173 insertions(+)
>
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 903d5813023a..81225e7d4176 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -1073,6 +1086,109 @@ static int nvme_revalidate_disk(struct gendisk *disk)
> return 0;
> }
>
> +static int nvme_enable_streams(struct nvme_ns *ns)
> +{
> + struct nvme_command c;
> +
> + memset(&c, 0, sizeof(c));
> +
> + c.directive.opcode = nvme_admin_directive_send;
> + c.directive.nsid = cpu_to_le32(ns->ns_id);
> + c.directive.doper = NVME_DIR_SND_ID_OP_ENABLE;
> + c.directive.dtype = NVME_DIR_IDENTIFY;
> + c.directive.tdtype = NVME_DIR_STREAMS;
> + c.directive.endir = NVME_DIR_ENDIR;
> +
> + return nvme_submit_sync_cmd(ns->ctrl->admin_q, &c, NULL, 0);
> +}
> +
> +static int nvme_streams_params(struct nvme_ns *ns)
> +{
> + struct nvme_ctrl *ctrl = ns->ctrl;
> + struct streams_directive_params s;
> + struct nvme_command c;
> + int ret;
> +
> + memset(&c, 0, sizeof(c));
> + memset(&s, 0, sizeof(s));
Minor nit in a few places in this code - is it more efficient to initialize
these structs at declaration time instead of calling memset(), like:

	struct streams_directive_params s = {};
	struct nvme_command c = {};

or alternatively initializing them fully at declaration time to avoid zeroing
fields that are immediately being set:

	struct nvme_command c = {
		.directive.opcode = nvme_admin_directive_recv,
		.directive.nsid = cpu_to_le32(ns->ns_id),
		.directive.numd = sizeof(s),
		.directive.doper = NVME_DIR_RCV_ST_OP_PARAM,
		.directive.dtype = NVME_DIR_STREAMS,
	};
Same for the other functions here.
Cheers, Andreas
> +
> + c.directive.opcode = nvme_admin_directive_recv;
> + c.directive.nsid = cpu_to_le32(ns->ns_id);
> + c.directive.numd = sizeof(s);
> + c.directive.doper = NVME_DIR_RCV_ST_OP_PARAM;
> + c.directive.dtype = NVME_DIR_STREAMS;
> +
> + ret = nvme_submit_sync_cmd(ctrl->admin_q, &c, &s, sizeof(s));
> + if (ret)
> + return ret;
> +
> + s.msl = le16_to_cpu(s.msl);
> + s.nssa = le16_to_cpu(s.nssa);
> + s.nsso = le16_to_cpu(s.nsso);
> + s.sws = le32_to_cpu(s.sws);
> + s.sgs = le16_to_cpu(s.sgs);
> + s.nsa = le16_to_cpu(s.nsa);
> + s.nso = le16_to_cpu(s.nso);
> +
> + dev_info(ctrl->device, "streams: msl=%u, nssa=%u, nsso=%u, sws=%u "
> + "sgs=%u, nsa=%u, nso=%u\n", s.msl, s.nssa,
> + s.nsso, s.sws, s.sgs, s.nsa, s.nso);
> + return 0;
> +}
> +
> +static int nvme_streams_allocate(struct nvme_ns *ns, unsigned int streams)
> +{
> + struct nvme_command c;
> +
> + memset(&c, 0, sizeof(c));
> +
> + c.directive.opcode = nvme_admin_directive_recv;
> + c.directive.nsid = cpu_to_le32(ns->ns_id);
> + c.directive.doper = NVME_DIR_RCV_ST_OP_RESOURCE;
> + c.directive.dtype = NVME_DIR_STREAMS;
> + c.directive.endir = streams;
> +
> + return nvme_submit_sync_cmd(ns->ctrl->admin_q, &c, NULL, 0);
> +}
> +
> +static int nvme_streams_deallocate(struct nvme_ns *ns)
> +{
> + struct nvme_command c;
> +
> + memset(&c, 0, sizeof(c));
> +
> + c.directive.opcode = nvme_admin_directive_send;
> + c.directive.nsid = cpu_to_le32(ns->ns_id);
> + c.directive.doper = NVME_DIR_SND_ST_OP_REL_RSC;
> + c.directive.dtype = NVME_DIR_STREAMS;
> +
> + return nvme_submit_sync_cmd(ns->ctrl->admin_q, &c, NULL, 0);
> +}
> +
> +static void nvme_config_streams(struct nvme_ns *ns)
> +{
> + int ret;
> +
> + ret = nvme_enable_streams(ns);
> + if (ret)
> + return;
> +
> + ret = nvme_streams_params(ns);
> + if (ret)
> + return;
> +
> + ret = nvme_streams_allocate(ns, streams_per_ns);
> + if (ret)
> + return;
> +
> + ret = nvme_streams_params(ns);
> + if (ret)
> + return;
> +
> + ns->streams = true;
> + dev_info(ns->ctrl->device, "successfully enabled streams\n");
> +}
> +
> static char nvme_pr_type(enum pr_type type)
> {
> switch (type) {
> @@ -1606,6 +1722,9 @@ int nvme_init_identify(struct nvme_ctrl *ctrl)
> ctrl->sgls = le32_to_cpu(id->sgls);
> ctrl->kas = le16_to_cpu(id->kas);
>
> + if (ctrl->oacs & NVME_CTRL_OACS_DIRECTIVES)
> + dev_info(ctrl->dev, "supports directives\n");
> +
> ctrl->npss = id->npss;
> prev_apsta = ctrl->apsta;
> if (ctrl->quirks & NVME_QUIRK_NO_APST) {
> @@ -2060,6 +2179,9 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
> goto out_free_id;
> }
>
> + if (ctrl->oacs & NVME_CTRL_OACS_DIRECTIVES)
> + nvme_config_streams(ns);
> +
> disk = alloc_disk_node(0, node);
> if (!disk)
> goto out_free_id;
> @@ -2112,6 +2234,8 @@ static void nvme_ns_remove(struct nvme_ns *ns)
> &nvme_ns_attr_group);
> if (ns->ndev)
> nvme_nvm_unregister_sysfs(ns);
> + if (ns->streams)
> + nvme_streams_deallocate(ns);
> del_gendisk(ns->disk);
> blk_cleanup_queue(ns->queue);
> }
> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> index 9d6a070d4391..c2d8d23c90de 100644
> --- a/drivers/nvme/host/nvme.h
> +++ b/drivers/nvme/host/nvme.h
> @@ -195,6 +195,7 @@ struct nvme_ns {
> int lba_shift;
> u16 ms;
> bool ext;
> + bool streams;
> u8 pi_type;
> unsigned long flags;
>
> diff --git a/include/linux/nvme.h b/include/linux/nvme.h
> index b625bacf37ef..8b2f5b140134 100644
> --- a/include/linux/nvme.h
> +++ b/include/linux/nvme.h
> @@ -245,6 +245,7 @@ enum {
> NVME_CTRL_ONCS_WRITE_ZEROES = 1 << 3,
> NVME_CTRL_VWC_PRESENT = 1 << 0,
> NVME_CTRL_OACS_SEC_SUPP = 1 << 0,
> + NVME_CTRL_OACS_DIRECTIVES = 1 << 5,
> NVME_CTRL_OACS_DBBUF_SUPP = 1 << 7,
> };
>
> @@ -295,6 +296,19 @@ enum {
> };
>
> enum {
> + NVME_DIR_IDENTIFY = 0x00,
> + NVME_DIR_STREAMS = 0x01,
> + NVME_DIR_SND_ID_OP_ENABLE = 0x01,
> + NVME_DIR_SND_ST_OP_REL_ID = 0x01,
> + NVME_DIR_SND_ST_OP_REL_RSC = 0x02,
> + NVME_DIR_RCV_ID_OP_PARAM = 0x01,
> + NVME_DIR_RCV_ST_OP_PARAM = 0x01,
> + NVME_DIR_RCV_ST_OP_STATUS = 0x02,
> + NVME_DIR_RCV_ST_OP_RESOURCE = 0x03,
> + NVME_DIR_ENDIR = 0x01,
> +};
> +
> +enum {
> NVME_NS_FEAT_THIN = 1 << 0,
> NVME_NS_FLBAS_LBA_MASK = 0xf,
> NVME_NS_FLBAS_META_EXT = 0x10,
> @@ -535,6 +549,7 @@ enum {
> NVME_RW_PRINFO_PRCHK_APP = 1 << 11,
> NVME_RW_PRINFO_PRCHK_GUARD = 1 << 12,
> NVME_RW_PRINFO_PRACT = 1 << 13,
> + NVME_RW_DTYPE_STREAMS = 1 << 4,
> };
>
> struct nvme_dsm_cmd {
> @@ -604,6 +619,8 @@ enum nvme_admin_opcode {
> nvme_admin_download_fw = 0x11,
> nvme_admin_ns_attach = 0x15,
> nvme_admin_keep_alive = 0x18,
> + nvme_admin_directive_send = 0x19,
> + nvme_admin_directive_recv = 0x1a,
> nvme_admin_dbbuf = 0x7C,
> nvme_admin_format_nvm = 0x80,
> nvme_admin_security_send = 0x81,
> @@ -756,6 +773,24 @@ struct nvme_get_log_page_command {
> __u32 rsvd14[2];
> };
>
> +struct nvme_directive_cmd {
> + __u8 opcode;
> + __u8 flags;
> + __u16 command_id;
> + __le32 nsid;
> + __u64 rsvd2[2];
> + union nvme_data_ptr dptr;
> + __le32 numd;
> + __u8 doper;
> + __u8 dtype;
> + __le16 dspec;
> + __u8 endir;
> + __u8 tdtype;
> + __u16 rsvd15;
> +
> + __u32 rsvd16[3];
> +};
> +
> /*
> * Fabrics subcommands.
> */
> @@ -886,6 +921,18 @@ struct nvme_dbbuf {
> __u32 rsvd12[6];
> };
>
> +struct streams_directive_params {
> + __u16 msl;
> + __u16 nssa;
> + __u16 nsso;
> + __u8 rsvd[10];
> + __u32 sws;
> + __u16 sgs;
> + __u16 nsa;
> + __u16 nso;
> + __u8 rsvd2[6];
> +};
> +
> struct nvme_command {
> union {
> struct nvme_common_command common;
> @@ -906,6 +953,7 @@ struct nvme_command {
> struct nvmf_property_set_command prop_set;
> struct nvmf_property_get_command prop_get;
> struct nvme_dbbuf dbbuf;
> + struct nvme_directive_cmd directive;
> };
> };
>
> --
> 2.7.4
>
Cheers, Andreas
* Re: [PATCH 11/11] nvme: add support for streams and directives
2017-06-13 21:12 ` Andreas Dilger
@ 2017-06-13 21:18 ` Jens Axboe
0 siblings, 0 replies; 37+ messages in thread
From: Jens Axboe @ 2017-06-13 21:18 UTC (permalink / raw)
To: Andreas Dilger; +Cc: linux-fsdevel, linux-block
On 06/13/2017 03:12 PM, Andreas Dilger wrote:
>> +static int nvme_streams_params(struct nvme_ns *ns)
>> +{
>> + struct nvme_ctrl *ctrl = ns->ctrl;
>> + struct streams_directive_params s;
>> + struct nvme_command c;
>> + int ret;
>> +
>> + memset(&c, 0, sizeof(c));
>> + memset(&s, 0, sizeof(s));
>
> Minor nit in a few places in this code - is it more efficient to initialize
> these structs at declaration time instead of calling memset(), like:
>
>	struct streams_directive_params s = {};
>	struct nvme_command c = {};
>
> or alternatively initializing them fully at declaration time to avoid zeroing
> fields that are immediately being set:
>
>	struct nvme_command c = {
>		.directive.opcode = nvme_admin_directive_recv,
>		.directive.nsid = cpu_to_le32(ns->ns_id),
>		.directive.numd = sizeof(s),
>		.directive.doper = NVME_DIR_RCV_ST_OP_PARAM,
>		.directive.dtype = NVME_DIR_STREAMS,
>	};
I'm just following the style of the driver for that part.
--
Jens Axboe