* [PATCH 01/10] block: add support for carrying stream information in a bio
2017-06-14 4:01 [PATCHSET v2] Add support for write life time hints Jens Axboe
@ 2017-06-14 4:01 ` Jens Axboe
2017-06-14 4:01 ` [PATCH 02/10] blk-mq: expose stream write stats through debugfs Jens Axboe
` (9 subsequent siblings)
10 siblings, 0 replies; 32+ messages in thread
From: Jens Axboe @ 2017-06-14 4:01 UTC (permalink / raw)
To: linux-fsdevel, linux-block; +Cc: adilger, Jens Axboe
No functional changes in this patch, but it allows each bio to
carry a stream ID and disallows merging of bio's with different
stream IDs.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
block/bio.c | 2 ++
block/blk-merge.c | 14 ++++++++++++++
include/linux/blk_types.h | 17 +++++++++++++++++
3 files changed, 33 insertions(+)
diff --git a/block/bio.c b/block/bio.c
index 888e7801c638..77f4be1f7428 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -595,6 +595,7 @@ void __bio_clone_fast(struct bio *bio, struct bio *bio_src)
bio->bi_opf = bio_src->bi_opf;
bio->bi_iter = bio_src->bi_iter;
bio->bi_io_vec = bio_src->bi_io_vec;
+ bio->bi_stream = bio_src->bi_stream;
bio_clone_blkcg_association(bio, bio_src);
}
@@ -678,6 +679,7 @@ struct bio *bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
bio->bi_opf = bio_src->bi_opf;
bio->bi_iter.bi_sector = bio_src->bi_iter.bi_sector;
bio->bi_iter.bi_size = bio_src->bi_iter.bi_size;
+ bio->bi_stream = bio_src->bi_stream;
switch (bio_op(bio)) {
case REQ_OP_DISCARD:
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 3990ae406341..28998acdd675 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -693,6 +693,13 @@ static struct request *attempt_merge(struct request_queue *q,
return NULL;
/*
+ * Don't allow merge of different streams, or for a stream with
+ * non-stream IO.
+ */
+ if (req->bio->bi_stream != next->bio->bi_stream)
+ return NULL;
+
+ /*
* If we are allowed to merge, then append bio list
* from next to rq and release next. merge_requests_fn
* will have updated segment counts, update sector
@@ -811,6 +818,13 @@ bool blk_rq_merge_ok(struct request *rq, struct bio *bio)
!blk_write_same_mergeable(rq->bio, bio))
return false;
+ /*
+ * Don't allow merge of different streams, or for a stream with
+ * non-stream IO.
+ */
+ if (rq->bio->bi_stream != bio->bi_stream)
+ return false;
+
return true;
}
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 61339bc44400..1940876a7fa7 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -36,6 +36,8 @@ struct bio {
unsigned short bi_flags; /* status, etc and bvec pool number */
unsigned short bi_ioprio;
+ unsigned int bi_stream; /* write life time hint */
+
struct bvec_iter bi_iter;
/* Number of segments in this BIO after
@@ -312,4 +314,19 @@ struct blk_rq_stat {
u64 batch;
};
+static inline void bio_set_streamid(struct bio *bio, unsigned int stream)
+{
+ bio->bi_stream = stream;
+}
+
+static inline bool bio_stream_valid(struct bio *bio)
+{
+ return bio->bi_stream != 0;
+}
+
+static inline unsigned int bio_stream(struct bio *bio)
+{
+ return bio->bi_stream;
+}
+
#endif /* __LINUX_BLK_TYPES_H */
--
2.7.4
^ permalink raw reply related [flat|nested] 32+ messages in thread
* [PATCH 02/10] blk-mq: expose stream write stats through debugfs
2017-06-14 4:01 [PATCHSET v2] Add support for write life time hints Jens Axboe
2017-06-14 4:01 ` [PATCH 01/10] block: add support for carrying stream information in a bio Jens Axboe
@ 2017-06-14 4:01 ` Jens Axboe
2017-06-14 4:07 ` Andreas Dilger
2017-06-14 4:01 ` [PATCH 03/10] fs: add support for an inode to carry stream related data Jens Axboe
` (8 subsequent siblings)
10 siblings, 1 reply; 32+ messages in thread
From: Jens Axboe @ 2017-06-14 4:01 UTC (permalink / raw)
To: linux-fsdevel, linux-block; +Cc: adilger, Jens Axboe
Useful to verify that things are working the way they should.
Reading the file will return number of kb written to each
stream. Writing the file will reset the statistics. No care
is taken to ensure that we don't race on updates.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
block/blk-core.c | 5 +++++
block/blk-mq-debugfs.c | 24 ++++++++++++++++++++++++
include/linux/blkdev.h | 3 +++
3 files changed, 32 insertions(+)
diff --git a/block/blk-core.c b/block/blk-core.c
index a7421b772d0e..81051e24b5dd 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2057,6 +2057,11 @@ blk_qc_t generic_make_request(struct bio *bio)
do {
struct request_queue *q = bdev_get_queue(bio->bi_bdev);
+ if (bio_stream(bio) < BLK_MAX_STREAM) {
+ q->stream_writes[bio_stream(bio)] +=
+ bio->bi_iter.bi_size >> 9;
+ }
+
if (likely(blk_queue_enter(q, false) == 0)) {
struct bio_list lower, same;
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 803aed4d7221..0a37c848961d 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -133,6 +133,29 @@ static void print_stat(struct seq_file *m, struct blk_rq_stat *stat)
}
}
+static int queue_streams_show(void *data, struct seq_file *m)
+{
+ struct request_queue *q = data;
+ int i;
+
+ for (i = 0; i < BLK_MAX_STREAM; i++)
+ seq_printf(m, "stream%d: %llu\n", i, q->stream_writes[i]);
+
+ return 0;
+}
+
+static ssize_t queue_streams_store(void *data, const char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ struct request_queue *q = data;
+ int i;
+
+ for (i = 0; i < BLK_MAX_STREAM; i++)
+ q->stream_writes[i] = 0;
+
+ return count;
+}
+
static int queue_poll_stat_show(void *data, struct seq_file *m)
{
struct request_queue *q = data;
@@ -656,6 +679,7 @@ const struct file_operations blk_mq_debugfs_fops = {
static const struct blk_mq_debugfs_attr blk_mq_debugfs_queue_attrs[] = {
{"poll_stat", 0400, queue_poll_stat_show},
{"state", 0600, queue_state_show, queue_state_write},
+ {"streams", 0600, queue_streams_show, queue_streams_store},
{},
};
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index ab92c4ea138b..88719c6f3edf 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -586,6 +586,9 @@ struct request_queue {
size_t cmd_size;
void *rq_alloc_data;
+
+#define BLK_MAX_STREAM 5
+ u64 stream_writes[BLK_MAX_STREAM];
};
#define QUEUE_FLAG_QUEUED 1 /* uses generic tag queueing */
--
2.7.4
^ permalink raw reply related [flat|nested] 32+ messages in thread
* Re: [PATCH 02/10] blk-mq: expose stream write stats through debugfs
2017-06-14 4:01 ` [PATCH 02/10] blk-mq: expose stream write stats through debugfs Jens Axboe
@ 2017-06-14 4:07 ` Andreas Dilger
0 siblings, 0 replies; 32+ messages in thread
From: Andreas Dilger @ 2017-06-14 4:07 UTC (permalink / raw)
To: Jens Axboe; +Cc: linux-fsdevel, linux-block
[-- Attachment #1: Type: text/plain, Size: 2860 bytes --]
On Jun 13, 2017, at 10:01 PM, Jens Axboe <axboe@kernel.dk> wrote:
>
> Useful to verify that things are working the way they should.
> Reading the file will return number of kb written to each
> stream. Writing the file will reset the statistics. No care
> is taken to ensure that we don't race on updates.
>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
> ---
> block/blk-core.c | 5 +++++
> block/blk-mq-debugfs.c | 24 ++++++++++++++++++++++++
> include/linux/blkdev.h | 3 +++
> 3 files changed, 32 insertions(+)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index a7421b772d0e..81051e24b5dd 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2057,6 +2057,11 @@ blk_qc_t generic_make_request(struct bio *bio)
> do {
> struct request_queue *q = bdev_get_queue(bio->bi_bdev);
>
> + if (bio_stream(bio) < BLK_MAX_STREAM) {
> + q->stream_writes[bio_stream(bio)] +=
> + bio->bi_iter.bi_size >> 9;
> + }
> +
> if (likely(blk_queue_enter(q, false) == 0)) {
> struct bio_list lower, same;
>
> diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> index 803aed4d7221..0a37c848961d 100644
> --- a/block/blk-mq-debugfs.c
> +++ b/block/blk-mq-debugfs.c
> @@ -133,6 +133,29 @@ static void print_stat(struct seq_file *m, struct blk_rq_stat *stat)
> }
> }
>
> +static int queue_streams_show(void *data, struct seq_file *m)
> +{
> + struct request_queue *q = data;
> + int i;
> +
> + for (i = 0; i < BLK_MAX_STREAM; i++)
> + seq_printf(m, "stream%d: %llu\n", i, q->stream_writes[i]);
> +
> + return 0;
> +}
> +
> +static ssize_t queue_streams_store(void *data, const char __user *buf,
> + size_t count, loff_t *ppos)
> +{
> + struct request_queue *q = data;
> + int i;
> +
> + for (i = 0; i < BLK_MAX_STREAM; i++)
> + q->stream_writes[i] = 0;
> +
> + return count;
> +}
> +
> static int queue_poll_stat_show(void *data, struct seq_file *m)
> {
> struct request_queue *q = data;
> @@ -656,6 +679,7 @@ const struct file_operations blk_mq_debugfs_fops = {
> static const struct blk_mq_debugfs_attr blk_mq_debugfs_queue_attrs[] = {
> {"poll_stat", 0400, queue_poll_stat_show},
> {"state", 0600, queue_state_show, queue_state_write},
> + {"streams", 0600, queue_streams_show, queue_streams_store},
> {},
> };
>
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index ab92c4ea138b..88719c6f3edf 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -586,6 +586,9 @@ struct request_queue {
>
> size_t cmd_size;
> void *rq_alloc_data;
> +
> +#define BLK_MAX_STREAM 5
> + u64 stream_writes[BLK_MAX_STREAM];
> };
>
> #define QUEUE_FLAG_QUEUED 1 /* uses generic tag queueing */
> --
> 2.7.4
>
Cheers, Andreas
[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 195 bytes --]
^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH 03/10] fs: add support for an inode to carry stream related data
2017-06-14 4:01 [PATCHSET v2] Add support for write life time hints Jens Axboe
2017-06-14 4:01 ` [PATCH 01/10] block: add support for carrying stream information in a bio Jens Axboe
2017-06-14 4:01 ` [PATCH 02/10] blk-mq: expose stream write stats through debugfs Jens Axboe
@ 2017-06-14 4:01 ` Jens Axboe
2017-06-14 4:01 ` [PATCH 04/10] fs: add support for allowing applications to pass in write life time hints Jens Axboe
` (7 subsequent siblings)
10 siblings, 0 replies; 32+ messages in thread
From: Jens Axboe @ 2017-06-14 4:01 UTC (permalink / raw)
To: linux-fsdevel, linux-block; +Cc: adilger, Jens Axboe
No functional changes in this patch, just in preparation for
allowing applications to pass in hints about data life times
for writes.
Pack the i_stream field into a 2-byte hole, so we don't grow
the size of the inode.
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
fs/inode.c | 1 +
include/linux/fs.h | 9 +++++++++
2 files changed, 10 insertions(+)
diff --git a/fs/inode.c b/fs/inode.c
index db5914783a71..5717cb378fda 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -149,6 +149,7 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
inode->i_blocks = 0;
inode->i_bytes = 0;
inode->i_generation = 0;
+ inode->i_stream = 0;
inode->i_pipe = NULL;
inode->i_bdev = NULL;
inode->i_cdev = NULL;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 803e5a9b2654..771e172d23d7 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -591,6 +591,7 @@ struct inode {
struct timespec i_ctime;
spinlock_t i_lock; /* i_blocks, i_bytes, maybe i_size */
unsigned short i_bytes;
+ unsigned short i_stream;
unsigned int i_blkbits;
blkcnt_t i_blocks;
@@ -655,6 +656,14 @@ struct inode {
void *i_private; /* fs or device private pointer */
};
+static inline unsigned int inode_streamid(struct inode *inode)
+{
+ if (inode)
+ return inode->i_stream;
+
+ return 0;
+}
+
static inline unsigned int i_blocksize(const struct inode *node)
{
return (1 << node->i_blkbits);
--
2.7.4
^ permalink raw reply related [flat|nested] 32+ messages in thread
* [PATCH 04/10] fs: add support for allowing applications to pass in write life time hints
2017-06-14 4:01 [PATCHSET v2] Add support for write life time hints Jens Axboe
` (2 preceding siblings ...)
2017-06-14 4:01 ` [PATCH 03/10] fs: add support for an inode to carry stream related data Jens Axboe
@ 2017-06-14 4:01 ` Jens Axboe
2017-06-14 4:06 ` Andreas Dilger
2017-06-14 4:01 ` [PATCH 05/10] fs: add O_DIRECT support for sending down bio stream information Jens Axboe
` (6 subsequent siblings)
10 siblings, 1 reply; 32+ messages in thread
From: Jens Axboe @ 2017-06-14 4:01 UTC (permalink / raw)
To: linux-fsdevel, linux-block; +Cc: adilger, Jens Axboe
Add four flags for the pwritev2(2) system call, allowing an application
to give the kernel a hint about what on-media life times can be
expected from a given write.
The intent is for these values to be relative to each other, no
absolute meaning should be attached to these flag names.
Define IOCB flags to carry this information over, and finally
transform them into the block defined stream values.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
fs/read_write.c | 9 ++++++++-
include/linux/fs.h | 12 ++++++++++++
include/uapi/linux/fs.h | 10 ++++++++++
3 files changed, 30 insertions(+), 1 deletion(-)
diff --git a/fs/read_write.c b/fs/read_write.c
index 47c1d4484df9..a9b0c3125e0f 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -678,7 +678,7 @@ static ssize_t do_iter_readv_writev(struct file *filp, struct iov_iter *iter,
struct kiocb kiocb;
ssize_t ret;
- if (flags & ~(RWF_HIPRI | RWF_DSYNC | RWF_SYNC))
+ if (flags & ~(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_WRITE_LIFE_MASK))
return -EOPNOTSUPP;
init_sync_kiocb(&kiocb, filp);
@@ -688,6 +688,13 @@ static ssize_t do_iter_readv_writev(struct file *filp, struct iov_iter *iter,
kiocb.ki_flags |= IOCB_DSYNC;
if (flags & RWF_SYNC)
kiocb.ki_flags |= (IOCB_DSYNC | IOCB_SYNC);
+ if (flags & RWF_WRITE_LIFE_MASK) {
+ struct inode *inode = file_inode(filp);
+
+ inode->i_stream = (flags & RWF_WRITE_LIFE_MASK) >>
+ RWF_WRITE_LIFE_SHIFT;
+ kiocb.ki_flags |= inode->i_stream << IOCB_WRITE_LIFE_SHIFT;
+ }
kiocb.ki_pos = *ppos;
if (type == READ)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 771e172d23d7..751a1046e87b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -269,6 +269,12 @@ struct writeback_control;
#define IOCB_SYNC (1 << 5)
#define IOCB_WRITE (1 << 6)
+/*
+ * Steal 4-bits for stream information, this allows 16 valid streams
+ */
+#define IOCB_WRITE_LIFE_SHIFT 7
+#define IOCB_WRITE_LIFE_MASK (BIT(7) | BIT(8) | BIT(9) | BIT(10))
+
struct kiocb {
struct file *ki_filp;
loff_t ki_pos;
@@ -292,6 +298,12 @@ static inline void init_sync_kiocb(struct kiocb *kiocb, struct file *filp)
};
}
+static inline int iocb_streamid(const struct kiocb *iocb)
+{
+ return (iocb->ki_flags & IOCB_WRITE_LIFE_MASK) >>
+ IOCB_WRITE_LIFE_SHIFT;
+}
+
/*
* "descriptor" for what we're up to with a read.
* This allows us to use the same read code yet
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 24e61a54feaa..8a7e6f26f6f5 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -361,4 +361,14 @@ struct fscrypt_key {
#define RWF_DSYNC 0x00000002 /* per-IO O_DSYNC */
#define RWF_SYNC 0x00000004 /* per-IO O_SYNC */
+/*
+ * Data life time write flags, steal 4 bits for that
+ */
+#define RWF_WRITE_LIFE_SHIFT 3
+#define RWF_WRITE_LIFE_MASK 0x000000f0 /* 4 bits of stream ID */
+#define RWF_WRITE_LIFE_SHORT (1 << RWF_WRITE_LIFE_SHIFT)
+#define RWF_WRITE_LIFE_MEDIUM (2 << RWF_WRITE_LIFE_SHIFT)
+#define RWF_WRITE_LIFE_LONG (3 << RWF_WRITE_LIFE_SHIFT)
+#define RWF_WRITE_LIFE_EXTREME (4 << RWF_WRITE_LIFE_SHIFT)
+
#endif /* _UAPI_LINUX_FS_H */
--
2.7.4
^ permalink raw reply related [flat|nested] 32+ messages in thread
* Re: [PATCH 04/10] fs: add support for allowing applications to pass in write life time hints
2017-06-14 4:01 ` [PATCH 04/10] fs: add support for allowing applications to pass in write life time hints Jens Axboe
@ 2017-06-14 4:06 ` Andreas Dilger
0 siblings, 0 replies; 32+ messages in thread
From: Andreas Dilger @ 2017-06-14 4:06 UTC (permalink / raw)
To: Jens Axboe; +Cc: linux-fsdevel, linux-block
[-- Attachment #1: Type: text/plain, Size: 3571 bytes --]
On Jun 13, 2017, at 10:01 PM, Jens Axboe <axboe@kernel.dk> wrote:
>
> Add four flags for the pwritev2(2) system call, allowing an application
> to give the kernel a hint about what on-media life times can be
> expected from a given write.
>
> The intent is for these values to be relative to each other, no
> absolute meaning should be attached to these flag names.
>
> Define IOCB flags to carry this information over, and finally
> transform them into the block defined stream values.
>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
> ---
> fs/read_write.c | 9 ++++++++-
> include/linux/fs.h | 12 ++++++++++++
> include/uapi/linux/fs.h | 10 ++++++++++
> 3 files changed, 30 insertions(+), 1 deletion(-)
>
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 47c1d4484df9..a9b0c3125e0f 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -678,7 +678,7 @@ static ssize_t do_iter_readv_writev(struct file *filp, struct iov_iter *iter,
> struct kiocb kiocb;
> ssize_t ret;
>
> - if (flags & ~(RWF_HIPRI | RWF_DSYNC | RWF_SYNC))
> + if (flags & ~(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_WRITE_LIFE_MASK))
> return -EOPNOTSUPP;
>
> init_sync_kiocb(&kiocb, filp);
> @@ -688,6 +688,13 @@ static ssize_t do_iter_readv_writev(struct file *filp, struct iov_iter *iter,
> kiocb.ki_flags |= IOCB_DSYNC;
> if (flags & RWF_SYNC)
> kiocb.ki_flags |= (IOCB_DSYNC | IOCB_SYNC);
> + if (flags & RWF_WRITE_LIFE_MASK) {
> + struct inode *inode = file_inode(filp);
> +
> + inode->i_stream = (flags & RWF_WRITE_LIFE_MASK) >>
> + RWF_WRITE_LIFE_SHIFT;
> + kiocb.ki_flags |= inode->i_stream << IOCB_WRITE_LIFE_SHIFT;
> + }
> kiocb.ki_pos = *ppos;
>
> if (type == READ)
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 771e172d23d7..751a1046e87b 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -269,6 +269,12 @@ struct writeback_control;
> #define IOCB_SYNC (1 << 5)
> #define IOCB_WRITE (1 << 6)
>
> +/*
> + * Steal 4-bits for stream information, this allows 16 valid streams
> + */
> +#define IOCB_WRITE_LIFE_SHIFT 7
> +#define IOCB_WRITE_LIFE_MASK (BIT(7) | BIT(8) | BIT(9) | BIT(10))
> +
> struct kiocb {
> struct file *ki_filp;
> loff_t ki_pos;
> @@ -292,6 +298,12 @@ static inline void init_sync_kiocb(struct kiocb *kiocb, struct file *filp)
> };
> }
>
> +static inline int iocb_streamid(const struct kiocb *iocb)
> +{
> + return (iocb->ki_flags & IOCB_WRITE_LIFE_MASK) >>
> + IOCB_WRITE_LIFE_SHIFT;
> +}
> +
> /*
> * "descriptor" for what we're up to with a read.
> * This allows us to use the same read code yet
> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index 24e61a54feaa..8a7e6f26f6f5 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -361,4 +361,14 @@ struct fscrypt_key {
> #define RWF_DSYNC 0x00000002 /* per-IO O_DSYNC */
> #define RWF_SYNC 0x00000004 /* per-IO O_SYNC */
>
> +/*
> + * Data life time write flags, steal 4 bits for that
> + */
> +#define RWF_WRITE_LIFE_SHIFT 3
> +#define RWF_WRITE_LIFE_MASK 0x000000f0 /* 4 bits of stream ID */
> +#define RWF_WRITE_LIFE_SHORT (1 << RWF_WRITE_LIFE_SHIFT)
> +#define RWF_WRITE_LIFE_MEDIUM (2 << RWF_WRITE_LIFE_SHIFT)
> +#define RWF_WRITE_LIFE_LONG (3 << RWF_WRITE_LIFE_SHIFT)
> +#define RWF_WRITE_LIFE_EXTREME (4 << RWF_WRITE_LIFE_SHIFT)
> +
> #endif /* _UAPI_LINUX_FS_H */
> --
> 2.7.4
>
Cheers, Andreas
[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 195 bytes --]
^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH 05/10] fs: add O_DIRECT support for sending down bio stream information
2017-06-14 4:01 [PATCHSET v2] Add support for write life time hints Jens Axboe
` (3 preceding siblings ...)
2017-06-14 4:01 ` [PATCH 04/10] fs: add support for allowing applications to pass in write life time hints Jens Axboe
@ 2017-06-14 4:01 ` Jens Axboe
2017-06-14 4:01 ` [PATCH 06/10] fs: add support for buffered writeback to pass down " Jens Axboe
` (5 subsequent siblings)
10 siblings, 0 replies; 32+ messages in thread
From: Jens Axboe @ 2017-06-14 4:01 UTC (permalink / raw)
To: linux-fsdevel, linux-block; +Cc: adilger, Jens Axboe
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
fs/block_dev.c | 2 ++
fs/direct-io.c | 2 ++
fs/iomap.c | 1 +
3 files changed, 5 insertions(+)
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 519599dddd36..284b8a786283 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -227,6 +227,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
bio.bi_iter.bi_sector = pos >> 9;
bio.bi_private = current;
bio.bi_end_io = blkdev_bio_end_io_simple;
+ bio.bi_stream = iocb_streamid(iocb);
ret = bio_iov_iter_get_pages(&bio, iter);
if (unlikely(ret))
@@ -360,6 +361,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
bio->bi_iter.bi_sector = pos >> 9;
bio->bi_private = dio;
bio->bi_end_io = blkdev_bio_end_io;
+ bio->bi_stream = iocb_streamid(iocb);
ret = bio_iov_iter_get_pages(bio, iter);
if (unlikely(ret)) {
diff --git a/fs/direct-io.c b/fs/direct-io.c
index a04ebea77de8..c9c8b9fd4329 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -386,6 +386,8 @@ dio_bio_alloc(struct dio *dio, struct dio_submit *sdio,
else
bio->bi_end_io = dio_bio_end_io;
+ bio->bi_stream = iocb_streamid(dio->iocb);
+
sdio->bio = bio;
sdio->logical_offset_in_bio = sdio->cur_page_fs_offset;
}
diff --git a/fs/iomap.c b/fs/iomap.c
index 4b10892967a5..fa7d29632fbc 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -804,6 +804,7 @@ iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
if (dio->flags & IOMAP_DIO_WRITE) {
bio_set_op_attrs(bio, REQ_OP_WRITE, REQ_SYNC | REQ_IDLE);
+ bio_set_streamid(bio, inode_streamid(inode));
task_io_account_write(bio->bi_iter.bi_size);
} else {
bio_set_op_attrs(bio, REQ_OP_READ, 0);
--
2.7.4
^ permalink raw reply related [flat|nested] 32+ messages in thread
* [PATCH 06/10] fs: add support for buffered writeback to pass down stream information
2017-06-14 4:01 [PATCHSET v2] Add support for write life time hints Jens Axboe
` (4 preceding siblings ...)
2017-06-14 4:01 ` [PATCH 05/10] fs: add O_DIRECT support for sending down bio stream information Jens Axboe
@ 2017-06-14 4:01 ` Jens Axboe
2017-06-14 4:01 ` [PATCH 07/10] ext4: add support for passing in stream information for buffered writes Jens Axboe
` (4 subsequent siblings)
10 siblings, 0 replies; 32+ messages in thread
From: Jens Axboe @ 2017-06-14 4:01 UTC (permalink / raw)
To: linux-fsdevel, linux-block; +Cc: adilger, Jens Axboe
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
fs/buffer.c | 14 +++++++++-----
fs/mpage.c | 1 +
2 files changed, 10 insertions(+), 5 deletions(-)
diff --git a/fs/buffer.c b/fs/buffer.c
index 161be58c5cb0..8324c24751ca 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -49,7 +49,7 @@
static int fsync_buffers_list(spinlock_t *lock, struct list_head *list);
static int submit_bh_wbc(int op, int op_flags, struct buffer_head *bh,
- struct writeback_control *wbc);
+ unsigned int stream, struct writeback_control *wbc);
#define BH_ENTRY(list) list_entry((list), struct buffer_head, b_assoc_buffers)
@@ -1829,7 +1829,8 @@ int __block_write_full_page(struct inode *inode, struct page *page,
do {
struct buffer_head *next = bh->b_this_page;
if (buffer_async_write(bh)) {
- submit_bh_wbc(REQ_OP_WRITE, write_flags, bh, wbc);
+ submit_bh_wbc(REQ_OP_WRITE, write_flags, bh,
+ inode_streamid(inode), wbc);
nr_underway++;
}
bh = next;
@@ -1883,7 +1884,8 @@ int __block_write_full_page(struct inode *inode, struct page *page,
struct buffer_head *next = bh->b_this_page;
if (buffer_async_write(bh)) {
clear_buffer_dirty(bh);
- submit_bh_wbc(REQ_OP_WRITE, write_flags, bh, wbc);
+ submit_bh_wbc(REQ_OP_WRITE, write_flags, bh,
+ inode_streamid(inode), wbc);
nr_underway++;
}
bh = next;
@@ -3091,7 +3093,7 @@ void guard_bio_eod(int op, struct bio *bio)
}
static int submit_bh_wbc(int op, int op_flags, struct buffer_head *bh,
- struct writeback_control *wbc)
+ unsigned int stream, struct writeback_control *wbc)
{
struct bio *bio;
@@ -3130,6 +3132,8 @@ static int submit_bh_wbc(int op, int op_flags, struct buffer_head *bh,
/* Take care of bh's that straddle the end of the device */
guard_bio_eod(op, bio);
+ bio_set_streamid(bio, stream);
+
if (buffer_meta(bh))
op_flags |= REQ_META;
if (buffer_prio(bh))
@@ -3142,7 +3146,7 @@ static int submit_bh_wbc(int op, int op_flags, struct buffer_head *bh,
int submit_bh(int op, int op_flags, struct buffer_head *bh)
{
- return submit_bh_wbc(op, op_flags, bh, NULL);
+ return submit_bh_wbc(op, op_flags, bh, 0, NULL);
}
EXPORT_SYMBOL(submit_bh);
diff --git a/fs/mpage.c b/fs/mpage.c
index baff8f820c29..a9d40c0c053e 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -614,6 +614,7 @@ static int __mpage_writepage(struct page *page, struct writeback_control *wbc,
goto confused;
wbc_init_bio(wbc, bio);
+ bio_set_streamid(bio, inode_streamid(inode));
}
/*
--
2.7.4
^ permalink raw reply related [flat|nested] 32+ messages in thread
* [PATCH 07/10] ext4: add support for passing in stream information for buffered writes
2017-06-14 4:01 [PATCHSET v2] Add support for write life time hints Jens Axboe
` (5 preceding siblings ...)
2017-06-14 4:01 ` [PATCH 06/10] fs: add support for buffered writeback to pass down " Jens Axboe
@ 2017-06-14 4:01 ` Jens Axboe
2017-06-14 4:01 ` [PATCH 08/10] xfs: " Jens Axboe
` (3 subsequent siblings)
10 siblings, 0 replies; 32+ messages in thread
From: Jens Axboe @ 2017-06-14 4:01 UTC (permalink / raw)
To: linux-fsdevel, linux-block; +Cc: adilger, Jens Axboe
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
fs/ext4/page-io.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index 1a82138ba739..033b5bfa4e0b 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -350,6 +350,7 @@ void ext4_io_submit(struct ext4_io_submit *io)
int io_op_flags = io->io_wbc->sync_mode == WB_SYNC_ALL ?
REQ_SYNC : 0;
bio_set_op_attrs(io->io_bio, REQ_OP_WRITE, io_op_flags);
+ bio_set_streamid(io->io_bio, inode_streamid(io->io_end->inode));
submit_bio(io->io_bio);
}
io->io_bio = NULL;
@@ -396,6 +397,7 @@ static int io_submit_add_bh(struct ext4_io_submit *io,
ret = io_submit_init_bio(io, bh);
if (ret)
return ret;
+ bio_set_streamid(io->io_bio, inode_streamid(inode));
}
ret = bio_add_page(io->io_bio, page, bh->b_size, bh_offset(bh));
if (ret != bh->b_size)
--
2.7.4
^ permalink raw reply related [flat|nested] 32+ messages in thread
* [PATCH 08/10] xfs: add support for passing in stream information for buffered writes
2017-06-14 4:01 [PATCHSET v2] Add support for write life time hints Jens Axboe
` (6 preceding siblings ...)
2017-06-14 4:01 ` [PATCH 07/10] ext4: add support for passing in stream information for buffered writes Jens Axboe
@ 2017-06-14 4:01 ` Jens Axboe
2017-06-14 4:01 ` [PATCH 09/10] btrfs: " Jens Axboe
` (2 subsequent siblings)
10 siblings, 0 replies; 32+ messages in thread
From: Jens Axboe @ 2017-06-14 4:01 UTC (permalink / raw)
To: linux-fsdevel, linux-block; +Cc: adilger, Jens Axboe
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
fs/xfs/xfs_aops.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 09af0f7cd55e..9770be0140ad 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -505,6 +505,7 @@ xfs_submit_ioend(
return status;
}
+ bio_set_streamid(ioend->io_bio, inode_streamid(ioend->io_inode));
submit_bio(ioend->io_bio);
return 0;
}
@@ -564,6 +565,7 @@ xfs_chain_bio(
bio_chain(ioend->io_bio, new);
bio_get(ioend->io_bio); /* for xfs_destroy_ioend */
ioend->io_bio->bi_opf = REQ_OP_WRITE | wbc_to_write_flags(wbc);
+ bio_set_streamid(ioend->io_bio, inode_streamid(ioend->io_inode));
submit_bio(ioend->io_bio);
ioend->io_bio = new;
}
--
2.7.4
^ permalink raw reply related [flat|nested] 32+ messages in thread
* [PATCH 09/10] btrfs: add support for passing in stream information for buffered writes
2017-06-14 4:01 [PATCHSET v2] Add support for write life time hints Jens Axboe
` (7 preceding siblings ...)
2017-06-14 4:01 ` [PATCH 08/10] xfs: " Jens Axboe
@ 2017-06-14 4:01 ` Jens Axboe
2017-06-14 4:01 ` [PATCH 10/10] nvme: add support for streams and directives Jens Axboe
2017-06-14 15:45 ` [PATCHSET v2] Add support for write life time hints Martin K. Petersen
10 siblings, 0 replies; 32+ messages in thread
From: Jens Axboe @ 2017-06-14 4:01 UTC (permalink / raw)
To: linux-fsdevel, linux-block; +Cc: adilger, Jens Axboe
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
fs/btrfs/extent_io.c | 1 +
fs/btrfs/inode.c | 1 +
2 files changed, 2 insertions(+)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index d3619e010005..b245085e8f10 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2827,6 +2827,7 @@ static int submit_extent_page(int op, int op_flags, struct extent_io_tree *tree,
bio->bi_end_io = end_io_func;
bio->bi_private = tree;
bio_set_op_attrs(bio, op, op_flags);
+ bio_set_streamid(bio, inode_streamid(page->mapping->host));
if (wbc) {
wbc_init_bio(wbc, bio);
wbc_account_io(wbc, page, page_size);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index ef3c98c527c1..db0558a19f65 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8608,6 +8608,7 @@ static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode,
atomic_set(&dip->pending_bios, 0);
btrfs_bio = btrfs_io_bio(io_bio);
btrfs_bio->logical = file_offset;
+ bio_set_streamid(io_bio, bio_stream(dio_bio));
if (write) {
io_bio->bi_end_io = btrfs_endio_direct_write;
--
2.7.4
^ permalink raw reply related [flat|nested] 32+ messages in thread
* [PATCH 10/10] nvme: add support for streams and directives
2017-06-14 4:01 [PATCHSET v2] Add support for write life time hints Jens Axboe
` (8 preceding siblings ...)
2017-06-14 4:01 ` [PATCH 09/10] btrfs: " Jens Axboe
@ 2017-06-14 4:01 ` Jens Axboe
2017-06-14 4:10 ` Andreas Dilger
2017-06-14 15:45 ` [PATCHSET v2] Add support for write life time hints Martin K. Petersen
10 siblings, 1 reply; 32+ messages in thread
From: Jens Axboe @ 2017-06-14 4:01 UTC (permalink / raw)
To: linux-fsdevel, linux-block; +Cc: adilger, Jens Axboe
This adds support for Directives in NVMe, particular for the Streams
directive. Support for Directives is a new feature in NVMe 1.3. It
allows a user to pass in information about where to store the data,
so that it the device can do so most effiently. If an application is
managing and writing data with different life times, mixing differently
retentioned data onto the same locations on flash can cause write
amplification to grow. This, in turn, will reduce performance and
life time of the device.
We default to allocating 4 streams per name space, but it is
configurable with the 'streams_per_ns' module option. If a write stream
is set in a write, flag is as such before sending it to the device.
Some debug stuff in this patch, dumping streams ID params when
we load nvme.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
drivers/nvme/host/core.c | 124 +++++++++++++++++++++++++++++++++++++++++++++++
drivers/nvme/host/nvme.h | 1 +
include/linux/nvme.h | 48 ++++++++++++++++++
3 files changed, 173 insertions(+)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 903d5813023a..d7cbd050ddf4 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -65,6 +65,10 @@ static bool force_apst;
module_param(force_apst, bool, 0644);
MODULE_PARM_DESC(force_apst, "allow APST for newly enumerated devices even if quirked off");
+static char streams_per_ns = 4;
+module_param(streams_per_ns, byte, 0644);
+MODULE_PARM_DESC(streams_per_ns, "if available, allocate this many streams per NS");
+
static LIST_HEAD(nvme_ctrl_list);
static DEFINE_SPINLOCK(dev_list_lock);
@@ -351,6 +355,15 @@ static inline void nvme_setup_rw(struct nvme_ns *ns, struct request *req,
cmnd->rw.slba = cpu_to_le64(nvme_block_nr(ns, blk_rq_pos(req)));
cmnd->rw.length = cpu_to_le16((blk_rq_bytes(req) >> ns->lba_shift) - 1);
+ if (req_op(req) == REQ_OP_WRITE) {
+ if (bio_stream_valid(req->bio) && ns->nr_streams) {
+ unsigned stream = bio_stream(req->bio) & 0xffff;
+
+ control |= NVME_RW_DTYPE_STREAMS;
+ dsmgmt |= ((stream % (ns->nr_streams + 1)) << 16);
+ }
+ }
+
if (ns->ms) {
switch (ns->pi_type) {
case NVME_NS_DPS_PI_TYPE3:
@@ -1073,6 +1086,109 @@ static int nvme_revalidate_disk(struct gendisk *disk)
return 0;
}
+static int nvme_enable_streams(struct nvme_ns *ns)
+{
+ struct nvme_command c;
+
+ memset(&c, 0, sizeof(c));
+
+ c.directive.opcode = nvme_admin_directive_send;
+ c.directive.nsid = cpu_to_le32(ns->ns_id);
+ c.directive.doper = NVME_DIR_SND_ID_OP_ENABLE;
+ c.directive.dtype = NVME_DIR_IDENTIFY;
+ c.directive.tdtype = NVME_DIR_STREAMS;
+ c.directive.endir = NVME_DIR_ENDIR;
+
+ return nvme_submit_sync_cmd(ns->ctrl->admin_q, &c, NULL, 0);
+}
+
+static int nvme_streams_params(struct nvme_ns *ns)
+{
+ struct nvme_ctrl *ctrl = ns->ctrl;
+ struct streams_directive_params s;
+ struct nvme_command c;
+ int ret;
+
+ memset(&c, 0, sizeof(c));
+ memset(&s, 0, sizeof(s));
+
+ c.directive.opcode = nvme_admin_directive_recv;
+ c.directive.nsid = cpu_to_le32(ns->ns_id);
+ c.directive.numd = sizeof(s);
+ c.directive.doper = NVME_DIR_RCV_ST_OP_PARAM;
+ c.directive.dtype = NVME_DIR_STREAMS;
+
+ ret = nvme_submit_sync_cmd(ctrl->admin_q, &c, &s, sizeof(s));
+ if (ret)
+ return ret;
+
+ s.msl = le16_to_cpu(s.msl);
+ s.nssa = le16_to_cpu(s.nssa);
+ s.nsso = le16_to_cpu(s.nsso);
+ s.sws = le32_to_cpu(s.sws);
+ s.sgs = le16_to_cpu(s.sgs);
+ s.nsa = le16_to_cpu(s.nsa);
+ s.nso = le16_to_cpu(s.nso);
+
+ dev_info(ctrl->device, "streams: msl=%u, nssa=%u, nsso=%u, sws=%u "
+ "sgs=%u, nsa=%u, nso=%u\n", s.msl, s.nssa,
+ s.nsso, s.sws, s.sgs, s.nsa, s.nso);
+ return 0;
+}
+
+static int nvme_streams_allocate(struct nvme_ns *ns, unsigned int streams)
+{
+ struct nvme_command c;
+
+ memset(&c, 0, sizeof(c));
+
+ c.directive.opcode = nvme_admin_directive_recv;
+ c.directive.nsid = cpu_to_le32(ns->ns_id);
+ c.directive.doper = NVME_DIR_RCV_ST_OP_RESOURCE;
+ c.directive.dtype = NVME_DIR_STREAMS;
+ c.directive.endir = streams;
+
+ return nvme_submit_sync_cmd(ns->ctrl->admin_q, &c, NULL, 0);
+}
+
+static int nvme_streams_deallocate(struct nvme_ns *ns)
+{
+ struct nvme_command c;
+
+ memset(&c, 0, sizeof(c));
+
+ c.directive.opcode = nvme_admin_directive_send;
+ c.directive.nsid = cpu_to_le32(ns->ns_id);
+ c.directive.doper = NVME_DIR_SND_ST_OP_REL_RSC;
+ c.directive.dtype = NVME_DIR_STREAMS;
+
+ return nvme_submit_sync_cmd(ns->ctrl->admin_q, &c, NULL, 0);
+}
+
+static void nvme_config_streams(struct nvme_ns *ns)
+{
+ int ret;
+
+ ret = nvme_enable_streams(ns);
+ if (ret)
+ return;
+
+ ret = nvme_streams_params(ns);
+ if (ret)
+ return;
+
+ ret = nvme_streams_allocate(ns, streams_per_ns);
+ if (ret)
+ return;
+
+ ret = nvme_streams_params(ns);
+ if (ret)
+ return;
+
+ ns->nr_streams = streams_per_ns;
+ dev_info(ns->ctrl->device, "successfully enabled streams\n");
+}
+
static char nvme_pr_type(enum pr_type type)
{
switch (type) {
@@ -1606,6 +1722,9 @@ int nvme_init_identify(struct nvme_ctrl *ctrl)
ctrl->sgls = le32_to_cpu(id->sgls);
ctrl->kas = le16_to_cpu(id->kas);
+ if (ctrl->oacs & NVME_CTRL_OACS_DIRECTIVES)
+ dev_info(ctrl->dev, "supports directives\n");
+
ctrl->npss = id->npss;
prev_apsta = ctrl->apsta;
if (ctrl->quirks & NVME_QUIRK_NO_APST) {
@@ -2060,6 +2179,9 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
goto out_free_id;
}
+ if (ctrl->oacs & NVME_CTRL_OACS_DIRECTIVES)
+ nvme_config_streams(ns);
+
disk = alloc_disk_node(0, node);
if (!disk)
goto out_free_id;
@@ -2112,6 +2234,8 @@ static void nvme_ns_remove(struct nvme_ns *ns)
&nvme_ns_attr_group);
if (ns->ndev)
nvme_nvm_unregister_sysfs(ns);
+ if (ns->nr_streams)
+ nvme_streams_deallocate(ns);
del_gendisk(ns->disk);
blk_cleanup_queue(ns->queue);
}
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 9d6a070d4391..dc87c8284259 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -192,6 +192,7 @@ struct nvme_ns {
u8 uuid[16];
unsigned ns_id;
+ unsigned nr_streams;
int lba_shift;
u16 ms;
bool ext;
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index b625bacf37ef..8b2f5b140134 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -245,6 +245,7 @@ enum {
NVME_CTRL_ONCS_WRITE_ZEROES = 1 << 3,
NVME_CTRL_VWC_PRESENT = 1 << 0,
NVME_CTRL_OACS_SEC_SUPP = 1 << 0,
+ NVME_CTRL_OACS_DIRECTIVES = 1 << 5,
NVME_CTRL_OACS_DBBUF_SUPP = 1 << 7,
};
@@ -295,6 +296,19 @@ enum {
};
enum {
+ NVME_DIR_IDENTIFY = 0x00,
+ NVME_DIR_STREAMS = 0x01,
+ NVME_DIR_SND_ID_OP_ENABLE = 0x01,
+ NVME_DIR_SND_ST_OP_REL_ID = 0x01,
+ NVME_DIR_SND_ST_OP_REL_RSC = 0x02,
+ NVME_DIR_RCV_ID_OP_PARAM = 0x01,
+ NVME_DIR_RCV_ST_OP_PARAM = 0x01,
+ NVME_DIR_RCV_ST_OP_STATUS = 0x02,
+ NVME_DIR_RCV_ST_OP_RESOURCE = 0x03,
+ NVME_DIR_ENDIR = 0x01,
+};
+
+enum {
NVME_NS_FEAT_THIN = 1 << 0,
NVME_NS_FLBAS_LBA_MASK = 0xf,
NVME_NS_FLBAS_META_EXT = 0x10,
@@ -535,6 +549,7 @@ enum {
NVME_RW_PRINFO_PRCHK_APP = 1 << 11,
NVME_RW_PRINFO_PRCHK_GUARD = 1 << 12,
NVME_RW_PRINFO_PRACT = 1 << 13,
+ NVME_RW_DTYPE_STREAMS = 1 << 4,
};
struct nvme_dsm_cmd {
@@ -604,6 +619,8 @@ enum nvme_admin_opcode {
nvme_admin_download_fw = 0x11,
nvme_admin_ns_attach = 0x15,
nvme_admin_keep_alive = 0x18,
+ nvme_admin_directive_send = 0x19,
+ nvme_admin_directive_recv = 0x1a,
nvme_admin_dbbuf = 0x7C,
nvme_admin_format_nvm = 0x80,
nvme_admin_security_send = 0x81,
@@ -756,6 +773,24 @@ struct nvme_get_log_page_command {
__u32 rsvd14[2];
};
+struct nvme_directive_cmd {
+ __u8 opcode;
+ __u8 flags;
+ __u16 command_id;
+ __le32 nsid;
+ __u64 rsvd2[2];
+ union nvme_data_ptr dptr;
+ __le32 numd;
+ __u8 doper;
+ __u8 dtype;
+ __le16 dspec;
+ __u8 endir;
+ __u8 tdtype;
+ __u16 rsvd15;
+
+ __u32 rsvd16[3];
+};
+
/*
* Fabrics subcommands.
*/
@@ -886,6 +921,18 @@ struct nvme_dbbuf {
__u32 rsvd12[6];
};
+struct streams_directive_params {
+ __u16 msl;
+ __u16 nssa;
+ __u16 nsso;
+ __u8 rsvd[10];
+ __u32 sws;
+ __u16 sgs;
+ __u16 nsa;
+ __u16 nso;
+ __u8 rsvd2[6];
+};
+
struct nvme_command {
union {
struct nvme_common_command common;
@@ -906,6 +953,7 @@ struct nvme_command {
struct nvmf_property_set_command prop_set;
struct nvmf_property_get_command prop_get;
struct nvme_dbbuf dbbuf;
+ struct nvme_directive_cmd directive;
};
};
--
2.7.4
^ permalink raw reply related [flat|nested] 32+ messages in thread
* Re: [PATCH 10/10] nvme: add support for streams and directives
2017-06-14 4:01 ` [PATCH 10/10] nvme: add support for streams and directives Jens Axboe
@ 2017-06-14 4:10 ` Andreas Dilger
0 siblings, 0 replies; 32+ messages in thread
From: Andreas Dilger @ 2017-06-14 4:10 UTC (permalink / raw)
To: Jens Axboe; +Cc: linux-fsdevel, linux-block
[-- Attachment #1: Type: text/plain, Size: 9536 bytes --]
On Jun 13, 2017, at 10:01 PM, Jens Axboe <axboe@kernel.dk> wrote:
>
> This adds support for Directives in NVMe, particular for the Streams
> directive. Support for Directives is a new feature in NVMe 1.3. It
> allows a user to pass in information about where to store the data,
> so that it the device can do so most effiently. If an application is
> managing and writing data with different life times, mixing differently
> retentioned data onto the same locations on flash can cause write
> amplification to grow. This, in turn, will reduce performance and
> life time of the device.
>
> We default to allocating 4 streams per name space, but it is
> configurable with the 'streams_per_ns' module option. If a write stream
> is set in a write, flag is as such before sending it to the device.
>
> Some debug stuff in this patch, dumping streams ID params when
> we load nvme.
>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
You could add my Reviewed-by: but I don't know anything about the NVMe
functionality. The stream ID handling looks good now, in an case.
Cheers, Andreas
> ---
> drivers/nvme/host/core.c | 124 +++++++++++++++++++++++++++++++++++++++++++++++
> drivers/nvme/host/nvme.h | 1 +
> include/linux/nvme.h | 48 ++++++++++++++++++
> 3 files changed, 173 insertions(+)
>
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 903d5813023a..d7cbd050ddf4 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -65,6 +65,10 @@ static bool force_apst;
> module_param(force_apst, bool, 0644);
> MODULE_PARM_DESC(force_apst, "allow APST for newly enumerated devices even if quirked off");
>
> +static char streams_per_ns = 4;
> +module_param(streams_per_ns, byte, 0644);
> +MODULE_PARM_DESC(streams_per_ns, "if available, allocate this many streams per NS");
> +
> static LIST_HEAD(nvme_ctrl_list);
> static DEFINE_SPINLOCK(dev_list_lock);
>
> @@ -351,6 +355,15 @@ static inline void nvme_setup_rw(struct nvme_ns *ns, struct request *req,
> cmnd->rw.slba = cpu_to_le64(nvme_block_nr(ns, blk_rq_pos(req)));
> cmnd->rw.length = cpu_to_le16((blk_rq_bytes(req) >> ns->lba_shift) - 1);
>
> + if (req_op(req) == REQ_OP_WRITE) {
> + if (bio_stream_valid(req->bio) && ns->nr_streams) {
> + unsigned stream = bio_stream(req->bio) & 0xffff;
> +
> + control |= NVME_RW_DTYPE_STREAMS;
> + dsmgmt |= ((stream % (ns->nr_streams + 1)) << 16);
> + }
> + }
> +
> if (ns->ms) {
> switch (ns->pi_type) {
> case NVME_NS_DPS_PI_TYPE3:
> @@ -1073,6 +1086,109 @@ static int nvme_revalidate_disk(struct gendisk *disk)
> return 0;
> }
>
> +static int nvme_enable_streams(struct nvme_ns *ns)
> +{
> + struct nvme_command c;
> +
> + memset(&c, 0, sizeof(c));
> +
> + c.directive.opcode = nvme_admin_directive_send;
> + c.directive.nsid = cpu_to_le32(ns->ns_id);
> + c.directive.doper = NVME_DIR_SND_ID_OP_ENABLE;
> + c.directive.dtype = NVME_DIR_IDENTIFY;
> + c.directive.tdtype = NVME_DIR_STREAMS;
> + c.directive.endir = NVME_DIR_ENDIR;
> +
> + return nvme_submit_sync_cmd(ns->ctrl->admin_q, &c, NULL, 0);
> +}
> +
> +static int nvme_streams_params(struct nvme_ns *ns)
> +{
> + struct nvme_ctrl *ctrl = ns->ctrl;
> + struct streams_directive_params s;
> + struct nvme_command c;
> + int ret;
> +
> + memset(&c, 0, sizeof(c));
> + memset(&s, 0, sizeof(s));
> +
> + c.directive.opcode = nvme_admin_directive_recv;
> + c.directive.nsid = cpu_to_le32(ns->ns_id);
> + c.directive.numd = sizeof(s);
> + c.directive.doper = NVME_DIR_RCV_ST_OP_PARAM;
> + c.directive.dtype = NVME_DIR_STREAMS;
> +
> + ret = nvme_submit_sync_cmd(ctrl->admin_q, &c, &s, sizeof(s));
> + if (ret)
> + return ret;
> +
> + s.msl = le16_to_cpu(s.msl);
> + s.nssa = le16_to_cpu(s.nssa);
> + s.nsso = le16_to_cpu(s.nsso);
> + s.sws = le32_to_cpu(s.sws);
> + s.sgs = le16_to_cpu(s.sgs);
> + s.nsa = le16_to_cpu(s.nsa);
> + s.nso = le16_to_cpu(s.nso);
> +
> + dev_info(ctrl->device, "streams: msl=%u, nssa=%u, nsso=%u, sws=%u "
> + "sgs=%u, nsa=%u, nso=%u\n", s.msl, s.nssa,
> + s.nsso, s.sws, s.sgs, s.nsa, s.nso);
> + return 0;
> +}
> +
> +static int nvme_streams_allocate(struct nvme_ns *ns, unsigned int streams)
> +{
> + struct nvme_command c;
> +
> + memset(&c, 0, sizeof(c));
> +
> + c.directive.opcode = nvme_admin_directive_recv;
> + c.directive.nsid = cpu_to_le32(ns->ns_id);
> + c.directive.doper = NVME_DIR_RCV_ST_OP_RESOURCE;
> + c.directive.dtype = NVME_DIR_STREAMS;
> + c.directive.endir = streams;
> +
> + return nvme_submit_sync_cmd(ns->ctrl->admin_q, &c, NULL, 0);
> +}
> +
> +static int nvme_streams_deallocate(struct nvme_ns *ns)
> +{
> + struct nvme_command c;
> +
> + memset(&c, 0, sizeof(c));
> +
> + c.directive.opcode = nvme_admin_directive_send;
> + c.directive.nsid = cpu_to_le32(ns->ns_id);
> + c.directive.doper = NVME_DIR_SND_ST_OP_REL_RSC;
> + c.directive.dtype = NVME_DIR_STREAMS;
> +
> + return nvme_submit_sync_cmd(ns->ctrl->admin_q, &c, NULL, 0);
> +}
> +
> +static void nvme_config_streams(struct nvme_ns *ns)
> +{
> + int ret;
> +
> + ret = nvme_enable_streams(ns);
> + if (ret)
> + return;
> +
> + ret = nvme_streams_params(ns);
> + if (ret)
> + return;
> +
> + ret = nvme_streams_allocate(ns, streams_per_ns);
> + if (ret)
> + return;
> +
> + ret = nvme_streams_params(ns);
> + if (ret)
> + return;
> +
> + ns->nr_streams = streams_per_ns;
> + dev_info(ns->ctrl->device, "successfully enabled streams\n");
> +}
> +
> static char nvme_pr_type(enum pr_type type)
> {
> switch (type) {
> @@ -1606,6 +1722,9 @@ int nvme_init_identify(struct nvme_ctrl *ctrl)
> ctrl->sgls = le32_to_cpu(id->sgls);
> ctrl->kas = le16_to_cpu(id->kas);
>
> + if (ctrl->oacs & NVME_CTRL_OACS_DIRECTIVES)
> + dev_info(ctrl->dev, "supports directives\n");
> +
> ctrl->npss = id->npss;
> prev_apsta = ctrl->apsta;
> if (ctrl->quirks & NVME_QUIRK_NO_APST) {
> @@ -2060,6 +2179,9 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
> goto out_free_id;
> }
>
> + if (ctrl->oacs & NVME_CTRL_OACS_DIRECTIVES)
> + nvme_config_streams(ns);
> +
> disk = alloc_disk_node(0, node);
> if (!disk)
> goto out_free_id;
> @@ -2112,6 +2234,8 @@ static void nvme_ns_remove(struct nvme_ns *ns)
> &nvme_ns_attr_group);
> if (ns->ndev)
> nvme_nvm_unregister_sysfs(ns);
> + if (ns->nr_streams)
> + nvme_streams_deallocate(ns);
> del_gendisk(ns->disk);
> blk_cleanup_queue(ns->queue);
> }
> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> index 9d6a070d4391..dc87c8284259 100644
> --- a/drivers/nvme/host/nvme.h
> +++ b/drivers/nvme/host/nvme.h
> @@ -192,6 +192,7 @@ struct nvme_ns {
> u8 uuid[16];
>
> unsigned ns_id;
> + unsigned nr_streams;
> int lba_shift;
> u16 ms;
> bool ext;
> diff --git a/include/linux/nvme.h b/include/linux/nvme.h
> index b625bacf37ef..8b2f5b140134 100644
> --- a/include/linux/nvme.h
> +++ b/include/linux/nvme.h
> @@ -245,6 +245,7 @@ enum {
> NVME_CTRL_ONCS_WRITE_ZEROES = 1 << 3,
> NVME_CTRL_VWC_PRESENT = 1 << 0,
> NVME_CTRL_OACS_SEC_SUPP = 1 << 0,
> + NVME_CTRL_OACS_DIRECTIVES = 1 << 5,
> NVME_CTRL_OACS_DBBUF_SUPP = 1 << 7,
> };
>
> @@ -295,6 +296,19 @@ enum {
> };
>
> enum {
> + NVME_DIR_IDENTIFY = 0x00,
> + NVME_DIR_STREAMS = 0x01,
> + NVME_DIR_SND_ID_OP_ENABLE = 0x01,
> + NVME_DIR_SND_ST_OP_REL_ID = 0x01,
> + NVME_DIR_SND_ST_OP_REL_RSC = 0x02,
> + NVME_DIR_RCV_ID_OP_PARAM = 0x01,
> + NVME_DIR_RCV_ST_OP_PARAM = 0x01,
> + NVME_DIR_RCV_ST_OP_STATUS = 0x02,
> + NVME_DIR_RCV_ST_OP_RESOURCE = 0x03,
> + NVME_DIR_ENDIR = 0x01,
> +};
> +
> +enum {
> NVME_NS_FEAT_THIN = 1 << 0,
> NVME_NS_FLBAS_LBA_MASK = 0xf,
> NVME_NS_FLBAS_META_EXT = 0x10,
> @@ -535,6 +549,7 @@ enum {
> NVME_RW_PRINFO_PRCHK_APP = 1 << 11,
> NVME_RW_PRINFO_PRCHK_GUARD = 1 << 12,
> NVME_RW_PRINFO_PRACT = 1 << 13,
> + NVME_RW_DTYPE_STREAMS = 1 << 4,
> };
>
> struct nvme_dsm_cmd {
> @@ -604,6 +619,8 @@ enum nvme_admin_opcode {
> nvme_admin_download_fw = 0x11,
> nvme_admin_ns_attach = 0x15,
> nvme_admin_keep_alive = 0x18,
> + nvme_admin_directive_send = 0x19,
> + nvme_admin_directive_recv = 0x1a,
> nvme_admin_dbbuf = 0x7C,
> nvme_admin_format_nvm = 0x80,
> nvme_admin_security_send = 0x81,
> @@ -756,6 +773,24 @@ struct nvme_get_log_page_command {
> __u32 rsvd14[2];
> };
>
> +struct nvme_directive_cmd {
> + __u8 opcode;
> + __u8 flags;
> + __u16 command_id;
> + __le32 nsid;
> + __u64 rsvd2[2];
> + union nvme_data_ptr dptr;
> + __le32 numd;
> + __u8 doper;
> + __u8 dtype;
> + __le16 dspec;
> + __u8 endir;
> + __u8 tdtype;
> + __u16 rsvd15;
> +
> + __u32 rsvd16[3];
> +};
> +
> /*
> * Fabrics subcommands.
> */
> @@ -886,6 +921,18 @@ struct nvme_dbbuf {
> __u32 rsvd12[6];
> };
>
> +struct streams_directive_params {
> + __u16 msl;
> + __u16 nssa;
> + __u16 nsso;
> + __u8 rsvd[10];
> + __u32 sws;
> + __u16 sgs;
> + __u16 nsa;
> + __u16 nso;
> + __u8 rsvd2[6];
> +};
> +
> struct nvme_command {
> union {
> struct nvme_common_command common;
> @@ -906,6 +953,7 @@ struct nvme_command {
> struct nvmf_property_set_command prop_set;
> struct nvmf_property_get_command prop_get;
> struct nvme_dbbuf dbbuf;
> + struct nvme_directive_cmd directive;
> };
> };
>
> --
> 2.7.4
>
Cheers, Andreas
[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 195 bytes --]
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCHSET v2] Add support for write life time hints
2017-06-14 4:01 [PATCHSET v2] Add support for write life time hints Jens Axboe
` (9 preceding siblings ...)
2017-06-14 4:01 ` [PATCH 10/10] nvme: add support for streams and directives Jens Axboe
@ 2017-06-14 15:45 ` Martin K. Petersen
2017-06-14 15:53 ` Jens Axboe
10 siblings, 1 reply; 32+ messages in thread
From: Martin K. Petersen @ 2017-06-14 15:45 UTC (permalink / raw)
To: Jens Axboe; +Cc: linux-fsdevel, linux-block, adilger
Jens,
> A new iteration of this patchset, previously known as write streams.
> As before, this patchset aims at enabling applications split up
> writes into separate streams, based on the perceived life time
> of the data written. This is useful for a variety of reasons:
>
> - With NVMe 1.3 compliant devices, the device can expose multiple
> streams. Separating data written into streams based on life time
> can drastically reduce the write amplification. This helps device
> endurance, and increases performance. Testing just performed
> internally at Facebook with these patches showed up to a 25%
> reduction in NAND writes in a RocksDB setup.
>
> - Software caching solutions can make more intelligent decisions
> on how and where to place data.
>
> Contrary to previous patches, we're not exposing numeric stream values
> anymore. I've previously advocated for just doing a set of hints that
> makes sense instead. See the coverage from the LSFMM summit this year:
I am all for having these write hints. But one request I would like to
make is that we just make them flags and abolish all notions of the term
"streams" from block for this particular use case (since it is more
hinty than streamy).
There are devices coming where a proper stream ID is prerequisite to
separate data streams for other reasons than bucketing based on data
lifetime. So my preference would be that we make the lifetime hints be
flags in block. That does not preclude using Streams Directives to
implement them in the NVMe NAND flash case. But it does not cause
conflicts with the use cases that need "proper" stream IDs for QoS or
colocation avoidance purposes in SCSI.
--
Martin K. Petersen Oracle Linux Engineering
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCHSET v2] Add support for write life time hints
2017-06-14 15:45 ` [PATCHSET v2] Add support for write life time hints Martin K. Petersen
@ 2017-06-14 15:53 ` Jens Axboe
2017-06-14 16:00 ` Martin K. Petersen
2017-06-14 16:01 ` Christoph Hellwig
0 siblings, 2 replies; 32+ messages in thread
From: Jens Axboe @ 2017-06-14 15:53 UTC (permalink / raw)
To: Martin K. Petersen; +Cc: linux-fsdevel, linux-block, adilger
On 06/14/2017 09:45 AM, Martin K. Petersen wrote:
>
> Jens,
>
>> A new iteration of this patchset, previously known as write streams.
>> As before, this patchset aims at enabling applications split up
>> writes into separate streams, based on the perceived life time
>> of the data written. This is useful for a variety of reasons:
>>
>> - With NVMe 1.3 compliant devices, the device can expose multiple
>> streams. Separating data written into streams based on life time
>> can drastically reduce the write amplification. This helps device
>> endurance, and increases performance. Testing just performed
>> internally at Facebook with these patches showed up to a 25%
>> reduction in NAND writes in a RocksDB setup.
>>
>> - Software caching solutions can make more intelligent decisions
>> on how and where to place data.
>>
>> Contrary to previous patches, we're not exposing numeric stream values
>> anymore. I've previously advocated for just doing a set of hints that
>> makes sense instead. See the coverage from the LSFMM summit this year:
>
> I am all for having these write hints. But one request I would like to
> make is that we just make them flags and abolish all notions of the term
> "streams" from block for this particular use case (since it is more
> hinty than streamy).
>
> There are devices coming where a proper stream ID is prerequisite to
> separate data streams for other reasons than bucketing based on data
> lifetime. So my preference would be that we make the lifetime hints be
> flags in block. That does not preclude using Streams Directives to
> implement them in the NVMe NAND flash case. But it does not cause
> conflicts with the use cases that need "proper" stream IDs for QoS or
> colocation avoidance purposes in SCSI.
So how about we just call it "write_hint"? It sounds mostly like a
naming issue to me, as you would then map that to some specific stream
in your driver. You're free to do that right now. They are all flags,
it's just packed as a value to not waste too much space.
--
Jens Axboe
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCHSET v2] Add support for write life time hints
2017-06-14 15:53 ` Jens Axboe
@ 2017-06-14 16:00 ` Martin K. Petersen
2017-06-14 16:03 ` Jens Axboe
2017-06-14 16:01 ` Christoph Hellwig
1 sibling, 1 reply; 32+ messages in thread
From: Martin K. Petersen @ 2017-06-14 16:00 UTC (permalink / raw)
To: Jens Axboe; +Cc: Martin K. Petersen, linux-fsdevel, linux-block, adilger
Jens,
> So how about we just call it "write_hint"? It sounds mostly like a
> naming issue to me, as you would then map that to some specific stream
> in your driver. You're free to do that right now. They are all flags,
> it's just packed as a value to not waste too much space.
Sure, that's fine with me. But let's call them bi_hints or something. I
have a couple that I would like to add that are I/O direction agnostic.
--
Martin K. Petersen Oracle Linux Engineering
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCHSET v2] Add support for write life time hints
2017-06-14 16:00 ` Martin K. Petersen
@ 2017-06-14 16:03 ` Jens Axboe
0 siblings, 0 replies; 32+ messages in thread
From: Jens Axboe @ 2017-06-14 16:03 UTC (permalink / raw)
To: Martin K. Petersen; +Cc: linux-fsdevel, linux-block, adilger
On 06/14/2017 10:00 AM, Martin K. Petersen wrote:
>
> Jens,
>
>> So how about we just call it "write_hint"? It sounds mostly like a
>> naming issue to me, as you would then map that to some specific stream
>> in your driver. You're free to do that right now. They are all flags,
>> it's just packed as a value to not waste too much space.
>
> Sure, that's fine with me. But let's call them bi_hints or something. I
> have a couple that I would like to add that are I/O direction agnostic.
Assuming you saw Christoph's reply, if we just make then bio/req flags,
you can translate them into whatever you want. Work for you?
--
Jens Axboe
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCHSET v2] Add support for write life time hints
2017-06-14 15:53 ` Jens Axboe
2017-06-14 16:00 ` Martin K. Petersen
@ 2017-06-14 16:01 ` Christoph Hellwig
2017-06-14 16:02 ` Jens Axboe
2017-06-14 16:04 ` Martin K. Petersen
1 sibling, 2 replies; 32+ messages in thread
From: Christoph Hellwig @ 2017-06-14 16:01 UTC (permalink / raw)
To: Jens Axboe; +Cc: Martin K. Petersen, linux-fsdevel, linux-block, adilger
On Wed, Jun 14, 2017 at 09:53:05AM -0600, Jens Axboe wrote:
> So how about we just call it "write_hint"? It sounds mostly like a
> naming issue to me, as you would then map that to some specific stream
> in your driver. You're free to do that right now. They are all flags,
> it's just packed as a value to not waste too much space.
I think what Martin wants (or at least what I'd want him to want) is
to define a few REQ_* bits that mirror the RWF bits, use that to
transfer the information down the stack, and then only translate it
to stream ids in the driver.
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCHSET v2] Add support for write life time hints
2017-06-14 16:01 ` Christoph Hellwig
@ 2017-06-14 16:02 ` Jens Axboe
2017-06-14 16:04 ` Martin K. Petersen
1 sibling, 0 replies; 32+ messages in thread
From: Jens Axboe @ 2017-06-14 16:02 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Martin K. Petersen, linux-fsdevel, linux-block, adilger
On 06/14/2017 10:01 AM, Christoph Hellwig wrote:
> On Wed, Jun 14, 2017 at 09:53:05AM -0600, Jens Axboe wrote:
>> So how about we just call it "write_hint"? It sounds mostly like a
>> naming issue to me, as you would then map that to some specific stream
>> in your driver. You're free to do that right now. They are all flags,
>> it's just packed as a value to not waste too much space.
>
> I think what Martin wants (or at least what I'd want him to want) is
> to define a few REQ_* bits that mirror the RWF bits, use that to
> transfer the information down the stack, and then only translate it
> to stream ids in the driver.
Sure, that's fine with me. Let me make that change.
--
Jens Axboe
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCHSET v2] Add support for write life time hints
2017-06-14 16:01 ` Christoph Hellwig
2017-06-14 16:02 ` Jens Axboe
@ 2017-06-14 16:04 ` Martin K. Petersen
2017-06-14 16:26 ` Jens Axboe
` (2 more replies)
1 sibling, 3 replies; 32+ messages in thread
From: Martin K. Petersen @ 2017-06-14 16:04 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, Martin K. Petersen, linux-fsdevel, linux-block,
adilger
Christoph,
> I think what Martin wants (or at least what I'd want him to want) is
> to define a few REQ_* bits that mirror the RWF bits, use that to
> transfer the information down the stack, and then only translate it
> to stream ids in the driver.
Yup. If we have enough space in the existing flags that's perfect (I
lost count after your op/flag shuffle).
--
Martin K. Petersen Oracle Linux Engineering
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCHSET v2] Add support for write life time hints
2017-06-14 16:04 ` Martin K. Petersen
@ 2017-06-14 16:26 ` Jens Axboe
2017-06-14 17:22 ` Jens Axboe
2017-06-14 23:39 ` Andreas Dilger
2 siblings, 0 replies; 32+ messages in thread
From: Jens Axboe @ 2017-06-14 16:26 UTC (permalink / raw)
To: Martin K. Petersen, Christoph Hellwig; +Cc: linux-fsdevel, linux-block, adilger
On 06/14/2017 10:04 AM, Martin K. Petersen wrote:
>
> Christoph,
>
>> I think what Martin wants (or at least what I'd want him to want) is
>> to define a few REQ_* bits that mirror the RWF bits, use that to
>> transfer the information down the stack, and then only translate it
>> to stream ids in the driver.
>
> Yup. If we have enough space in the existing flags that's perfect (I
> lost count after your op/flag shuffle).
There are enough flags, I just implemented it. I'll run some testing
and rebase everything, then send out a v3...
--
Jens Axboe
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCHSET v2] Add support for write life time hints
2017-06-14 16:04 ` Martin K. Petersen
2017-06-14 16:26 ` Jens Axboe
@ 2017-06-14 17:22 ` Jens Axboe
2017-06-14 23:39 ` Andreas Dilger
2 siblings, 0 replies; 32+ messages in thread
From: Jens Axboe @ 2017-06-14 17:22 UTC (permalink / raw)
To: Martin K. Petersen, Christoph Hellwig; +Cc: linux-fsdevel, linux-block, adilger
On 06/14/2017 10:04 AM, Martin K. Petersen wrote:
>
> Christoph,
>
>> I think what Martin wants (or at least what I'd want him to want) is
>> to define a few REQ_* bits that mirror the RWF bits, use that to
>> transfer the information down the stack, and then only translate it
>> to stream ids in the driver.
>
> Yup. If we have enough space in the existing flags that's perfect (I
> lost count after your op/flag shuffle).
OK, diff on top of the current stuff, so you can see how that changes
things. If this looks good to folks, I'll update the series to achieve
the same final result.
diff --git a/block/bio.c b/block/bio.c
index 77f4be1f..25ea7c3 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -595,7 +595,6 @@ void __bio_clone_fast(struct bio *bio, struct bio *bio_src)
bio->bi_opf = bio_src->bi_opf;
bio->bi_iter = bio_src->bi_iter;
bio->bi_io_vec = bio_src->bi_io_vec;
- bio->bi_stream = bio_src->bi_stream;
bio_clone_blkcg_association(bio, bio_src);
}
@@ -679,7 +678,6 @@ struct bio *bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
bio->bi_opf = bio_src->bi_opf;
bio->bi_iter.bi_sector = bio_src->bi_iter.bi_sector;
bio->bi_iter.bi_size = bio_src->bi_iter.bi_size;
- bio->bi_stream = bio_src->bi_stream;
switch (bio_op(bio)) {
case REQ_OP_DISCARD:
@@ -2084,6 +2082,22 @@ void bio_clone_blkcg_association(struct bio *dst, struct bio *src)
#endif /* CONFIG_BLK_CGROUP */
+static const unsigned int rwf_write_to_opf_flag[] = {
+ 0, REQ_WRITE_SHORT, REQ_WRITE_MEDIUM, REQ_WRITE_LONG, REQ_WRITE_EXTREME
+};
+
+/*
+ * 'stream_flags' is one of RWF_WRITE_LIFE_* values
+ */
+void bio_set_streamid(struct bio *bio, unsigned int rwf_flags)
+{
+ if (WARN_ON_ONCE(rwf_flags >= ARRAY_SIZE(rwf_write_to_opf_flag)))
+ return;
+
+ bio->bi_opf |= rwf_write_to_opf_flag[rwf_flags];
+}
+EXPORT_SYMBOL_GPL(bio_set_streamid);
+
static void __init biovec_init_slabs(void)
{
int i;
diff --git a/block/blk-core.c b/block/blk-core.c
index 3f4a206..a7421b7 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2057,12 +2057,6 @@ blk_qc_t generic_make_request(struct bio *bio)
do {
struct request_queue *q = bdev_get_queue(bio->bi_bdev);
- if (bio_op(bio) == REQ_OP_WRITE &&
- bio_stream(bio) < BLK_MAX_STREAM) {
- q->stream_writes[bio_stream(bio)] +=
- bio->bi_iter.bi_size >> 9;
- }
-
if (likely(blk_queue_enter(q, false) == 0)) {
struct bio_list lower, same;
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 28998ac..7d299df 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -696,7 +696,8 @@ static struct request *attempt_merge(struct request_queue *q,
* Don't allow merge of different streams, or for a stream with
* non-stream IO.
*/
- if (req->bio->bi_stream != next->bio->bi_stream)
+ if ((req->cmd_flags & REQ_WRITE_LIFE_MASK) !=
+ (next->cmd_flags & REQ_WRITE_LIFE_MASK))
return NULL;
/*
@@ -822,7 +823,8 @@ bool blk_rq_merge_ok(struct request *rq, struct bio *bio)
* Don't allow merge of different streams, or for a stream with
* non-stream IO.
*/
- if (rq->bio->bi_stream != bio->bi_stream)
+ if ((rq->cmd_flags & REQ_WRITE_LIFE_MASK) !=
+ (bio->bi_opf & REQ_WRITE_LIFE_MASK))
return false;
return true;
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index d7cbd05..8988133 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -335,6 +335,20 @@ static inline int nvme_setup_discard(struct nvme_ns *ns, struct request *req,
return BLK_MQ_RQ_QUEUE_OK;
}
+static inline unsigned int req_to_streamid(struct request *req)
+{
+ if (req->cmd_flags & REQ_WRITE_SHORT)
+ return 1;
+ else if (req->cmd_flags & REQ_WRITE_MEDIUM)
+ return 2;
+ else if (req->cmd_flags & REQ_WRITE_LONG)
+ return 3;
+ else if (req->cmd_flags & REQ_WRITE_EXTREME)
+ return 4;
+
+ return 0;
+}
+
static inline void nvme_setup_rw(struct nvme_ns *ns, struct request *req,
struct nvme_command *cmnd)
{
@@ -355,13 +369,15 @@ static inline void nvme_setup_rw(struct nvme_ns *ns, struct request *req,
cmnd->rw.slba = cpu_to_le64(nvme_block_nr(ns, blk_rq_pos(req)));
cmnd->rw.length = cpu_to_le16((blk_rq_bytes(req) >> ns->lba_shift) - 1);
- if (req_op(req) == REQ_OP_WRITE) {
- if (bio_stream_valid(req->bio) && ns->nr_streams) {
- unsigned stream = bio_stream(req->bio) & 0xffff;
+ if (req_op(req) == REQ_OP_WRITE && blk_stream_valid(req->cmd_flags) &&
+ ns->nr_streams) {
+ unsigned stream = req_to_streamid(req);
- control |= NVME_RW_DTYPE_STREAMS;
- dsmgmt |= ((stream % (ns->nr_streams + 1)) << 16);
- }
+ control |= NVME_RW_DTYPE_STREAMS;
+ dsmgmt |= ((stream % (ns->nr_streams + 1)) << 16);
+
+ if (stream < BLK_MAX_STREAM)
+ req->q->stream_writes[stream] += blk_rq_bytes(req) >> 9;
}
if (ns->ms) {
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 284b8a7..31ba4a8 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -227,7 +227,6 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
bio.bi_iter.bi_sector = pos >> 9;
bio.bi_private = current;
bio.bi_end_io = blkdev_bio_end_io_simple;
- bio.bi_stream = iocb_streamid(iocb);
ret = bio_iov_iter_get_pages(&bio, iter);
if (unlikely(ret))
@@ -240,6 +239,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
should_dirty = true;
} else {
bio.bi_opf = dio_bio_write_op(iocb);
+ bio_set_streamid(&bio, iocb_streamid(iocb));
task_io_account_write(ret);
}
@@ -361,7 +361,6 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
bio->bi_iter.bi_sector = pos >> 9;
bio->bi_private = dio;
bio->bi_end_io = blkdev_bio_end_io;
- bio->bi_stream = iocb_streamid(iocb);
ret = bio_iov_iter_get_pages(bio, iter);
if (unlikely(ret)) {
@@ -376,6 +375,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
bio_set_pages_dirty(bio);
} else {
bio->bi_opf = dio_bio_write_op(iocb);
+ bio_set_streamid(bio, iocb_streamid(iocb));
task_io_account_write(bio->bi_iter.bi_size);
}
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index db0558a..ef3c98c 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8608,7 +8608,6 @@ static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode,
atomic_set(&dip->pending_bios, 0);
btrfs_bio = btrfs_io_bio(io_bio);
btrfs_bio->logical = file_offset;
- bio_set_streamid(io_bio, bio_stream(dio_bio));
if (write) {
io_bio->bi_end_io = btrfs_endio_direct_write;
diff --git a/fs/direct-io.c b/fs/direct-io.c
index c9c8b9f..a770e82 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -386,7 +386,7 @@ dio_bio_alloc(struct dio *dio, struct dio_submit *sdio,
else
bio->bi_end_io = dio_bio_end_io;
- bio->bi_stream = iocb_streamid(dio->iocb);
+ bio_set_streamid(bio, iocb_streamid(dio->iocb));
sdio->bio = bio;
sdio->logical_offset_in_bio = sdio->cur_page_fs_offset;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index d1b04b0..a1b3145 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -443,6 +443,7 @@ extern struct bio *bio_copy_kern(struct request_queue *, void *, unsigned int,
gfp_t, int);
extern void bio_set_pages_dirty(struct bio *bio);
extern void bio_check_pages_dirty(struct bio *bio);
+extern void bio_set_streamid(struct bio *bio, unsigned int rwf_flags);
void generic_start_io_acct(int rw, unsigned long sectors,
struct hd_struct *part);
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 1940876..06c8c35 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -36,8 +36,6 @@ struct bio {
unsigned short bi_flags; /* status, etc and bvec pool number */
unsigned short bi_ioprio;
- unsigned int bi_stream; /* write life time hint */
-
struct bvec_iter bi_iter;
/* Number of segments in this BIO after
@@ -203,6 +201,10 @@ enum req_flag_bits {
__REQ_PREFLUSH, /* request for cache flush */
__REQ_RAHEAD, /* read ahead, can fail anytime */
__REQ_BACKGROUND, /* background IO */
+ __REQ_WRITE_SHORT, /* short life time write */
+ __REQ_WRITE_MEDIUM, /* medium life time write */
+ __REQ_WRITE_LONG, /* long life time write */
+ __REQ_WRITE_EXTREME, /* extremely long life time write */
/* command specific flags for REQ_OP_WRITE_ZEROES: */
__REQ_NOUNMAP, /* do not free blocks when zeroing */
@@ -223,6 +225,13 @@ enum req_flag_bits {
#define REQ_PREFLUSH (1ULL << __REQ_PREFLUSH)
#define REQ_RAHEAD (1ULL << __REQ_RAHEAD)
#define REQ_BACKGROUND (1ULL << __REQ_BACKGROUND)
+#define REQ_WRITE_SHORT (1ULL << __REQ_WRITE_SHORT)
+#define REQ_WRITE_MEDIUM (1ULL << __REQ_WRITE_MEDIUM)
+#define REQ_WRITE_LONG (1ULL << __REQ_WRITE_LONG)
+#define REQ_WRITE_EXTREME (1ULL << __REQ_WRITE_EXTREME)
+
+#define REQ_WRITE_LIFE_MASK (REQ_WRITE_SHORT | REQ_WRITE_MEDIUM | \
+ REQ_WRITE_LONG | REQ_WRITE_EXTREME)
#define REQ_NOUNMAP (1ULL << __REQ_NOUNMAP)
@@ -314,19 +323,9 @@ struct blk_rq_stat {
u64 batch;
};
-static inline void bio_set_streamid(struct bio *bio, unsigned int stream)
-{
- bio->bi_stream = stream;
-}
-
-static inline bool bio_stream_valid(struct bio *bio)
-{
- return bio->bi_stream != 0;
-}
-
-static inline unsigned int bio_stream(struct bio *bio)
+static inline bool blk_stream_valid(unsigned int opf)
{
- return bio->bi_stream;
+ return (opf & REQ_WRITE_LIFE_MASK) != 0;
}
#endif /* __LINUX_BLK_TYPES_H */
--
Jens Axboe
^ permalink raw reply related [flat|nested] 32+ messages in thread
* Re: [PATCHSET v2] Add support for write life time hints
2017-06-14 16:04 ` Martin K. Petersen
2017-06-14 16:26 ` Jens Axboe
2017-06-14 17:22 ` Jens Axboe
@ 2017-06-14 23:39 ` Andreas Dilger
2017-06-15 3:26 ` Jens Axboe
2 siblings, 1 reply; 32+ messages in thread
From: Andreas Dilger @ 2017-06-14 23:39 UTC (permalink / raw)
To: Martin Petersen; +Cc: Christoph Hellwig, Jens Axboe, linux-fsdevel, linux-block
[-- Attachment #1: Type: text/plain, Size: 1857 bytes --]
On Jun 14, 2017, at 10:04 AM, Martin K. Petersen <martin.petersen@oracle.com> wrote:
> Christoph,
>
>> I think what Martin wants (or at least what I'd want him to want) is
>> to define a few REQ_* bits that mirror the RWF bits, use that to
>> transfer the information down the stack, and then only translate it
>> to stream ids in the driver.
>
> Yup. If we have enough space in the existing flags that's perfect (I
> lost count after your op/flag shuffle).
Just to clarify, does this mean that Jens' "lifetime" hints are going to
be independent of separate "stream ID" values in the future (needing a
similar, but independent, set of fields for stream IDs, or will they
both be implemented using a common transport in the kernel (i.e. both
share a single set of fields in the inode/bio/etc?
I can imagine applications using either the lifetime hints (mapped to
stream IDs internally), or explicitly managing stream IDs, but it seems
unlikely that both would be in use at the same time, unless the "lifetime"
hints are intended to also indirectly map to HSM-like tiered cache/storage
hints internally to the storage stack (e.g. whether writes should be put
into local cachefs vs remote network storage, or bcache NVRAM vs. HDD)?
Jens' patches were moving in the direction of allowing up to 16 stream IDs,
with the default being IDs 1-4 are intended to reflect different data
lifetimes, but allowing up to 15 unique IDs to be specified "per device"*
if userspace wants to manage this more explicitly.
Martin, how many stream IDs do you think are needed? If we need too many,
then we wouldn't be able to pack them into the existing pwritev2() flags
and would need another new API to specify the stream IDs.
[*] more work is needed to handle this on multiple filesystems, MDRAID/LVM, etc.
Cheers, Andreas
[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 195 bytes --]
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCHSET v2] Add support for write life time hints
2017-06-14 23:39 ` Andreas Dilger
@ 2017-06-15 3:26 ` Jens Axboe
2017-06-15 3:53 ` Andreas Dilger
2017-06-16 13:58 ` Christoph Hellwig
0 siblings, 2 replies; 32+ messages in thread
From: Jens Axboe @ 2017-06-15 3:26 UTC (permalink / raw)
To: Andreas Dilger; +Cc: Martin Petersen, Christoph Hellwig, fsdevel, linux-block
On Wed, Jun 14, 2017 at 5:39 PM, Andreas Dilger <adilger@dilger.ca> wrote:
> On Jun 14, 2017, at 10:04 AM, Martin K. Petersen <martin.petersen@oracle.com> wrote:
>> Christoph,
>>
>>> I think what Martin wants (or at least what I'd want him to want) is
>>> to define a few REQ_* bits that mirror the RWF bits, use that to
>>> transfer the information down the stack, and then only translate it
>>> to stream ids in the driver.
>>
>> Yup. If we have enough space in the existing flags that's perfect (I
>> lost count after your op/flag shuffle).
>
> Just to clarify, does this mean that Jens' "lifetime" hints are going to
> be independent of separate "stream ID" values in the future (needing a
> similar, but independent, set of fields for stream IDs, or will they
> both be implemented using a common transport in the kernel (i.e. both
> share a single set of fields in the inode/bio/etc?
There's no reason they can't coexist. Now that bio->bi_stream is gone
and we just have the flags, if we want to expose separate "real" stream
IDs, then we could later re-introduce that. It'd be pretty trivial to
just use the most pertinent information on the driver side.
>From my perspective, all I really care about is the 4 hints. It's a
simple enough interface that applications can understand and use it, and
we don't need any management of actual stream IDs. I think that has the
highest chance of success. Modifying an application to use it is
trivial, even something like RocksDB (if you havehad to make changes
to RocksDB, you'll get this).
> I can imagine applications using either the lifetime hints (mapped to
> stream IDs internally), or explicitly managing stream IDs, but it seems
> unlikely that both would be in use at the same time, unless the "lifetime"
> hints are intended to also indirectly map to HSM-like tiered cache/storage
> hints internally to the storage stack (e.g. whether writes should be put
> into local cachefs vs remote network storage, or bcache NVRAM vs. HDD)?
Agree, both would be viable solutions, and they can coexist. Explicitly
managing stream IDs is a lot more advanced, and the 4 hints will get
most people 95% of the way there. For explicit streams, we'd need
something similar to my previous versions of this patchset, where a full
API is provided.
> Jens' patches were moving in the direction of allowing up to 16 stream IDs,
> with the default being IDs 1-4 are intended to reflect different data
> lifetimes, but allowing up to 15 unique IDs to be specified "per device"*
> if userspace wants to manage this more explicitly.
I don't think we should mix the two up. As mentioned above, one will
require a full API and kernel management of the streams. I had a lot of
push back on that approach previously, which is why I've now gone for
the simpler version. If we want to expose stream IDs explicitly to user
space, that's a lot more involved.
--
Jens Axboe
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCHSET v2] Add support for write life time hints
2017-06-15 3:26 ` Jens Axboe
@ 2017-06-15 3:53 ` Andreas Dilger
2017-06-15 3:57 ` Jens Axboe
2017-06-16 13:58 ` Christoph Hellwig
1 sibling, 1 reply; 32+ messages in thread
From: Andreas Dilger @ 2017-06-15 3:53 UTC (permalink / raw)
To: Jens Axboe; +Cc: Martin Petersen, Christoph Hellwig, fsdevel, linux-block
[-- Attachment #1: Type: text/plain, Size: 3445 bytes --]
On Jun 14, 2017, at 9:26 PM, Jens Axboe <axboe@kernel.dk> wrote:
> On Wed, Jun 14, 2017 at 5:39 PM, Andreas Dilger <adilger@dilger.ca> wrote:
>> On Jun 14, 2017, at 10:04 AM, Martin K. Petersen <martin.petersen@oracle.com> wrote:
>>> Christoph,
>>>
>>>> I think what Martin wants (or at least what I'd want him to want) is
>>>> to define a few REQ_* bits that mirror the RWF bits, use that to
>>>> transfer the information down the stack, and then only translate it
>>>> to stream ids in the driver.
>>>
>>> Yup. If we have enough space in the existing flags that's perfect (I
>>> lost count after your op/flag shuffle).
>>
>> Just to clarify, does this mean that Jens' "lifetime" hints are going to
>> be independent of separate "stream ID" values in the future (needing a
>> similar, but independent, set of fields for stream IDs, or will they
>> both be implemented using a common transport in the kernel (i.e. both
>> share a single set of fields in the inode/bio/etc?
>
> There's no reason they can't coexist. Now that bio->bi_stream is gone
> and we just have the flags, if we want to expose separate "real" stream
> IDs, then we could later re-introduce that. It'd be pretty trivial to
> just use the most pertinent information on the driver side.
>
> From my perspective, all I really care about is the 4 hints. It's a
> simple enough interface that applications can understand and use it, and
> we don't need any management of actual stream IDs. I think that has the
> highest chance of success. Modifying an application to use it is
> trivial, even something like RocksDB (if you havehad to make changes
> to RocksDB, you'll get this).
If this is really to be only flags, rather than a hybrid of flags and IDs
(as I had thought), then it probably makes sense to limit the inode usage
to a few bits in i_flags using S_LIFETIME_* values rather than a separate
16-bit i_stream field, which can be used for the actual stream IDs later.
>> I can imagine applications using either the lifetime hints (mapped to
>> stream IDs internally), or explicitly managing stream IDs, but it seems
>> unlikely that both would be in use at the same time, unless the "lifetime"
>> hints are intended to also indirectly map to HSM-like tiered cache/storage
>> hints internally to the storage stack (e.g. whether writes should be put
>> into local cachefs vs remote network storage, or bcache NVRAM vs. HDD)?
>
> Agree, both would be viable solutions, and they can coexist. Explicitly
> managing stream IDs is a lot more advanced, and the 4 hints will get
> most people 95% of the way there. For explicit streams, we'd need
> something similar to my previous versions of this patchset, where a full
> API is provided.
>
>> Jens' patches were moving in the direction of allowing up to 16 stream IDs,
>> with the default being IDs 1-4 are intended to reflect different data
>> lifetimes, but allowing up to 15 unique IDs to be specified "per device"*
>> if userspace wants to manage this more explicitly.
>
> I don't think we should mix the two up. As mentioned above, one will
> require a full API and kernel management of the streams. I had a lot of
> push back on that approach previously, which is why I've now gone for
> the simpler version. If we want to expose stream IDs explicitly to user
> space, that's a lot more involved.
>
> --
> Jens Axboe
Cheers, Andreas
[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 195 bytes --]
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCHSET v2] Add support for write life time hints
2017-06-15 3:53 ` Andreas Dilger
@ 2017-06-15 3:57 ` Jens Axboe
2017-06-15 4:38 ` Jens Axboe
0 siblings, 1 reply; 32+ messages in thread
From: Jens Axboe @ 2017-06-15 3:57 UTC (permalink / raw)
To: Andreas Dilger; +Cc: Martin Petersen, Christoph Hellwig, fsdevel, linux-block
On 06/14/2017 09:53 PM, Andreas Dilger wrote:
> On Jun 14, 2017, at 9:26 PM, Jens Axboe <axboe@kernel.dk> wrote:
>> On Wed, Jun 14, 2017 at 5:39 PM, Andreas Dilger <adilger@dilger.ca> wrote:
>>> On Jun 14, 2017, at 10:04 AM, Martin K. Petersen <martin.petersen@oracle.com> wrote:
>>>> Christoph,
>>>>
>>>>> I think what Martin wants (or at least what I'd want him to want) is
>>>>> to define a few REQ_* bits that mirror the RWF bits, use that to
>>>>> transfer the information down the stack, and then only translate it
>>>>> to stream ids in the driver.
>>>>
>>>> Yup. If we have enough space in the existing flags that's perfect (I
>>>> lost count after your op/flag shuffle).
>>>
>>> Just to clarify, does this mean that Jens' "lifetime" hints are going to
>>> be independent of separate "stream ID" values in the future (needing a
>>> similar, but independent, set of fields for stream IDs, or will they
>>> both be implemented using a common transport in the kernel (i.e. both
>>> share a single set of fields in the inode/bio/etc?
>>
>> There's no reason they can't coexist. Now that bio->bi_stream is gone
>> and we just have the flags, if we want to expose separate "real" stream
>> IDs, then we could later re-introduce that. It'd be pretty trivial to
>> just use the most pertinent information on the driver side.
>>
>> From my perspective, all I really care about is the 4 hints. It's a
>> simple enough interface that applications can understand and use it, and
>> we don't need any management of actual stream IDs. I think that has the
>> highest chance of success. Modifying an application to use it is
>> trivial, even something like RocksDB (if you havehad to make changes
>> to RocksDB, you'll get this).
>
> If this is really to be only flags, rather than a hybrid of flags and IDs
> (as I had thought), then it probably makes sense to limit the inode usage
> to a few bits in i_flags using S_LIFETIME_* values rather than a separate
> 16-bit i_stream field, which can be used for the actual stream IDs later.
Christoph alluded to that as well. And yes, if we are contemplating
something else later on, then that does make more sense. I'll make that
change.
--
Jens Axboe
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCHSET v2] Add support for write life time hints
2017-06-15 3:57 ` Jens Axboe
@ 2017-06-15 4:38 ` Jens Axboe
0 siblings, 0 replies; 32+ messages in thread
From: Jens Axboe @ 2017-06-15 4:38 UTC (permalink / raw)
To: Andreas Dilger; +Cc: Martin Petersen, Christoph Hellwig, fsdevel, linux-block
On 06/14/2017 09:57 PM, Jens Axboe wrote:
> On 06/14/2017 09:53 PM, Andreas Dilger wrote:
>> On Jun 14, 2017, at 9:26 PM, Jens Axboe <axboe@kernel.dk> wrote:
>>> On Wed, Jun 14, 2017 at 5:39 PM, Andreas Dilger <adilger@dilger.ca> wrote:
>>>> On Jun 14, 2017, at 10:04 AM, Martin K. Petersen <martin.petersen@oracle.com> wrote:
>>>>> Christoph,
>>>>>
>>>>>> I think what Martin wants (or at least what I'd want him to want) is
>>>>>> to define a few REQ_* bits that mirror the RWF bits, use that to
>>>>>> transfer the information down the stack, and then only translate it
>>>>>> to stream ids in the driver.
>>>>>
>>>>> Yup. If we have enough space in the existing flags that's perfect (I
>>>>> lost count after your op/flag shuffle).
>>>>
>>>> Just to clarify, does this mean that Jens' "lifetime" hints are going to
>>>> be independent of separate "stream ID" values in the future (needing a
>>>> similar, but independent, set of fields for stream IDs, or will they
>>>> both be implemented using a common transport in the kernel (i.e. both
>>>> share a single set of fields in the inode/bio/etc?
>>>
>>> There's no reason they can't coexist. Now that bio->bi_stream is gone
>>> and we just have the flags, if we want to expose separate "real" stream
>>> IDs, then we could later re-introduce that. It'd be pretty trivial to
>>> just use the most pertinent information on the driver side.
>>>
>>> From my perspective, all I really care about is the 4 hints. It's a
>>> simple enough interface that applications can understand and use it, and
>>> we don't need any management of actual stream IDs. I think that has the
>>> highest chance of success. Modifying an application to use it is
>>> trivial, even something like RocksDB (if you havehad to make changes
>>> to RocksDB, you'll get this).
>>
>> If this is really to be only flags, rather than a hybrid of flags and IDs
>> (as I had thought), then it probably makes sense to limit the inode usage
>> to a few bits in i_flags using S_LIFETIME_* values rather than a separate
>> 16-bit i_stream field, which can be used for the actual stream IDs later.
>
> Christoph alluded to that as well. And yes, if we are contemplating
> something else later on, then that does make more sense. I'll make that
> change.
Easy enough to do, see attached incremental patch. Only issue is that we
then have to lock the inode, and use the atomic flag setting. I'm assuming
that's safe to do from that path. We only need to grab the lock if the
hint changes, which is basically never should.
diff --git a/fs/inode.c b/fs/inode.c
index bd8bf44f3f31..db5914783a71 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -149,7 +149,6 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
inode->i_blocks = 0;
inode->i_bytes = 0;
inode->i_generation = 0;
- inode->i_write_hint = 0;
inode->i_pipe = NULL;
inode->i_bdev = NULL;
inode->i_cdev = NULL;
diff --git a/fs/read_write.c b/fs/read_write.c
index 9cb2314efca3..70f8764ae117 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -675,6 +675,7 @@ EXPORT_SYMBOL(iov_shorten);
static ssize_t do_iter_readv_writev(struct file *filp, struct iov_iter *iter,
loff_t *ppos, int type, int flags)
{
+ struct inode *inode = file_inode(filp);
struct kiocb kiocb;
ssize_t ret;
@@ -688,12 +689,19 @@ static ssize_t do_iter_readv_writev(struct file *filp, struct iov_iter *iter,
kiocb.ki_flags |= IOCB_DSYNC;
if (flags & RWF_SYNC)
kiocb.ki_flags |= (IOCB_DSYNC | IOCB_SYNC);
- if (flags & RWF_WRITE_LIFE_MASK) {
- struct inode *inode = file_inode(filp);
+ if ((flags & RWF_WRITE_LIFE_MASK) ||
+ (inode->i_flags & S_WRITE_LIFE_MASK)) {
+ unsigned int hint, iflags;
- inode->i_write_hint = (flags & RWF_WRITE_LIFE_MASK) >>
- RWF_WRITE_LIFE_SHIFT;
- kiocb.ki_flags |= inode->i_write_hint << IOCB_WRITE_LIFE_SHIFT;
+ hint = (flags & RWF_WRITE_LIFE_MASK) >> RWF_WRITE_LIFE_SHIFT;
+ iflags = hint << S_WRITE_LIFE_SHIFT;
+
+ if ((inode->i_flags & S_WRITE_LIFE_MASK) != iflags) {
+ inode_lock(inode);
+ inode_set_flags(inode, iflags, iflags);
+ inode_unlock(inode);
+ }
+ kiocb.ki_flags |= hint << IOCB_WRITE_LIFE_SHIFT;
}
kiocb.ki_pos = *ppos;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 63798b67fcfe..929f1fc088c6 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -270,10 +270,10 @@ struct writeback_control;
#define IOCB_WRITE (1 << 6)
/*
- * Steal 4-bits for stream information, this allows 16 valid streams
+ * Steal 3 bits for stream information, this allows 8 valid streams
*/
#define IOCB_WRITE_LIFE_SHIFT 7
-#define IOCB_WRITE_LIFE_MASK (BIT(7) | BIT(8) | BIT(9) | BIT(10))
+#define IOCB_WRITE_LIFE_MASK (BIT(7) | BIT(8) | BIT(9))
struct kiocb {
struct file *ki_filp;
@@ -603,7 +603,6 @@ struct inode {
struct timespec i_ctime;
spinlock_t i_lock; /* i_blocks, i_bytes, maybe i_size */
unsigned short i_bytes;
- unsigned short i_write_hint;
unsigned int i_blkbits;
blkcnt_t i_blocks;
@@ -668,14 +667,6 @@ struct inode {
void *i_private; /* fs or device private pointer */
};
-static inline unsigned int inode_write_hint(struct inode *inode)
-{
- if (inode)
- return inode->i_write_hint;
-
- return 0;
-}
-
static inline unsigned int i_blocksize(const struct inode *node)
{
return (1 << node->i_blkbits);
@@ -1849,6 +1840,17 @@ struct super_operations {
#endif
/*
+ * Expected life time hint of a write for this inode. This maps directly
+ * to the RWF_WRITE_LIFE_* hints
+ */
+#define S_WRITE_LIFE_SHIFT 14 /* 16384, next bit */
+#define S_WRITE_LIFE_MASK 0x1c000 /* bits 14..16 */
+#define S_WRITE_LIFE_SHORT (1 << S_WRITE_LIFE_SHIFT)
+#define S_WRITE_LIFE_MEDIUM (2 << S_WRITE_LIFE_SHIFT)
+#define S_WRITE_LIFE_LONG (3 << S_WRITE_LIFE_SHIFT)
+#define S_WRITE_LIFE_EXTREME (4 << S_WRITE_LIFE_SHIFT)
+
+/*
* Note that nosuid etc flags are inode-specific: setting some file-system
* flags just means all the inodes inherit those flags by default. It might be
* possible to override it selectively if you really wanted to with some
@@ -1894,6 +1896,15 @@ static inline bool HAS_UNMAPPED_ID(struct inode *inode)
return !uid_valid(inode->i_uid) || !gid_valid(inode->i_gid);
}
+static inline unsigned int inode_write_hint(struct inode *inode)
+{
+ if (inode)
+ return (inode->i_flags & S_WRITE_LIFE_MASK) >>
+ S_WRITE_LIFE_SHIFT;
+
+ return 0;
+}
+
/*
* Inode state bits. Protected by inode->i_lock
*
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 58b7ee06b380..b68dc74236c1 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -362,10 +362,10 @@ struct fscrypt_key {
#define RWF_SYNC 0x00000004 /* per-IO O_SYNC */
/*
- * Data life time write flags, steal 4 bits for that
+ * Data life time write flags, steal 3 bits for that
*/
#define RWF_WRITE_LIFE_SHIFT 4
-#define RWF_WRITE_LIFE_MASK 0x000000f0 /* 4 bits of stream ID */
+#define RWF_WRITE_LIFE_MASK 0x00000070 /* 3 bits of stream ID */
#define RWF_WRITE_LIFE_SHORT (1 << RWF_WRITE_LIFE_SHIFT)
#define RWF_WRITE_LIFE_MEDIUM (2 << RWF_WRITE_LIFE_SHIFT)
#define RWF_WRITE_LIFE_LONG (3 << RWF_WRITE_LIFE_SHIFT)
--
Jens Axboe
^ permalink raw reply related [flat|nested] 32+ messages in thread
* Re: [PATCHSET v2] Add support for write life time hints
2017-06-15 3:26 ` Jens Axboe
2017-06-15 3:53 ` Andreas Dilger
@ 2017-06-16 13:58 ` Christoph Hellwig
2017-06-16 14:40 ` Jens Axboe
1 sibling, 1 reply; 32+ messages in thread
From: Christoph Hellwig @ 2017-06-16 13:58 UTC (permalink / raw)
To: Jens Axboe
Cc: Andreas Dilger, Martin Petersen, Christoph Hellwig, fsdevel,
linux-block
> >From my perspective, all I really care about is the 4 hints. It's a
> simple enough interface that applications can understand and use it, and
> we don't need any management of actual stream IDs. I think that has the
> highest chance of success. Modifying an application to use it is
> trivial, even something like RocksDB (if you havehad to make changes
> to RocksDB, you'll get this).
Btw, are your current RocksDB patches available somewhere?
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCHSET v2] Add support for write life time hints
2017-06-16 13:58 ` Christoph Hellwig
@ 2017-06-16 14:40 ` Jens Axboe
2017-06-16 15:52 ` Christoph Hellwig
0 siblings, 1 reply; 32+ messages in thread
From: Jens Axboe @ 2017-06-16 14:40 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Andreas Dilger, Martin Petersen, fsdevel, linux-block
On 06/16/2017 07:58 AM, Christoph Hellwig wrote:
>> >From my perspective, all I really care about is the 4 hints. It's a
>> simple enough interface that applications can understand and use it, and
>> we don't need any management of actual stream IDs. I think that has the
>> highest chance of success. Modifying an application to use it is
>> trivial, even something like RocksDB (if you havehad to make changes
>> to RocksDB, you'll get this).
>
> Btw, are your current RocksDB patches available somewhere?
Yep, it's on Mark Callaghan's github:
https://github.com/mdcallag/rocksdb
This is using mostly the v1 I posted, the only difference is that I just
named the four flags RWF_S[1-4].
--
Jens Axboe
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCHSET v2] Add support for write life time hints
2017-06-16 14:40 ` Jens Axboe
@ 2017-06-16 15:52 ` Christoph Hellwig
2017-06-16 15:55 ` Jens Axboe
0 siblings, 1 reply; 32+ messages in thread
From: Christoph Hellwig @ 2017-06-16 15:52 UTC (permalink / raw)
To: Jens Axboe
Cc: Christoph Hellwig, Andreas Dilger, Martin Petersen, fsdevel,
linux-block
On Fri, Jun 16, 2017 at 08:40:11AM -0600, Jens Axboe wrote:
> On 06/16/2017 07:58 AM, Christoph Hellwig wrote:
> >> >From my perspective, all I really care about is the 4 hints. It's a
> >> simple enough interface that applications can understand and use it, and
> >> we don't need any management of actual stream IDs. I think that has the
> >> highest chance of success. Modifying an application to use it is
> >> trivial, even something like RocksDB (if you havehad to make changes
> >> to RocksDB, you'll get this).
> >
> > Btw, are your current RocksDB patches available somewhere?
>
> Yep, it's on Mark Callaghan's github:
>
> https://github.com/mdcallag/rocksdb
>
> This is using mostly the v1 I posted, the only difference is that I just
> named the four flags RWF_S[1-4].
That commits look a bit like mess, but I guess I shouldn't expect
kernel-style changelogs and patch separation on github :)
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCHSET v2] Add support for write life time hints
2017-06-16 15:52 ` Christoph Hellwig
@ 2017-06-16 15:55 ` Jens Axboe
0 siblings, 0 replies; 32+ messages in thread
From: Jens Axboe @ 2017-06-16 15:55 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Andreas Dilger, Martin Petersen, fsdevel, linux-block
On 06/16/2017 09:52 AM, Christoph Hellwig wrote:
> On Fri, Jun 16, 2017 at 08:40:11AM -0600, Jens Axboe wrote:
>> On 06/16/2017 07:58 AM, Christoph Hellwig wrote:
>>>> >From my perspective, all I really care about is the 4 hints. It's a
>>>> simple enough interface that applications can understand and use it, and
>>>> we don't need any management of actual stream IDs. I think that has the
>>>> highest chance of success. Modifying an application to use it is
>>>> trivial, even something like RocksDB (if you havehad to make changes
>>>> to RocksDB, you'll get this).
>>>
>>> Btw, are your current RocksDB patches available somewhere?
>>
>> Yep, it's on Mark Callaghan's github:
>>
>> https://github.com/mdcallag/rocksdb
>>
>> This is using mostly the v1 I posted, the only difference is that I just
>> named the four flags RWF_S[1-4].
>
> That commits look a bit like mess, but I guess I shouldn't expect
> kernel-style changelogs and patch separation on github :)
It's just a quick test patch, not intended to be pushed upstream in its
current state. But it's good enough to validate it for testing. In summary,
what the patch does is:
- Redo log, level 0 and 1 of the LSM goes to stream 1.
- Level 2 goes to stream 2
- Level 3 goes to stream 3
- Level 4 and above goes to stream 4.
Each level is roughly 10x in size, but written at the same rate. So each
level roughly has a life time 10x of data writes of the previous one.
--
Jens Axboe
^ permalink raw reply [flat|nested] 32+ messages in thread