[PATCH v2] Support for write stream IDs

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v2] Support for write stream IDs
@ 2015-04-18 20:03 Jens Axboe
  2015-04-18 20:03 ` [PATCH 1/7] block: add support for carrying a stream ID in a bio Jens Axboe
                   ` (6 more replies)
  0 siblings, 7 replies; 27+ messages in thread
From: Jens Axboe @ 2015-04-18 20:03 UTC (permalink / raw)
  To: axboe, linux-kernel, linux-fsdevel; +Cc: ming.l, adilger, david

Hi,

v2 of this posting. Changes since v1:

- Rebased on top of current master.

- Fix EINVAL -> -EINVAL typo.

- Cleanup up BIO_STREAM_OFFSET definition.

- Pack i_streamid and f_streamid better into struct file and struct
  inode.

- Add a separate per-file hint, FADV_FILE_STREAMID. This only sets
  the write stream on the file, not the inode. FADV_STREAMID sets
  the hint both in the file and the inode.

 block/bio.c                  |    2 ++
 block/blk-core.c             |    3 +++
 fs/btrfs/extent_io.c         |    1 +
 fs/btrfs/inode.c             |    1 +
 fs/buffer.c                  |    4 ++--
 fs/direct-io.c               |    4 ++++
 fs/ext4/page-io.c            |    1 +
 fs/inode.c                   |    1 +
 fs/mpage.c                   |    1 +
 fs/open.c                    |    1 +
 fs/xfs/xfs_aops.c            |    1 +
 include/linux/blk_types.h    |   28 +++++++++++++++++++++++++++-
 include/linux/fs.h           |   22 ++++++++++++++++++++++
 include/uapi/linux/fadvise.h |    3 +++
 mm/fadvise.c                 |   25 +++++++++++++++++++++++++
 15 files changed, 95 insertions(+), 3 deletions(-)

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 1/7] block: add support for carrying a stream ID in a bio
  2015-04-18 20:03 [PATCH v2] Support for write stream IDs Jens Axboe
@ 2015-04-18 20:03 ` Jens Axboe
  2015-04-18 20:03 ` [PATCH 2/7] Add support for per-file/inode stream ID Jens Axboe
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 27+ messages in thread
From: Jens Axboe @ 2015-04-18 20:03 UTC (permalink / raw)
  To: axboe, linux-kernel, linux-fsdevel; +Cc: ming.l, adilger, david, Jens Axboe

The top bits of bio->bi_flags are reserved for keeping the
allocation pool, set aside the next eight bits for carrying
a stream ID. That leaves us with support for 255 streams,
0 is reserved as a "stream not set" value.

Add helpers for setting/getting stream ID of a bio.

Signed-off-by: Jens Axboe <axboe@fb.com>
---
 block/bio.c               |  2 ++
 block/blk-core.c          |  3 +++
 include/linux/blk_types.h | 28 +++++++++++++++++++++++++++-
 3 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/block/bio.c b/block/bio.c
index f66a4eae16ee..1cd3d745047c 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -567,6 +567,7 @@ void __bio_clone_fast(struct bio *bio, struct bio *bio_src)
 	bio->bi_rw = bio_src->bi_rw;
 	bio->bi_iter = bio_src->bi_iter;
 	bio->bi_io_vec = bio_src->bi_io_vec;
+	bio_set_streamid(bio, bio_get_streamid(bio_src));
 }
 EXPORT_SYMBOL(__bio_clone_fast);
 
@@ -672,6 +673,7 @@ integrity_clone:
 		}
 	}
 
+	bio_set_streamid(bio, bio_get_streamid(bio_src));
 	return bio;
 }
 EXPORT_SYMBOL(bio_clone_bioset);
diff --git a/block/blk-core.c b/block/blk-core.c
index fd154b94447a..6276ce1ad46b 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1941,6 +1941,9 @@ void generic_make_request(struct bio *bio)
 	do {
 		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
 
+		if (bio_streamid_valid(bio))
+			blk_add_trace_msg(q, "StreamID=%u", bio_get_streamid(bio));
+
 		q->make_request_fn(q, bio);
 
 		bio = bio_list_pop(current->bio_list);
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index a1b25e35ea5f..0678c7baa7b1 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -138,9 +138,35 @@ struct bio {
 #define BIO_POOL_BITS		(4)
 #define BIO_POOL_NONE		((1UL << BIO_POOL_BITS) - 1)
 #define BIO_POOL_OFFSET		(BITS_PER_LONG - BIO_POOL_BITS)
-#define BIO_POOL_MASK		(1UL << BIO_POOL_OFFSET)
 #define BIO_POOL_IDX(bio)	((bio)->bi_flags >> BIO_POOL_OFFSET)
 
+/*
+ * after the pool bits, next 8 bits are for the stream id
+ */
+#define BIO_STREAM_BITS		(8)
+#define BIO_STREAM_OFFSET	(BIO_POOL_OFFSET - BIO_STREAM_BITS)
+#define BIO_STREAM_MASK		((1 << BIO_STREAM_BITS) - 1)
+
+static inline unsigned long streamid_to_flags(unsigned int id)
+{
+	return (unsigned long) (id & BIO_STREAM_MASK) << BIO_STREAM_OFFSET;
+}
+
+static inline void bio_set_streamid(struct bio *bio, unsigned int id)
+{
+	bio->bi_flags |= streamid_to_flags(id);
+}
+
+static inline unsigned int bio_get_streamid(struct bio *bio)
+{
+	return (bio->bi_flags >> BIO_STREAM_OFFSET) & BIO_STREAM_MASK;
+}
+
+static inline bool bio_streamid_valid(struct bio *bio)
+{
+	return bio_get_streamid(bio) != 0;
+}
+
 #endif /* CONFIG_BLOCK */
 
 /*
-- 
2.4.0.rc2.1.g3d6bc9a

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 2/7] Add support for per-file/inode stream ID
  2015-04-18 20:03 [PATCH v2] Support for write stream IDs Jens Axboe
  2015-04-18 20:03 ` [PATCH 1/7] block: add support for carrying a stream ID in a bio Jens Axboe
@ 2015-04-18 20:03 ` Jens Axboe
  2015-04-18 20:03 ` [PATCH 3/7] direct-io: add support for write stream IDs Jens Axboe
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 27+ messages in thread
From: Jens Axboe @ 2015-04-18 20:03 UTC (permalink / raw)
  To: axboe, linux-kernel, linux-fsdevel; +Cc: ming.l, adilger, david, Jens Axboe

Writing on flash devices can be much more efficient, if we can
inform the device what kind of data can be grouped together. If
the device is able to group data together with similar lifetimes,
then it can be more efficient in garbage collection. This, in turn,
leads to lower write amplification, which is a win on both device
wear and performance.

Add a new fadvise hint, POSIX_FADV_STREAMID, which sets the file
and inode streamid. The file streamid is used if we have the file
available at the time of the write (O_DIRECT), we use the inode
streamid if not (buffered writeback). The fadvise hint uses the
'offset' field to specify a stream ID. A second
POSIX_FADV_FILE_STREAMID sets only the stream ID on the file,
not the inode.

Signed-off-by: Jens Axboe <axboe@fb.com>
---
 fs/inode.c                   |  1 +
 fs/open.c                    |  1 +
 include/linux/fs.h           | 22 ++++++++++++++++++++++
 include/uapi/linux/fadvise.h |  3 +++
 mm/fadvise.c                 | 25 +++++++++++++++++++++++++
 5 files changed, 52 insertions(+)

diff --git a/fs/inode.c b/fs/inode.c
index f00b16f45507..41885322ba64 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -149,6 +149,7 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
 	inode->i_blocks = 0;
 	inode->i_bytes = 0;
 	inode->i_generation = 0;
+	inode->i_streamid = 0;
 	inode->i_pipe = NULL;
 	inode->i_bdev = NULL;
 	inode->i_cdev = NULL;
diff --git a/fs/open.c b/fs/open.c
index 6796f04d6032..979f34ff165b 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -743,6 +743,7 @@ static int do_dentry_open(struct file *f,
 	f->f_flags &= ~(O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC);
 
 	file_ra_state_init(&f->f_ra, f->f_mapping->host->i_mapping);
+	f->f_streamid = 0;
 
 	return 0;
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c7496f263860..f43b866c3639 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -648,6 +648,7 @@ struct inode {
 #ifdef CONFIG_IMA
 	atomic_t		i_readcount; /* struct files open RO */
 #endif
+	unsigned int		i_streamid;
 	const struct file_operations	*i_fop;	/* former ->i_op->default_file_ops */
 	struct file_lock_context	*i_flctx;
 	struct address_space	i_data;
@@ -668,6 +669,14 @@ struct inode {
 	void			*i_private; /* fs or device private pointer */
 };
 
+static inline unsigned int inode_streamid(struct inode *inode)
+{
+	if (inode)
+		return inode->i_streamid;
+
+	return 0;
+}
+
 static inline int inode_unhashed(struct inode *inode)
 {
 	return hlist_unhashed(&inode->i_hash);
@@ -839,6 +848,7 @@ struct file {
 	 * Must not be taken from IRQ context.
 	 */
 	spinlock_t		f_lock;
+	unsigned int		f_streamid;
 	atomic_long_t		f_count;
 	unsigned int 		f_flags;
 	fmode_t			f_mode;
@@ -870,6 +880,18 @@ struct file_handle {
 	unsigned char f_handle[0];
 };
 
+/*
+ * If the file doesn't have a stream ID set, return the inode stream ID
+ * in case that has been set.
+ */
+static inline unsigned int file_streamid(struct file *f)
+{
+	if (f->f_streamid)
+		return f->f_streamid;
+
+	return inode_streamid(f->f_inode);
+}
+
 static inline struct file *get_file(struct file *f)
 {
 	atomic_long_inc(&f->f_count);
diff --git a/include/uapi/linux/fadvise.h b/include/uapi/linux/fadvise.h
index e8e747139b9a..e6e561db7f88 100644
--- a/include/uapi/linux/fadvise.h
+++ b/include/uapi/linux/fadvise.h
@@ -18,4 +18,7 @@
 #define POSIX_FADV_NOREUSE	5 /* Data will be accessed once.  */
 #endif
 
+#define POSIX_FADV_STREAMID	8 /* associate stream ID with file+inode */
+#define POSIX_FADV_FILE_STREAMID 9 /* associate stream ID with file */
+
 #endif	/* FADVISE_H_INCLUDED */
diff --git a/mm/fadvise.c b/mm/fadvise.c
index 4a3907cf79f8..a56e81840040 100644
--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -60,6 +60,8 @@ SYSCALL_DEFINE4(fadvise64_64, int, fd, loff_t, offset, loff_t, len, int, advice)
 		case POSIX_FADV_WILLNEED:
 		case POSIX_FADV_NOREUSE:
 		case POSIX_FADV_DONTNEED:
+		case POSIX_FADV_STREAMID:
+		case POSIX_FADV_FILE_STREAMID:
 			/* no bad return value, but ignore advice */
 			break;
 		default:
@@ -144,6 +146,29 @@ SYSCALL_DEFINE4(fadvise64_64, int, fd, loff_t, offset, loff_t, len, int, advice)
 			}
 		}
 		break;
+	case POSIX_FADV_STREAMID:
+	case POSIX_FADV_FILE_STREAMID:
+		/*
+		 * streamid is stored in offset... we don't limit or check
+		 * if the device supports streams, or if it does, if the
+		 * stream nr is within the limits. 1 is the lowest valid
+		 * stream id, 0 is "don't care/know".
+		 */
+		if (offset != (unsigned int) offset) {
+			ret = -EINVAL;
+			break;
+		}
+		/*
+		 * FILE_STREAMID stores only in the file, STREAMID stores
+		 * the stream hint in both the file and the inode.
+		 */
+		f.file->f_streamid = offset;
+		if (advice == POSIX_FADV_STREAMID) {
+			spin_lock(&inode->i_lock);
+			inode->i_streamid = offset;
+			spin_unlock(&inode->i_lock);
+		}
+		break;
 	default:
 		ret = -EINVAL;
 	}
-- 
2.4.0.rc2.1.g3d6bc9a

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 3/7] direct-io: add support for write stream IDs
  2015-04-18 20:03 [PATCH v2] Support for write stream IDs Jens Axboe
  2015-04-18 20:03 ` [PATCH 1/7] block: add support for carrying a stream ID in a bio Jens Axboe
  2015-04-18 20:03 ` [PATCH 2/7] Add support for per-file/inode stream ID Jens Axboe
@ 2015-04-18 20:03 ` Jens Axboe
  2015-04-18 20:03 ` [PATCH 4/7] Add stream ID support for buffered mpage/__block_write_full_page() Jens Axboe
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 27+ messages in thread
From: Jens Axboe @ 2015-04-18 20:03 UTC (permalink / raw)
  To: axboe, linux-kernel, linux-fsdevel; +Cc: ming.l, adilger, david, Jens Axboe

Get the streamid from the file, if any, and set it on the bio.

Signed-off-by: Jens Axboe <axboe@fb.com>
---
 fs/direct-io.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/fs/direct-io.c b/fs/direct-io.c
index c3b560b24a46..d318a143b186 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -76,6 +76,7 @@ struct dio_submit {
 	int reap_counter;		/* rate limit reaping */
 	sector_t final_block_in_request;/* doesn't change */
 	int boundary;			/* prev block is at a boundary */
+	int streamid;			/* Write stream ID */
 	get_block_t *get_block;		/* block mapping function */
 	dio_submit_t *submit_io;	/* IO submition function */
 
@@ -371,6 +372,8 @@ dio_bio_alloc(struct dio *dio, struct dio_submit *sdio,
 
 	sdio->bio = bio;
 	sdio->logical_offset_in_bio = sdio->cur_page_fs_offset;
+
+	bio_set_streamid(bio, sdio->streamid);
 }
 
 /*
@@ -1201,6 +1204,7 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
 	sdio.blkbits = blkbits;
 	sdio.blkfactor = i_blkbits - blkbits;
 	sdio.block_in_file = offset >> blkbits;
+	sdio.streamid = file_streamid(iocb->ki_filp);
 
 	sdio.get_block = get_block;
 	dio->end_io = end_io;
-- 
2.4.0.rc2.1.g3d6bc9a

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 4/7] Add stream ID support for buffered mpage/__block_write_full_page()
  2015-04-18 20:03 [PATCH v2] Support for write stream IDs Jens Axboe
                   ` (2 preceding siblings ...)
  2015-04-18 20:03 ` [PATCH 3/7] direct-io: add support for write stream IDs Jens Axboe
@ 2015-04-18 20:03 ` Jens Axboe
  2015-04-18 20:03 ` [PATCH 5/7] btrfs: add support for write stream IDs Jens Axboe
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 27+ messages in thread
From: Jens Axboe @ 2015-04-18 20:03 UTC (permalink / raw)
  To: axboe, linux-kernel, linux-fsdevel; +Cc: ming.l, adilger, david, Jens Axboe

Pass on the inode stream ID to the bio allocation.

Signed-off-by: Jens Axboe <axboe@fb.com>
---
 fs/buffer.c | 4 ++--
 fs/mpage.c  | 1 +
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index c7a5602d01ee..5191523cec56 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1774,7 +1774,7 @@ static int __block_write_full_page(struct inode *inode, struct page *page,
 	do {
 		struct buffer_head *next = bh->b_this_page;
 		if (buffer_async_write(bh)) {
-			submit_bh(write_op, bh);
+			_submit_bh(write_op, bh, streamid_to_flags(inode_streamid(inode)));
 			nr_underway++;
 		}
 		bh = next;
@@ -1828,7 +1828,7 @@ recover:
 		struct buffer_head *next = bh->b_this_page;
 		if (buffer_async_write(bh)) {
 			clear_buffer_dirty(bh);
-			submit_bh(write_op, bh);
+			_submit_bh(write_op, bh, streamid_to_flags(inode_streamid(inode)));
 			nr_underway++;
 		}
 		bh = next;
diff --git a/fs/mpage.c b/fs/mpage.c
index 3e79220babac..fba13f4b981d 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -605,6 +605,7 @@ alloc_new:
 				bio_get_nr_vecs(bdev), GFP_NOFS|__GFP_HIGH);
 		if (bio == NULL)
 			goto confused;
+		bio_set_streamid(bio, inode_streamid(inode));
 	}
 
 	/*
-- 
2.4.0.rc2.1.g3d6bc9a

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 5/7] btrfs: add support for write stream IDs
  2015-04-18 20:03 [PATCH v2] Support for write stream IDs Jens Axboe
                   ` (3 preceding siblings ...)
  2015-04-18 20:03 ` [PATCH 4/7] Add stream ID support for buffered mpage/__block_write_full_page() Jens Axboe
@ 2015-04-18 20:03 ` Jens Axboe
  2015-04-18 20:03 ` [PATCH 6/7] xfs: add support for buffered writeback stream ID Jens Axboe
  2015-04-18 20:03 ` [PATCH 7/7] ext4: add support for write stream IDs Jens Axboe
  6 siblings, 0 replies; 27+ messages in thread
From: Jens Axboe @ 2015-04-18 20:03 UTC (permalink / raw)
  To: axboe, linux-kernel, linux-fsdevel; +Cc: ming.l, adilger, david, Jens Axboe

Both buffered and O_DIRECT.

Signed-off-by: Jens Axboe <axboe@fb.com>
Acked-by: Chris Mason <clm@fb.com> 
---
 fs/btrfs/extent_io.c | 1 +
 fs/btrfs/inode.c     | 1 +
 2 files changed, 2 insertions(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index d688cfe5d496..2845fae054b6 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2838,6 +2838,7 @@ static int submit_extent_page(int rw, struct extent_io_tree *tree,
 	bio_add_page(bio, page, page_size, offset);
 	bio->bi_end_io = end_io_func;
 	bio->bi_private = tree;
+	bio_set_streamid(bio, inode_streamid(page->mapping->host));
 
 	if (bio_ret)
 		*bio_ret = bio;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 43192e10cc43..8ff7527e8d05 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8046,6 +8046,7 @@ static void btrfs_submit_direct(int rw, struct bio *dio_bio,
 	atomic_set(&dip->pending_bios, 0);
 	btrfs_bio = btrfs_io_bio(io_bio);
 	btrfs_bio->logical = file_offset;
+	bio_set_streamid(io_bio, inode_streamid(inode));
 
 	if (write) {
 		io_bio->bi_end_io = btrfs_endio_direct_write;
-- 
2.4.0.rc2.1.g3d6bc9a


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 6/7] xfs: add support for buffered writeback stream ID
  2015-04-18 20:03 [PATCH v2] Support for write stream IDs Jens Axboe
                   ` (4 preceding siblings ...)
  2015-04-18 20:03 ` [PATCH 5/7] btrfs: add support for write stream IDs Jens Axboe
@ 2015-04-18 20:03 ` Jens Axboe
  2015-04-18 20:03 ` [PATCH 7/7] ext4: add support for write stream IDs Jens Axboe
  6 siblings, 0 replies; 27+ messages in thread
From: Jens Axboe @ 2015-04-18 20:03 UTC (permalink / raw)
  To: axboe, linux-kernel, linux-fsdevel; +Cc: ming.l, adilger, david, Jens Axboe

Signed-off-by: Jens Axboe <axboe@fb.com>
---
 fs/xfs/xfs_aops.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 1d8eef9cf0f5..6d166e55de9a 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -376,6 +376,7 @@ xfs_submit_ioend_bio(
 	atomic_inc(&ioend->io_remaining);
 	bio->bi_private = ioend;
 	bio->bi_end_io = xfs_end_bio;
+	bio_set_streamid(bio, ioend->io_inode->i_streamid);
 	submit_bio(wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC : WRITE, bio);
 }
 
-- 
2.4.0.rc2.1.g3d6bc9a


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 7/7] ext4: add support for write stream IDs
  2015-04-18 20:03 [PATCH v2] Support for write stream IDs Jens Axboe
                   ` (5 preceding siblings ...)
  2015-04-18 20:03 ` [PATCH 6/7] xfs: add support for buffered writeback stream ID Jens Axboe
@ 2015-04-18 20:03 ` Jens Axboe
  6 siblings, 0 replies; 27+ messages in thread
From: Jens Axboe @ 2015-04-18 20:03 UTC (permalink / raw)
  To: axboe, linux-kernel, linux-fsdevel; +Cc: ming.l, adilger, david, Jens Axboe

Signed-off-by: Jens Axboe <axboe@fb.com>
---
 fs/ext4/page-io.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index 464984261e69..392a82925d5f 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -389,6 +389,7 @@ submit_and_retry:
 		ret = io_submit_init_bio(io, bh);
 		if (ret)
 			return ret;
+		bio_set_streamid(io->io_bio, inode_streamid(inode));
 	}
 	ret = bio_add_page(io->io_bio, bh->b_page, bh->b_size, bh_offset(bh));
 	if (ret != bh->b_size)
-- 
2.4.0.rc2.1.g3d6bc9a

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH v2] Support for write stream IDs
@ 2015-05-05 20:02 Jens Axboe
  2015-05-05 20:07 ` Christoph Hellwig
  2015-05-05 20:51 ` Jeff Moyer
  0 siblings, 2 replies; 27+ messages in thread
From: Jens Axboe @ 2015-05-05 20:02 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel; +Cc: adilger, david

Hi,

Changes since the last posting:

- Added a specific per-file fadvise setting. POSIX_FADV_STREAMID sets
  the inode and file stream ID, POSIX_FADV_STREAMID_FILE sets just the
  file stream ID.

- Addressed review comments.

I've since run some testing with write streams. Test case was a RocksDB
overwrite benchmark, using 3 billion keys of 400B in size (numbers set
use the full size of the device). WAL/LOG was assigned to stream 1, and
each RocksDB compaction level used a separate stream. With streams
enabled, user write to device writes (write amplification) was at 2.33.
Without streams, the write amplification was 3.05. That is roughly 20%
less written NAND, and the streams test subsequently also had 20%
higher throughput.

Unless there are any grave concerns here, I'd like to merge this for
4.2.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2] Support for write stream IDs
  2015-05-05 20:02 [PATCH v2] Support " Jens Axboe
@ 2015-05-05 20:07 ` Christoph Hellwig
  2015-05-05 20:12   ` Jens Axboe
  2015-05-05 20:51 ` Jeff Moyer
  1 sibling, 1 reply; 27+ messages in thread
From: Christoph Hellwig @ 2015-05-05 20:07 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel, linux-fsdevel, adilger, david

On Tue, May 05, 2015 at 02:02:54PM -0600, Jens Axboe wrote:
> Unless there are any grave concerns here, I'd like to merge this for
> 4.2.

Without a driver implementing the feature I don't see any reason to even
consider merging it.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2] Support for write stream IDs
  2015-05-05 20:07 ` Christoph Hellwig
@ 2015-05-05 20:12   ` Jens Axboe
  2015-05-05 20:20     ` Christoph Hellwig
  0 siblings, 1 reply; 27+ messages in thread
From: Jens Axboe @ 2015-05-05 20:12 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-kernel, linux-fsdevel, adilger, david

On 05/05/2015 02:07 PM, Christoph Hellwig wrote:
> On Tue, May 05, 2015 at 02:02:54PM -0600, Jens Axboe wrote:
>> Unless there are any grave concerns here, I'd like to merge this for
>> 4.2.
>
> Without a driver implementing the feature I don't see any reason to even
> consider merging it.

We can't merge the NVMe bits until the proposal is included/finalized. 
But this is a problem. I don't want to add this to the Facebook kernel 
until we know the API is stable, while I have no problem adding 
experimental NVMe changes since those can be easily updated without 
impacting applications. The latter is not true for the user interface.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2] Support for write stream IDs
  2015-05-05 20:12   ` Jens Axboe
@ 2015-05-05 20:20     ` Christoph Hellwig
  2015-05-05 20:31       ` Jens Axboe
  0 siblings, 1 reply; 27+ messages in thread
From: Christoph Hellwig @ 2015-05-05 20:20 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Christoph Hellwig, linux-kernel, linux-fsdevel, adilger, david

On Tue, May 05, 2015 at 02:12:01PM -0600, Jens Axboe wrote:
> We can't merge the NVMe bits until the proposal is included/finalized. But
> this is a problem. I don't want to add this to the Facebook kernel until we
> know the API is stable, while I have no problem adding experimental NVMe
> changes since those can be easily updated without impacting applications.
> The latter is not true for the user interface.

They might never be finalized, and even if they are mere mortals might
never get this hardware.  Merging infrastructure without any users is a
bad idea in general, and merging infrastructure with no user that
exposes untestable user interface and bloats core data structures is
even worse.  I don't think this has any merit at all at this point.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2] Support for write stream IDs
  2015-05-05 20:20     ` Christoph Hellwig
@ 2015-05-05 20:31       ` Jens Axboe
  2015-05-05 20:43         ` Christoph Hellwig
  2015-05-05 20:50         ` Christoph Hellwig
  0 siblings, 2 replies; 27+ messages in thread
From: Jens Axboe @ 2015-05-05 20:31 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-kernel, linux-fsdevel, adilger, david

On 05/05/2015 02:20 PM, Christoph Hellwig wrote:
> On Tue, May 05, 2015 at 02:12:01PM -0600, Jens Axboe wrote:
>> We can't merge the NVMe bits until the proposal is included/finalized. But
>> this is a problem. I don't want to add this to the Facebook kernel until we
>> know the API is stable, while I have no problem adding experimental NVMe
>> changes since those can be easily updated without impacting applications.
>> The latter is not true for the user interface.
>
> They might never be finalized, and even if they are mere mortals might
> never get this hardware.

The likelihood of not getting streams support is minuscule. The benefits 
are just too large to ignore. It might not look like the current 
proposal, but it will get there. It's not like this is just one NVMe 
member wanting to push this, the only disagreement is whether this is 
going to be implemented as direct write tagging or through queue pairs.

Even outside of that, there are use cases for caching that need not have 
hardware assist.

> Merging infrastructure without any users is a
> bad idea in general, and merging infrastructure with no user that
> exposes untestable user interface and bloats core data structures is
> even worse.  I don't think this has any merit at all at this point.

There is a user, we are using it. And there's no data structure 
bloating, both the file and inode additions are filling existing holes. 
I'll strongly disagree with your statement that it has no merit at all. 
In fact, the merit is quite clear.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2] Support for write stream IDs
  2015-05-05 20:31       ` Jens Axboe
@ 2015-05-05 20:43         ` Christoph Hellwig
  2015-05-05 20:50         ` Christoph Hellwig
  1 sibling, 0 replies; 27+ messages in thread
From: Christoph Hellwig @ 2015-05-05 20:43 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Christoph Hellwig, linux-kernel, linux-fsdevel, adilger, david

On Tue, May 05, 2015 at 02:31:34PM -0600, Jens Axboe wrote:
> Even outside of that, there are use cases for caching that need not have
> hardware assist.

Could, would.  But in the meantime you'd adding dead wood kernel code,
and even worse user interfaces.

> >Merging infrastructure without any users is a
> >bad idea in general, and merging infrastructure with no user that
> >exposes untestable user interface and bloats core data structures is
> >even worse.  I don't think this has any merit at all at this point.
> 
> There is a user, we are using it. And there's no data structure bloating,
> both the file and inode additions are filling existing holes. I'll strongly
> disagree with your statement that it has no merit at all. In fact, the merit
> is quite clear.

Make sure everyone an get a nvme card from samsung supproting their
hack, and add it to the nvme driver keyed of a PCI ID check and we can
start talking about it.  We're not going to add hacks only a specific
big corporation can use.

Btw, the user interfacess really need man page additions and go past
linux-man and linux-api even in that case.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2] Support for write stream IDs
  2015-05-05 20:31       ` Jens Axboe
  2015-05-05 20:43         ` Christoph Hellwig
@ 2015-05-05 20:50         ` Christoph Hellwig
  1 sibling, 0 replies; 27+ messages in thread
From: Christoph Hellwig @ 2015-05-05 20:50 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Christoph Hellwig, linux-kernel, linux-fsdevel, adilger, david

On Tue, May 05, 2015 at 02:31:34PM -0600, Jens Axboe wrote:
> There is a user, we are using it. And there's no data structure bloating,
> both the file and inode additions are filling existing holes.

FYI, they do increase the size on any 32-bit architecture, which is
where it hurts most.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2] Support for write stream IDs
  2015-05-05 20:02 [PATCH v2] Support " Jens Axboe
  2015-05-05 20:07 ` Christoph Hellwig
@ 2015-05-05 20:51 ` Jeff Moyer
  2015-05-05 21:05   ` Jens Axboe
  1 sibling, 1 reply; 27+ messages in thread
From: Jeff Moyer @ 2015-05-05 20:51 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel, linux-fsdevel, adilger, david

Jens Axboe <axboe@fb.com> writes:

> Hi,
>
> Changes since the last posting:
>
> - Added a specific per-file fadvise setting. POSIX_FADV_STREAMID sets
>   the inode and file stream ID, POSIX_FADV_STREAMID_FILE sets just the
>   file stream ID.
>
> - Addressed review comments.
>
> I've since run some testing with write streams. Test case was a RocksDB
> overwrite benchmark, using 3 billion keys of 400B in size (numbers set
> use the full size of the device). WAL/LOG was assigned to stream 1, and
> each RocksDB compaction level used a separate stream. With streams
> enabled, user write to device writes (write amplification) was at 2.33.
> Without streams, the write amplification was 3.05. That is roughly 20%
> less written NAND, and the streams test subsequently also had 20%
> higher throughput.
>
> Unless there are any grave concerns here, I'd like to merge this for
> 4.2.

I have a few concerns.  You've added POSIX_FADV_* definitions that do
not exist in the SUS/POSIX spec.  Do we care?  We (poor reviewers) still
have no idea what the driver side of this will look like.  Do streams
need to be opened and closed?  Is that going to be handled transparently
by the kernel, or exposed to userspace?  If in the kernel, where in the
kernel?  You've also added a user-visible api without cc-ing linux-api.

My preference would be to wait for the spec to finalize before pushing
in changes that depend on it.

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2] Support for write stream IDs
  2015-05-05 20:51 ` Jeff Moyer
@ 2015-05-05 21:05   ` Jens Axboe
  2015-05-05 21:39     ` Martin K. Petersen
  0 siblings, 1 reply; 27+ messages in thread
From: Jens Axboe @ 2015-05-05 21:05 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: linux-kernel, linux-fsdevel, adilger, david

On 05/05/2015 02:51 PM, Jeff Moyer wrote:
> Jens Axboe <axboe@fb.com> writes:
>
>> Hi,
>>
>> Changes since the last posting:
>>
>> - Added a specific per-file fadvise setting. POSIX_FADV_STREAMID sets
>>    the inode and file stream ID, POSIX_FADV_STREAMID_FILE sets just the
>>    file stream ID.
>>
>> - Addressed review comments.
>>
>> I've since run some testing with write streams. Test case was a RocksDB
>> overwrite benchmark, using 3 billion keys of 400B in size (numbers set
>> use the full size of the device). WAL/LOG was assigned to stream 1, and
>> each RocksDB compaction level used a separate stream. With streams
>> enabled, user write to device writes (write amplification) was at 2.33.
>> Without streams, the write amplification was 3.05. That is roughly 20%
>> less written NAND, and the streams test subsequently also had 20%
>> higher throughput.
>>
>> Unless there are any grave concerns here, I'd like to merge this for
>> 4.2.
>
> I have a few concerns.  You've added POSIX_FADV_* definitions that do
> not exist in the SUS/POSIX spec.  Do we care?  We (poor reviewers) still
> have no idea what the driver side of this will look like.  Do streams
> need to be opened and closed?  Is that going to be handled transparently
> by the kernel, or exposed to userspace?  If in the kernel, where in the
> kernel?  You've also added a user-visible api without cc-ing linux-api.

Whether this should be fadvise, fcntl, or something else, that's the 
primary review concern.

The driver side depends on the driver! The kernel patches deal only with 
ensuring that the stream information gets passed down. If the device 
requires explicit stream open/close actions, then that needs to be 
handled on the side. There's no reason to include the kernel in that, 
the kernel doesn't care.

> My preference would be to wait for the spec to finalize before pushing
> in changes that depend on it.

I think you are mixing up the write streams with the NVMe proposal. The 
two aren't necessarily connected, and the kernel parts don't really care 
what the NVMe proposal ends up looking like. It's just an interface to 
assign an ID, and a transport for passing that ID down to a driver.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2] Support for write stream IDs
  2015-05-05 21:05   ` Jens Axboe
@ 2015-05-05 21:39     ` Martin K. Petersen
  2015-05-05 21:48       ` Jens Axboe
  0 siblings, 1 reply; 27+ messages in thread
From: Martin K. Petersen @ 2015-05-05 21:39 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Jeff Moyer, linux-kernel, linux-fsdevel, adilger, david

>>>>> "Jens" == Jens Axboe <axboe@fb.com> writes:

Jens> The kernel patches deal only with ensuring that the stream
Jens> information gets passed down. If the device requires explicit
Jens> stream open/close actions, then that needs to be handled on the
Jens> side.

Well, I'm deeply concerned about that "on the side" thing. I know you're
trying to make a shortcut by only caring about the transport mechanism
and deferring the whole programming model piece of the equation.

I think the latter is hugely important, though. And if we defer the
programming model question then the chances are that we'll be stuck with
something lame in the standards.

The storage vendors appear to think that a handful of stream ids ought
to be enough for anybody. And that IDs are a scarce resource that
they'll hand out for a limited time if you ask nicely.

I think that model is completely broken and also of limited use for
non-NVM types of storage. Sadly it's the same people who are driving the
NVMe and SCSI proposals so they are similar.

My concern is that by adding just enough plumbing to the kernel to allow
the current standards proposals to work then we'll be stuck with them
forever. And I think that would be very unfortunate.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2] Support for write stream IDs
  2015-05-05 21:39     ` Martin K. Petersen
@ 2015-05-05 21:48       ` Jens Axboe
  2015-05-05 22:09         ` Martin K. Petersen
  0 siblings, 1 reply; 27+ messages in thread
From: Jens Axboe @ 2015-05-05 21:48 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Jeff Moyer, linux-kernel, linux-fsdevel, adilger, david

On 05/05/2015 03:39 PM, Martin K. Petersen wrote:
>>>>>> "Jens" == Jens Axboe <axboe@fb.com> writes:
>
> Jens> The kernel patches deal only with ensuring that the stream
> Jens> information gets passed down. If the device requires explicit
> Jens> stream open/close actions, then that needs to be handled on the
> Jens> side.
>
> Well, I'm deeply concerned about that "on the side" thing. I know you're
> trying to make a shortcut by only caring about the transport mechanism
> and deferring the whole programming model piece of the equation.

I'm not trying to make a shortcut. I deliberately do not want to make ID 
generation/assignment part of the kernel. There's no reason that can't 
exist outside of the kernel, in a libstreamid or similar. In fact I 
already wrote that for the NVMe bits, it could be extended if other 
types of devices require open/close type actions for streams.

> I think the latter is hugely important, though. And if we defer the
> programming model question then the chances are that we'll be stuck with
> something lame in the standards.
>
> The storage vendors appear to think that a handful of stream ids ought
> to be enough for anybody. And that IDs are a scarce resource that
> they'll hand out for a limited time if you ask nicely.

I agree, that's an issue. And I just don't understand why that is. It's 
not like containerized or virtualized setups are esoteric these days. A 
handful of IDs might be enough for a single application type setup, but 
not for anything more elaborate.

> I think that model is completely broken and also of limited use for
> non-NVM types of storage. Sadly it's the same people who are driving the
> NVMe and SCSI proposals so they are similar.

It's completely broken for NVM as well. Most people need to farm out the 
use of such devices, as no single app can drive it to its limit anyways.

> My concern is that by adding just enough plumbing to the kernel to allow
> the current standards proposals to work then we'll be stuck with them
> forever. And I think that would be very unfortunate.

The current API doesn't have any real limits (it'll work from 
1..MAX_UINT), and the transport part handles 255 streams at the moment. 
The latter can be easily extended, we can just steal a few more bits. 
Making it 1023 would be a one liner.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2] Support for write stream IDs
  2015-05-05 21:48       ` Jens Axboe
@ 2015-05-05 22:09         ` Martin K. Petersen
  2015-05-06 14:26           ` Peter Zijlstra
                             ` (2 more replies)
  0 siblings, 3 replies; 27+ messages in thread
From: Martin K. Petersen @ 2015-05-05 22:09 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Martin K. Petersen, Jeff Moyer, linux-kernel, linux-fsdevel,
	adilger, david

>>>>> "Jens" == Jens Axboe <axboe@fb.com> writes:

Jens> I'm not trying to make a shortcut. I deliberately do not want to
Jens> make ID generation/assignment part of the kernel. There's no
Jens> reason that can't exist outside of the kernel, in a libstreamid or
Jens> similar.

That just perpetuates the broken model, though. Why wouldn't we want to
have stream ids readily available inside the kernel to tag journals,
filesystem metadata, data migration, who knows what?

Having storage micromanage the stream IDs is a non-starter. And it'll
also break things like software RAID, btrfs, LVM, anything that involves
multiple devices. ID X on first RAID disk then needs to be mapped to ID
Y on the second, etc.

The only sensible solution is for the kernel to manage the stream
IDs. And for them to be plentiful. The storage device is free to ignore
them, do LRU or whatever it pleases to manage them if it has an internal
limit on number of open streams, etc.

Jens> The current API doesn't have any real limits (it'll work from
Jens> 1..MAX_UINT), and the transport part handles 255 streams at the
Jens> moment. The latter can be easily extended, we can just steal a few
Jens> more bits. Making it 1023 would be a one liner.

I'm not so worried about the implementation. I'm more worried about it
being conducive to the broken proposal that's on the table.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2] Support for write stream IDs
  2015-05-05 22:09         ` Martin K. Petersen
@ 2015-05-06 14:26           ` Peter Zijlstra
  2015-05-06 17:25             ` Jens Axboe
  2015-05-06 16:50           ` Boaz Harrosh
  2015-05-06 17:21           ` Jens Axboe
  2 siblings, 1 reply; 27+ messages in thread
From: Peter Zijlstra @ 2015-05-06 14:26 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Jens Axboe, Jeff Moyer, linux-kernel, linux-fsdevel, adilger,
	david

On Tue, May 05, 2015 at 06:09:18PM -0400, Martin K. Petersen wrote:
> >>>>> "Jens" == Jens Axboe <axboe@fb.com> writes:
> 
> Jens> I'm not trying to make a shortcut. I deliberately do not want to
> Jens> make ID generation/assignment part of the kernel. There's no
> Jens> reason that can't exist outside of the kernel, in a libstreamid or
> Jens> similar.
> 
> That just perpetuates the broken model, though. Why wouldn't we want to
> have stream ids readily available inside the kernel to tag journals,
> filesystem metadata, data migration, who knows what?

> The only sensible solution is for the kernel to manage the stream
> IDs. And for them to be plentiful. The storage device is free to ignore
> them, do LRU or whatever it pleases to manage them if it has an internal
> limit on number of open streams, etc.

I agree; one of the primary tasks of an operating system kernel is
arbitrage and control of hardware resources.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2] Support for write stream IDs
  2015-05-06 14:26           ` Peter Zijlstra
@ 2015-05-06 17:25             ` Jens Axboe
  0 siblings, 0 replies; 27+ messages in thread
From: Jens Axboe @ 2015-05-06 17:25 UTC (permalink / raw)
  To: Peter Zijlstra, Martin K. Petersen
  Cc: Jeff Moyer, linux-kernel, linux-fsdevel, adilger, david

On 05/06/2015 08:26 AM, Peter Zijlstra wrote:
> On Tue, May 05, 2015 at 06:09:18PM -0400, Martin K. Petersen wrote:
>>>>>>> "Jens" == Jens Axboe <axboe@fb.com> writes:
>>
>> Jens> I'm not trying to make a shortcut. I deliberately do not want to
>> Jens> make ID generation/assignment part of the kernel. There's no
>> Jens> reason that can't exist outside of the kernel, in a libstreamid or
>> Jens> similar.
>>
>> That just perpetuates the broken model, though. Why wouldn't we want to
>> have stream ids readily available inside the kernel to tag journals,
>> filesystem metadata, data migration, who knows what?
>
>> The only sensible solution is for the kernel to manage the stream
>> IDs. And for them to be plentiful. The storage device is free to ignore
>> them, do LRU or whatever it pleases to manage them if it has an internal
>> limit on number of open streams, etc.
>
> I agree; one of the primary tasks of an operating system kernel is
> arbitrage and control of hardware resources.

Sure, that's what the kernel does, but the problem here includes that 
the IDs space and generation is currently being done by the device. So 
it's not a simple as divvying out an open space of resources.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2] Support for write stream IDs
  2015-05-05 22:09         ` Martin K. Petersen
  2015-05-06 14:26           ` Peter Zijlstra
@ 2015-05-06 16:50           ` Boaz Harrosh
  2015-05-06 17:21           ` Jens Axboe
  2 siblings, 0 replies; 27+ messages in thread
From: Boaz Harrosh @ 2015-05-06 16:50 UTC (permalink / raw)
  To: Martin K. Petersen, Jens Axboe
  Cc: Jeff Moyer, linux-kernel, linux-fsdevel, adilger, david

On 05/06/2015 01:09 AM, Martin K. Petersen wrote:
>>>>>> "Jens" == Jens Axboe <axboe@fb.com> writes:
<>
> 
> The only sensible solution is for the kernel to manage the stream
> IDs. And for them to be plentiful. The storage device is free to ignore
> them, do LRU or whatever it pleases to manage them if it has an internal
> limit on number of open streams, etc.
> 

Sometimes you do not have control over the "storage device" firmware
and/or it is already too late. But in such a case the LLD of the device
can do the state management (open/close) and translation (LRU).
If there are a lot of broken devices with the same small size that need
an LRU and/or state, bunch of LLDs can reuse a common library.

But hey yes I'm with Martin here. It sounds like a filesystem thing
and not an application thing.

>From the application side we already have the O_TMP hint and other
fadvise hits, but it is more the filesystem policy of what to do
with this.

Also current proposed solution for application is kind of a layering
violation. I set an Id at the filesystem level API: Open a file,
set an ID. But I acquire the ID from the block device under
the FS. This will never map well. In fact it calls for only a very
hard-coded hardware specific application that can use this.

Thanks
Boaz

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2] Support for write stream IDs
  2015-05-05 22:09         ` Martin K. Petersen
  2015-05-06 14:26           ` Peter Zijlstra
  2015-05-06 16:50           ` Boaz Harrosh
@ 2015-05-06 17:21           ` Jens Axboe
  2015-05-07 19:19             ` Martin K. Petersen
  2 siblings, 1 reply; 27+ messages in thread
From: Jens Axboe @ 2015-05-06 17:21 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Jeff Moyer, linux-kernel, linux-fsdevel, adilger, david

On 05/05/2015 04:09 PM, Martin K. Petersen wrote:
>>>>>> "Jens" == Jens Axboe <axboe@fb.com> writes:
>
> Jens> I'm not trying to make a shortcut. I deliberately do not want to
> Jens> make ID generation/assignment part of the kernel. There's no
> Jens> reason that can't exist outside of the kernel, in a libstreamid or
> Jens> similar.
>
> That just perpetuates the broken model, though. Why wouldn't we want to
> have stream ids readily available inside the kernel to tag journals,
> filesystem metadata, data migration, who knows what?
>
> Having storage micromanage the stream IDs is a non-starter. And it'll
> also break things like software RAID, btrfs, LVM, anything that involves
> multiple devices. ID X on first RAID disk then needs to be mapped to ID
> Y on the second, etc.
>
> The only sensible solution is for the kernel to manage the stream
> IDs. And for them to be plentiful. The storage device is free to ignore
> them, do LRU or whatever it pleases to manage them if it has an internal
> limit on number of open streams, etc.

OK, that does make some sense. That would mean putting the ID management 
in the kernel, where devices would register a handler to be a part of 
this process. That would need to include mapping between user/kernel 
stream IDx and device stream IDx, since they would not necessarily be 
the same. This assumes we will have devices that manage their own 
streams, which seems to be the safe bet (that's what is currently out 
there). That would work for stacked/btrfs setups too.

This wont solve the problem of devices having too few streams. But it'll 
work regardless, we'll just have to push them separately to do that. 
It's not an easy problem for them either, resource constraints on the 
device side could exclude supporting as many streams as we would ideally 
want.

> Jens> The current API doesn't have any real limits (it'll work from
> Jens> 1..MAX_UINT), and the transport part handles 255 streams at the
> Jens> moment. The latter can be easily extended, we can just steal a few
> Jens> more bits. Making it 1023 would be a one liner.
>
> I'm not so worried about the implementation. I'm more worried about it
> being conducive to the broken proposal that's on the table.

In some ways I get it, you have to start somewhere. The current proposal 
is useful for _some_ cases, it's not great for everything. As long as it 
can be expanded to support as many streams as we would want, then it 
would work. It's (again) a bit of a chicken and egg problem. We need to 
make some progress, or the whole thing is going to go away. And I think 
that'd be a shame, since there's definitely merit to passing these 
lifetime hints to the device.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2] Support for write stream IDs
  2015-05-06 17:21           ` Jens Axboe
@ 2015-05-07 19:19             ` Martin K. Petersen
  2015-05-08 18:48               ` Jens Axboe
  0 siblings, 1 reply; 27+ messages in thread
From: Martin K. Petersen @ 2015-05-07 19:19 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Martin K. Petersen, Jeff Moyer, linux-kernel, linux-fsdevel,
	adilger, david

>>>>> "Jens" == Jens Axboe <axboe@fb.com> writes:

Jens> This wont solve the problem of devices having too few streams. But
Jens> it'll work regardless, we'll just have to push them separately to
Jens> do that. It's not an easy problem for them either, resource
Jens> constraints on the device side could exclude supporting as many
Jens> streams as we would ideally want.

But they already have to manage *every* other resource that way: Read
cache, write cache, flash channels, open zones on ZAC/ZBC. If they run
out of memory and have to internally close one stream context to open
another that's their business. If the concurrent ID count is low,
performance their particular widgets is going to suck for some
applications and people will avoid them. Boo hoo.

I'm super happy the SSD industry (well, the market) came to its senses
and abolished all the outrageous demands put on the I/O stack to
overcome erase block size and write amplification issues. Now all that's
a solved problem and we can move on.

Next problem child was the host managed zoned disk madness. Yet another
device implementation headache that suddenly requires us to reinvent
filesystems and the entire I/O stack.

Next in the pipeline is the stream ID stuff. Which once again puts the
burden on us to overcome device implementation issues and misunderstands
how operating systems work.

There are two fundamental problems:

 - The standards are developed by device vendors with little to no input
   from the OS vendors

 - The standards proposals are written, edited, and declared complete
   before anybody actually tries to implement them

That's how we end up with all these lame duck spec extensions that are
device implementation-specific and impossible to use generically.

I scream as loudly as I can. But I am but one voice in a sea of device
vendors. And unless we Linux developers start pushing back in unison
we'll end up in a quagmire of epic proportions.

Jens> In some ways I get it, you have to start somewhere. The current
Jens> proposal is useful for _some_ cases, it's not great for
Jens> everything. As long as it can be expanded to support as many
Jens> streams as we would want, then it would work. It's (again) a bit
Jens> of a chicken and egg problem. We need to make some progress, or
Jens> the whole thing is going to go away. And I think that'd be a
Jens> shame, since there's definitely merit to passing these lifetime
Jens> hints to the device.

The problem is that once these things end up in the standards, the CDB
fields are gone. And if the clunky intermediate stuff is "good enough"
then the motivation to fix things properly goes away.

There are many, many reasons why stream IDs are a good thing. Above and
beyond what the current proposals want. The notion of tagging is a much
better abstraction than bootiness and guessing a percentage for how
sequential future accesses might be. It's a simple, clean interface that
the device--regardless of media type and implementation--can benefit
from.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2] Support for write stream IDs
  2015-05-07 19:19             ` Martin K. Petersen
@ 2015-05-08 18:48               ` Jens Axboe
  2015-05-12  2:50                 ` Martin K. Petersen
  0 siblings, 1 reply; 27+ messages in thread
From: Jens Axboe @ 2015-05-08 18:48 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Jeff Moyer, linux-kernel, linux-fsdevel, adilger, david

On 05/07/2015 01:19 PM, Martin K. Petersen wrote:
>>>>>> "Jens" == Jens Axboe <axboe@fb.com> writes:
>
> Jens> This wont solve the problem of devices having too few streams. But
> Jens> it'll work regardless, we'll just have to push them separately to
> Jens> do that. It's not an easy problem for them either, resource
> Jens> constraints on the device side could exclude supporting as many
> Jens> streams as we would ideally want.
>
> But they already have to manage *every* other resource that way: Read
> cache, write cache, flash channels, open zones on ZAC/ZBC. If they run
> out of memory and have to internally close one stream context to open
> another that's their business. If the concurrent ID count is low,
> performance their particular widgets is going to suck for some
> applications and people will avoid them. Boo hoo.

There are actual technical challenges on the device side that sometimes 
interferes. Say you currently implemented your flash device as a log 
based structure. The only way to even support streams is to have more 
logs you can append to. So perhaps then you can't support streams, boo 
hoo for them. Maybe you don't have a fixed log, you can write anywhere. 
But you can only have so many erase blocks open at one time. Not a huge 
concern, you have to manage that and open/close streams as you need to. 
That's basic resource management. But even if you do that, erase blocks 
are not small these days, even for big devices there are only so many of 
them (we're talking thousands in total, not millions). There's a very 
real lower upper bound there on what can be supported.

It's easy enough being on the device side and punting everything to the 
OS. Why wouldn't you? Then it's out of your hair. At the same time, on 
the other side, there's also an OS tendency to whine and want 
everything, and helpfully all the time.

The reality is that we can't demand that devices support thousands of 
streams. It'd be nice if they did and we didn't have to care at all, but 
realistically, that is not going to happen nor is it a completely sane 
demand. And while that may not be perfect, it's still a worthwhile 
improvement and wont preclude a hopefully rosier future where we have 
more streams. Lets say we have 8 streams now. We need some sort of 
policy to multiplex those streams. That's the current challenge. I can 
add the kernel managed streams, and I'll do that.

> I'm super happy the SSD industry (well, the market) came to its senses
> and abolished all the outrageous demands put on the I/O stack to
> overcome erase block size and write amplification issues. Now all that's
> a solved problem and we can move on.

I would not say write amplification is a "solved issue", in fact we're 
attempting to improve it with this :-). But I know what you mean, and 
yes, that was a sad situation. I don't think it's a fair comparison, though.

> Next problem child was the host managed zoned disk madness. Yet another
> device implementation headache that suddenly requires us to reinvent
> filesystems and the entire I/O stack.
>
> Next in the pipeline is the stream ID stuff. Which once again puts the
> burden on us to overcome device implementation issues and misunderstands
> how operating systems work.

Again, I don't think that's a fair comparison. Write streams are useful. 
And adding support for write streams, even in a limited fashion, can be 
directly extended when/if more stream IDs are available. The only change 
would be in the management policy. The basic premise of open stream, use 
stream, close stream - those would be the same. I get that you don't 
like that we need to manually open and close streams, but honestly, 
those are useful hints to the device, even if they don't need to do 
anything about them explicitly.

> There are two fundamental problems:
>
>   - The standards are developed by device vendors with little to no input
>     from the OS vendors
>
>   - The standards proposals are written, edited, and declared complete
>     before anybody actually tries to implement them
>
> That's how we end up with all these lame duck spec extensions that are
> device implementation-specific and impossible to use generically.

The write streams proposal was already approved by t10...

> There are many, many reasons why stream IDs are a good thing. Above and
> beyond what the current proposals want. The notion of tagging is a much
> better abstraction than bootiness and guessing a percentage for how
> sequential future accesses might be. It's a simple, clean interface that
> the device--regardless of media type and implementation--can benefit
> from.

Exactly, which is why the streams is the first hinting mechanism that I 
actually think can work. The previous stuff has been utter crap, and 
I've always ignored it for exactly that reason. If we want/need policy 
on top of the streams, we can implement that independently.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2] Support for write stream IDs
  2015-05-08 18:48               ` Jens Axboe
@ 2015-05-12  2:50                 ` Martin K. Petersen
  0 siblings, 0 replies; 27+ messages in thread
From: Martin K. Petersen @ 2015-05-12  2:50 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Martin K. Petersen, Jeff Moyer, linux-kernel, linux-fsdevel,
	adilger, david

>>>>> "Jens" == Jens Axboe <axboe@fb.com> writes:

Jens,

Jens> There are actual technical challenges on the device side that
Jens> sometimes interferes. [...]

Right now we use the same protocol to speak to USB keys and million
dollar storage arrays. That's because the protocol was designed to be
abstract and completely device agnostic.

What's happening with flash devices and SMR is that all of a sudden
device implementation challenges are being addressed by putting them in
the protocol and punting them to the OS.

That's an obvious and cost-saving approach for a device vendor to take.
But the world would be a different place if we were still dealing with
MFM, RLL, C/H/S addressing and other implementation-specific horrors of
the past. And if that approach had continued we would explicitly have to
deal with erase blocks on USB sticks and manually drive RAID logic
inside disk arrays. But thankfully, with a few exceptions, we didn't go
there.

My beef with the current stream ID stuff and ZAC/ZBC is that those are
steps in the wrong direction in that they are both exclusively focused
on addressing implementation challenges specific to certain kinds of
devices.

The notion of letting the OS tag things as belonging together or being
independent is a useful concept that benefits *any* kind of device.
Those tags can easily be mapped to resource streams in a flash device or
a particular zone cache segment on an SMR drive or in an array.

I would just like the tag to be completely arbitrary so we can manage it
on behalf of all applications and devices. That puts the burden on the
device to manage the OS tag to internal resource mapping but I think
that's a small price to pay to have a concept that works for all classes
of devices, software RAID, etc.

This does not in any way preclude the device communicating "I'd prefer
if you only kept 8 streams going/8 zones open" like we do with all the
other device characteristics. My gripe is that the programming model is
being forcefully changed so we now have to get a permit before
submitting an I/O. And very aggressively clean up since the permits are
being envisioned as super scarce.

Jens> The reality is that we can't demand that devices support thousands
Jens> of streams.

Why not? It's just a number. Tracking a large number of independent
streams hasn't been a problem for storage arrays. Nobody says a 32-bit
ID requires you to concurrently track bazillions of streams. Pick a
reasonable number of live contexts given your device's actual resources.

Jens> The write streams proposal was already approved by t10...

Nope. It's still being discussed.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2015-05-12  2:50 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-04-18 20:03 [PATCH v2] Support for write stream IDs Jens Axboe
2015-04-18 20:03 ` [PATCH 1/7] block: add support for carrying a stream ID in a bio Jens Axboe
2015-04-18 20:03 ` [PATCH 2/7] Add support for per-file/inode stream ID Jens Axboe
2015-04-18 20:03 ` [PATCH 3/7] direct-io: add support for write stream IDs Jens Axboe
2015-04-18 20:03 ` [PATCH 4/7] Add stream ID support for buffered mpage/__block_write_full_page() Jens Axboe
2015-04-18 20:03 ` [PATCH 5/7] btrfs: add support for write stream IDs Jens Axboe
2015-04-18 20:03 ` [PATCH 6/7] xfs: add support for buffered writeback stream ID Jens Axboe
2015-04-18 20:03 ` [PATCH 7/7] ext4: add support for write stream IDs Jens Axboe
  -- strict thread matches above, loose matches on Subject: below --
2015-05-05 20:02 [PATCH v2] Support " Jens Axboe
2015-05-05 20:07 ` Christoph Hellwig
2015-05-05 20:12   ` Jens Axboe
2015-05-05 20:20     ` Christoph Hellwig
2015-05-05 20:31       ` Jens Axboe
2015-05-05 20:43         ` Christoph Hellwig
2015-05-05 20:50         ` Christoph Hellwig
2015-05-05 20:51 ` Jeff Moyer
2015-05-05 21:05   ` Jens Axboe
2015-05-05 21:39     ` Martin K. Petersen
2015-05-05 21:48       ` Jens Axboe
2015-05-05 22:09         ` Martin K. Petersen
2015-05-06 14:26           ` Peter Zijlstra
2015-05-06 17:25             ` Jens Axboe
2015-05-06 16:50           ` Boaz Harrosh
2015-05-06 17:21           ` Jens Axboe
2015-05-07 19:19             ` Martin K. Petersen
2015-05-08 18:48               ` Jens Axboe
2015-05-12  2:50                 ` Martin K. Petersen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).