* [PATCH v2] Support for write stream IDs @ 2015-04-18 20:03 Jens Axboe 2015-04-18 20:03 ` [PATCH 1/7] block: add support for carrying a stream ID in a bio Jens Axboe ` (6 more replies) 0 siblings, 7 replies; 27+ messages in thread From: Jens Axboe @ 2015-04-18 20:03 UTC (permalink / raw) To: axboe, linux-kernel, linux-fsdevel; +Cc: ming.l, adilger, david Hi, v2 of this posting. Changes since v1: - Rebased on top of current master. - Fix EINVAL -> -EINVAL typo. - Cleanup up BIO_STREAM_OFFSET definition. - Pack i_streamid and f_streamid better into struct file and struct inode. - Add a separate per-file hint, FADV_FILE_STREAMID. This only sets the write stream on the file, not the inode. FADV_STREAMID sets the hint both in the file and the inode. block/bio.c | 2 ++ block/blk-core.c | 3 +++ fs/btrfs/extent_io.c | 1 + fs/btrfs/inode.c | 1 + fs/buffer.c | 4 ++-- fs/direct-io.c | 4 ++++ fs/ext4/page-io.c | 1 + fs/inode.c | 1 + fs/mpage.c | 1 + fs/open.c | 1 + fs/xfs/xfs_aops.c | 1 + include/linux/blk_types.h | 28 +++++++++++++++++++++++++++- include/linux/fs.h | 22 ++++++++++++++++++++++ include/uapi/linux/fadvise.h | 3 +++ mm/fadvise.c | 25 +++++++++++++++++++++++++ 15 files changed, 95 insertions(+), 3 deletions(-) ^ permalink raw reply [flat|nested] 27+ messages in thread
* [PATCH 1/7] block: add support for carrying a stream ID in a bio 2015-04-18 20:03 [PATCH v2] Support for write stream IDs Jens Axboe @ 2015-04-18 20:03 ` Jens Axboe 2015-04-18 20:03 ` [PATCH 2/7] Add support for per-file/inode stream ID Jens Axboe ` (5 subsequent siblings) 6 siblings, 0 replies; 27+ messages in thread From: Jens Axboe @ 2015-04-18 20:03 UTC (permalink / raw) To: axboe, linux-kernel, linux-fsdevel; +Cc: ming.l, adilger, david, Jens Axboe The top bits of bio->bi_flags are reserved for keeping the allocation pool, set aside the next eight bits for carrying a stream ID. That leaves us with support for 255 streams, 0 is reserved as a "stream not set" value. Add helpers for setting/getting stream ID of a bio. Signed-off-by: Jens Axboe <axboe@fb.com> --- block/bio.c | 2 ++ block/blk-core.c | 3 +++ include/linux/blk_types.h | 28 +++++++++++++++++++++++++++- 3 files changed, 32 insertions(+), 1 deletion(-) diff --git a/block/bio.c b/block/bio.c index f66a4eae16ee..1cd3d745047c 100644 --- a/block/bio.c +++ b/block/bio.c @@ -567,6 +567,7 @@ void __bio_clone_fast(struct bio *bio, struct bio *bio_src) bio->bi_rw = bio_src->bi_rw; bio->bi_iter = bio_src->bi_iter; bio->bi_io_vec = bio_src->bi_io_vec; + bio_set_streamid(bio, bio_get_streamid(bio_src)); } EXPORT_SYMBOL(__bio_clone_fast); @@ -672,6 +673,7 @@ integrity_clone: } } + bio_set_streamid(bio, bio_get_streamid(bio_src)); return bio; } EXPORT_SYMBOL(bio_clone_bioset); diff --git a/block/blk-core.c b/block/blk-core.c index fd154b94447a..6276ce1ad46b 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -1941,6 +1941,9 @@ void generic_make_request(struct bio *bio) do { struct request_queue *q = bdev_get_queue(bio->bi_bdev); + if (bio_streamid_valid(bio)) + blk_add_trace_msg(q, "StreamID=%u", bio_get_streamid(bio)); + q->make_request_fn(q, bio); bio = bio_list_pop(current->bio_list); diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h index a1b25e35ea5f..0678c7baa7b1 100644 --- a/include/linux/blk_types.h +++ b/include/linux/blk_types.h @@ -138,9 +138,35 @@ struct bio { #define BIO_POOL_BITS (4) #define BIO_POOL_NONE ((1UL << BIO_POOL_BITS) - 1) #define BIO_POOL_OFFSET (BITS_PER_LONG - BIO_POOL_BITS) -#define BIO_POOL_MASK (1UL << BIO_POOL_OFFSET) #define BIO_POOL_IDX(bio) ((bio)->bi_flags >> BIO_POOL_OFFSET) +/* + * after the pool bits, next 8 bits are for the stream id + */ +#define BIO_STREAM_BITS (8) +#define BIO_STREAM_OFFSET (BIO_POOL_OFFSET - BIO_STREAM_BITS) +#define BIO_STREAM_MASK ((1 << BIO_STREAM_BITS) - 1) + +static inline unsigned long streamid_to_flags(unsigned int id) +{ + return (unsigned long) (id & BIO_STREAM_MASK) << BIO_STREAM_OFFSET; +} + +static inline void bio_set_streamid(struct bio *bio, unsigned int id) +{ + bio->bi_flags |= streamid_to_flags(id); +} + +static inline unsigned int bio_get_streamid(struct bio *bio) +{ + return (bio->bi_flags >> BIO_STREAM_OFFSET) & BIO_STREAM_MASK; +} + +static inline bool bio_streamid_valid(struct bio *bio) +{ + return bio_get_streamid(bio) != 0; +} + #endif /* CONFIG_BLOCK */ /* -- 2.4.0.rc2.1.g3d6bc9a ^ permalink raw reply related [flat|nested] 27+ messages in thread
* [PATCH 2/7] Add support for per-file/inode stream ID 2015-04-18 20:03 [PATCH v2] Support for write stream IDs Jens Axboe 2015-04-18 20:03 ` [PATCH 1/7] block: add support for carrying a stream ID in a bio Jens Axboe @ 2015-04-18 20:03 ` Jens Axboe 2015-04-18 20:03 ` [PATCH 3/7] direct-io: add support for write stream IDs Jens Axboe ` (4 subsequent siblings) 6 siblings, 0 replies; 27+ messages in thread From: Jens Axboe @ 2015-04-18 20:03 UTC (permalink / raw) To: axboe, linux-kernel, linux-fsdevel; +Cc: ming.l, adilger, david, Jens Axboe Writing on flash devices can be much more efficient, if we can inform the device what kind of data can be grouped together. If the device is able to group data together with similar lifetimes, then it can be more efficient in garbage collection. This, in turn, leads to lower write amplification, which is a win on both device wear and performance. Add a new fadvise hint, POSIX_FADV_STREAMID, which sets the file and inode streamid. The file streamid is used if we have the file available at the time of the write (O_DIRECT), we use the inode streamid if not (buffered writeback). The fadvise hint uses the 'offset' field to specify a stream ID. A second POSIX_FADV_FILE_STREAMID sets only the stream ID on the file, not the inode. Signed-off-by: Jens Axboe <axboe@fb.com> --- fs/inode.c | 1 + fs/open.c | 1 + include/linux/fs.h | 22 ++++++++++++++++++++++ include/uapi/linux/fadvise.h | 3 +++ mm/fadvise.c | 25 +++++++++++++++++++++++++ 5 files changed, 52 insertions(+) diff --git a/fs/inode.c b/fs/inode.c index f00b16f45507..41885322ba64 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -149,6 +149,7 @@ int inode_init_always(struct super_block *sb, struct inode *inode) inode->i_blocks = 0; inode->i_bytes = 0; inode->i_generation = 0; + inode->i_streamid = 0; inode->i_pipe = NULL; inode->i_bdev = NULL; inode->i_cdev = NULL; diff --git a/fs/open.c b/fs/open.c index 6796f04d6032..979f34ff165b 100644 --- a/fs/open.c +++ b/fs/open.c @@ -743,6 +743,7 @@ static int do_dentry_open(struct file *f, f->f_flags &= ~(O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC); file_ra_state_init(&f->f_ra, f->f_mapping->host->i_mapping); + f->f_streamid = 0; return 0; diff --git a/include/linux/fs.h b/include/linux/fs.h index c7496f263860..f43b866c3639 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -648,6 +648,7 @@ struct inode { #ifdef CONFIG_IMA atomic_t i_readcount; /* struct files open RO */ #endif + unsigned int i_streamid; const struct file_operations *i_fop; /* former ->i_op->default_file_ops */ struct file_lock_context *i_flctx; struct address_space i_data; @@ -668,6 +669,14 @@ struct inode { void *i_private; /* fs or device private pointer */ }; +static inline unsigned int inode_streamid(struct inode *inode) +{ + if (inode) + return inode->i_streamid; + + return 0; +} + static inline int inode_unhashed(struct inode *inode) { return hlist_unhashed(&inode->i_hash); @@ -839,6 +848,7 @@ struct file { * Must not be taken from IRQ context. */ spinlock_t f_lock; + unsigned int f_streamid; atomic_long_t f_count; unsigned int f_flags; fmode_t f_mode; @@ -870,6 +880,18 @@ struct file_handle { unsigned char f_handle[0]; }; +/* + * If the file doesn't have a stream ID set, return the inode stream ID + * in case that has been set. + */ +static inline unsigned int file_streamid(struct file *f) +{ + if (f->f_streamid) + return f->f_streamid; + + return inode_streamid(f->f_inode); +} + static inline struct file *get_file(struct file *f) { atomic_long_inc(&f->f_count); diff --git a/include/uapi/linux/fadvise.h b/include/uapi/linux/fadvise.h index e8e747139b9a..e6e561db7f88 100644 --- a/include/uapi/linux/fadvise.h +++ b/include/uapi/linux/fadvise.h @@ -18,4 +18,7 @@ #define POSIX_FADV_NOREUSE 5 /* Data will be accessed once. */ #endif +#define POSIX_FADV_STREAMID 8 /* associate stream ID with file+inode */ +#define POSIX_FADV_FILE_STREAMID 9 /* associate stream ID with file */ + #endif /* FADVISE_H_INCLUDED */ diff --git a/mm/fadvise.c b/mm/fadvise.c index 4a3907cf79f8..a56e81840040 100644 --- a/mm/fadvise.c +++ b/mm/fadvise.c @@ -60,6 +60,8 @@ SYSCALL_DEFINE4(fadvise64_64, int, fd, loff_t, offset, loff_t, len, int, advice) case POSIX_FADV_WILLNEED: case POSIX_FADV_NOREUSE: case POSIX_FADV_DONTNEED: + case POSIX_FADV_STREAMID: + case POSIX_FADV_FILE_STREAMID: /* no bad return value, but ignore advice */ break; default: @@ -144,6 +146,29 @@ SYSCALL_DEFINE4(fadvise64_64, int, fd, loff_t, offset, loff_t, len, int, advice) } } break; + case POSIX_FADV_STREAMID: + case POSIX_FADV_FILE_STREAMID: + /* + * streamid is stored in offset... we don't limit or check + * if the device supports streams, or if it does, if the + * stream nr is within the limits. 1 is the lowest valid + * stream id, 0 is "don't care/know". + */ + if (offset != (unsigned int) offset) { + ret = -EINVAL; + break; + } + /* + * FILE_STREAMID stores only in the file, STREAMID stores + * the stream hint in both the file and the inode. + */ + f.file->f_streamid = offset; + if (advice == POSIX_FADV_STREAMID) { + spin_lock(&inode->i_lock); + inode->i_streamid = offset; + spin_unlock(&inode->i_lock); + } + break; default: ret = -EINVAL; } -- 2.4.0.rc2.1.g3d6bc9a ^ permalink raw reply related [flat|nested] 27+ messages in thread
* [PATCH 3/7] direct-io: add support for write stream IDs 2015-04-18 20:03 [PATCH v2] Support for write stream IDs Jens Axboe 2015-04-18 20:03 ` [PATCH 1/7] block: add support for carrying a stream ID in a bio Jens Axboe 2015-04-18 20:03 ` [PATCH 2/7] Add support for per-file/inode stream ID Jens Axboe @ 2015-04-18 20:03 ` Jens Axboe 2015-04-18 20:03 ` [PATCH 4/7] Add stream ID support for buffered mpage/__block_write_full_page() Jens Axboe ` (3 subsequent siblings) 6 siblings, 0 replies; 27+ messages in thread From: Jens Axboe @ 2015-04-18 20:03 UTC (permalink / raw) To: axboe, linux-kernel, linux-fsdevel; +Cc: ming.l, adilger, david, Jens Axboe Get the streamid from the file, if any, and set it on the bio. Signed-off-by: Jens Axboe <axboe@fb.com> --- fs/direct-io.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/fs/direct-io.c b/fs/direct-io.c index c3b560b24a46..d318a143b186 100644 --- a/fs/direct-io.c +++ b/fs/direct-io.c @@ -76,6 +76,7 @@ struct dio_submit { int reap_counter; /* rate limit reaping */ sector_t final_block_in_request;/* doesn't change */ int boundary; /* prev block is at a boundary */ + int streamid; /* Write stream ID */ get_block_t *get_block; /* block mapping function */ dio_submit_t *submit_io; /* IO submition function */ @@ -371,6 +372,8 @@ dio_bio_alloc(struct dio *dio, struct dio_submit *sdio, sdio->bio = bio; sdio->logical_offset_in_bio = sdio->cur_page_fs_offset; + + bio_set_streamid(bio, sdio->streamid); } /* @@ -1201,6 +1204,7 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode, sdio.blkbits = blkbits; sdio.blkfactor = i_blkbits - blkbits; sdio.block_in_file = offset >> blkbits; + sdio.streamid = file_streamid(iocb->ki_filp); sdio.get_block = get_block; dio->end_io = end_io; -- 2.4.0.rc2.1.g3d6bc9a ^ permalink raw reply related [flat|nested] 27+ messages in thread
* [PATCH 4/7] Add stream ID support for buffered mpage/__block_write_full_page() 2015-04-18 20:03 [PATCH v2] Support for write stream IDs Jens Axboe ` (2 preceding siblings ...) 2015-04-18 20:03 ` [PATCH 3/7] direct-io: add support for write stream IDs Jens Axboe @ 2015-04-18 20:03 ` Jens Axboe 2015-04-18 20:03 ` [PATCH 5/7] btrfs: add support for write stream IDs Jens Axboe ` (2 subsequent siblings) 6 siblings, 0 replies; 27+ messages in thread From: Jens Axboe @ 2015-04-18 20:03 UTC (permalink / raw) To: axboe, linux-kernel, linux-fsdevel; +Cc: ming.l, adilger, david, Jens Axboe Pass on the inode stream ID to the bio allocation. Signed-off-by: Jens Axboe <axboe@fb.com> --- fs/buffer.c | 4 ++-- fs/mpage.c | 1 + 2 files changed, 3 insertions(+), 2 deletions(-) diff --git a/fs/buffer.c b/fs/buffer.c index c7a5602d01ee..5191523cec56 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -1774,7 +1774,7 @@ static int __block_write_full_page(struct inode *inode, struct page *page, do { struct buffer_head *next = bh->b_this_page; if (buffer_async_write(bh)) { - submit_bh(write_op, bh); + _submit_bh(write_op, bh, streamid_to_flags(inode_streamid(inode))); nr_underway++; } bh = next; @@ -1828,7 +1828,7 @@ recover: struct buffer_head *next = bh->b_this_page; if (buffer_async_write(bh)) { clear_buffer_dirty(bh); - submit_bh(write_op, bh); + _submit_bh(write_op, bh, streamid_to_flags(inode_streamid(inode))); nr_underway++; } bh = next; diff --git a/fs/mpage.c b/fs/mpage.c index 3e79220babac..fba13f4b981d 100644 --- a/fs/mpage.c +++ b/fs/mpage.c @@ -605,6 +605,7 @@ alloc_new: bio_get_nr_vecs(bdev), GFP_NOFS|__GFP_HIGH); if (bio == NULL) goto confused; + bio_set_streamid(bio, inode_streamid(inode)); } /* -- 2.4.0.rc2.1.g3d6bc9a ^ permalink raw reply related [flat|nested] 27+ messages in thread
* [PATCH 5/7] btrfs: add support for write stream IDs 2015-04-18 20:03 [PATCH v2] Support for write stream IDs Jens Axboe ` (3 preceding siblings ...) 2015-04-18 20:03 ` [PATCH 4/7] Add stream ID support for buffered mpage/__block_write_full_page() Jens Axboe @ 2015-04-18 20:03 ` Jens Axboe 2015-04-18 20:03 ` [PATCH 6/7] xfs: add support for buffered writeback stream ID Jens Axboe 2015-04-18 20:03 ` [PATCH 7/7] ext4: add support for write stream IDs Jens Axboe 6 siblings, 0 replies; 27+ messages in thread From: Jens Axboe @ 2015-04-18 20:03 UTC (permalink / raw) To: axboe, linux-kernel, linux-fsdevel; +Cc: ming.l, adilger, david, Jens Axboe Both buffered and O_DIRECT. Signed-off-by: Jens Axboe <axboe@fb.com> Acked-by: Chris Mason <clm@fb.com> --- fs/btrfs/extent_io.c | 1 + fs/btrfs/inode.c | 1 + 2 files changed, 2 insertions(+) diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index d688cfe5d496..2845fae054b6 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -2838,6 +2838,7 @@ static int submit_extent_page(int rw, struct extent_io_tree *tree, bio_add_page(bio, page, page_size, offset); bio->bi_end_io = end_io_func; bio->bi_private = tree; + bio_set_streamid(bio, inode_streamid(page->mapping->host)); if (bio_ret) *bio_ret = bio; diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 43192e10cc43..8ff7527e8d05 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -8046,6 +8046,7 @@ static void btrfs_submit_direct(int rw, struct bio *dio_bio, atomic_set(&dip->pending_bios, 0); btrfs_bio = btrfs_io_bio(io_bio); btrfs_bio->logical = file_offset; + bio_set_streamid(io_bio, inode_streamid(inode)); if (write) { io_bio->bi_end_io = btrfs_endio_direct_write; -- 2.4.0.rc2.1.g3d6bc9a ^ permalink raw reply related [flat|nested] 27+ messages in thread
* [PATCH 6/7] xfs: add support for buffered writeback stream ID 2015-04-18 20:03 [PATCH v2] Support for write stream IDs Jens Axboe ` (4 preceding siblings ...) 2015-04-18 20:03 ` [PATCH 5/7] btrfs: add support for write stream IDs Jens Axboe @ 2015-04-18 20:03 ` Jens Axboe 2015-04-18 20:03 ` [PATCH 7/7] ext4: add support for write stream IDs Jens Axboe 6 siblings, 0 replies; 27+ messages in thread From: Jens Axboe @ 2015-04-18 20:03 UTC (permalink / raw) To: axboe, linux-kernel, linux-fsdevel; +Cc: ming.l, adilger, david, Jens Axboe Signed-off-by: Jens Axboe <axboe@fb.com> --- fs/xfs/xfs_aops.c | 1 + 1 file changed, 1 insertion(+) diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c index 1d8eef9cf0f5..6d166e55de9a 100644 --- a/fs/xfs/xfs_aops.c +++ b/fs/xfs/xfs_aops.c @@ -376,6 +376,7 @@ xfs_submit_ioend_bio( atomic_inc(&ioend->io_remaining); bio->bi_private = ioend; bio->bi_end_io = xfs_end_bio; + bio_set_streamid(bio, ioend->io_inode->i_streamid); submit_bio(wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC : WRITE, bio); } -- 2.4.0.rc2.1.g3d6bc9a ^ permalink raw reply related [flat|nested] 27+ messages in thread
* [PATCH 7/7] ext4: add support for write stream IDs 2015-04-18 20:03 [PATCH v2] Support for write stream IDs Jens Axboe ` (5 preceding siblings ...) 2015-04-18 20:03 ` [PATCH 6/7] xfs: add support for buffered writeback stream ID Jens Axboe @ 2015-04-18 20:03 ` Jens Axboe 6 siblings, 0 replies; 27+ messages in thread From: Jens Axboe @ 2015-04-18 20:03 UTC (permalink / raw) To: axboe, linux-kernel, linux-fsdevel; +Cc: ming.l, adilger, david, Jens Axboe Signed-off-by: Jens Axboe <axboe@fb.com> --- fs/ext4/page-io.c | 1 + 1 file changed, 1 insertion(+) diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c index 464984261e69..392a82925d5f 100644 --- a/fs/ext4/page-io.c +++ b/fs/ext4/page-io.c @@ -389,6 +389,7 @@ submit_and_retry: ret = io_submit_init_bio(io, bh); if (ret) return ret; + bio_set_streamid(io->io_bio, inode_streamid(inode)); } ret = bio_add_page(io->io_bio, bh->b_page, bh->b_size, bh_offset(bh)); if (ret != bh->b_size) -- 2.4.0.rc2.1.g3d6bc9a ^ permalink raw reply related [flat|nested] 27+ messages in thread
* [PATCH v2] Support for write stream IDs @ 2015-05-05 20:02 Jens Axboe 2015-05-05 20:07 ` Christoph Hellwig 2015-05-05 20:51 ` Jeff Moyer 0 siblings, 2 replies; 27+ messages in thread From: Jens Axboe @ 2015-05-05 20:02 UTC (permalink / raw) To: linux-kernel, linux-fsdevel; +Cc: adilger, david Hi, Changes since the last posting: - Added a specific per-file fadvise setting. POSIX_FADV_STREAMID sets the inode and file stream ID, POSIX_FADV_STREAMID_FILE sets just the file stream ID. - Addressed review comments. I've since run some testing with write streams. Test case was a RocksDB overwrite benchmark, using 3 billion keys of 400B in size (numbers set use the full size of the device). WAL/LOG was assigned to stream 1, and each RocksDB compaction level used a separate stream. With streams enabled, user write to device writes (write amplification) was at 2.33. Without streams, the write amplification was 3.05. That is roughly 20% less written NAND, and the streams test subsequently also had 20% higher throughput. Unless there are any grave concerns here, I'd like to merge this for 4.2. -- Jens Axboe ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v2] Support for write stream IDs 2015-05-05 20:02 [PATCH v2] Support " Jens Axboe @ 2015-05-05 20:07 ` Christoph Hellwig 2015-05-05 20:12 ` Jens Axboe 2015-05-05 20:51 ` Jeff Moyer 1 sibling, 1 reply; 27+ messages in thread From: Christoph Hellwig @ 2015-05-05 20:07 UTC (permalink / raw) To: Jens Axboe; +Cc: linux-kernel, linux-fsdevel, adilger, david On Tue, May 05, 2015 at 02:02:54PM -0600, Jens Axboe wrote: > Unless there are any grave concerns here, I'd like to merge this for > 4.2. Without a driver implementing the feature I don't see any reason to even consider merging it. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v2] Support for write stream IDs 2015-05-05 20:07 ` Christoph Hellwig @ 2015-05-05 20:12 ` Jens Axboe 2015-05-05 20:20 ` Christoph Hellwig 0 siblings, 1 reply; 27+ messages in thread From: Jens Axboe @ 2015-05-05 20:12 UTC (permalink / raw) To: Christoph Hellwig; +Cc: linux-kernel, linux-fsdevel, adilger, david On 05/05/2015 02:07 PM, Christoph Hellwig wrote: > On Tue, May 05, 2015 at 02:02:54PM -0600, Jens Axboe wrote: >> Unless there are any grave concerns here, I'd like to merge this for >> 4.2. > > Without a driver implementing the feature I don't see any reason to even > consider merging it. We can't merge the NVMe bits until the proposal is included/finalized. But this is a problem. I don't want to add this to the Facebook kernel until we know the API is stable, while I have no problem adding experimental NVMe changes since those can be easily updated without impacting applications. The latter is not true for the user interface. -- Jens Axboe ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v2] Support for write stream IDs 2015-05-05 20:12 ` Jens Axboe @ 2015-05-05 20:20 ` Christoph Hellwig 2015-05-05 20:31 ` Jens Axboe 0 siblings, 1 reply; 27+ messages in thread From: Christoph Hellwig @ 2015-05-05 20:20 UTC (permalink / raw) To: Jens Axboe; +Cc: Christoph Hellwig, linux-kernel, linux-fsdevel, adilger, david On Tue, May 05, 2015 at 02:12:01PM -0600, Jens Axboe wrote: > We can't merge the NVMe bits until the proposal is included/finalized. But > this is a problem. I don't want to add this to the Facebook kernel until we > know the API is stable, while I have no problem adding experimental NVMe > changes since those can be easily updated without impacting applications. > The latter is not true for the user interface. They might never be finalized, and even if they are mere mortals might never get this hardware. Merging infrastructure without any users is a bad idea in general, and merging infrastructure with no user that exposes untestable user interface and bloats core data structures is even worse. I don't think this has any merit at all at this point. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v2] Support for write stream IDs 2015-05-05 20:20 ` Christoph Hellwig @ 2015-05-05 20:31 ` Jens Axboe 2015-05-05 20:43 ` Christoph Hellwig 2015-05-05 20:50 ` Christoph Hellwig 0 siblings, 2 replies; 27+ messages in thread From: Jens Axboe @ 2015-05-05 20:31 UTC (permalink / raw) To: Christoph Hellwig; +Cc: linux-kernel, linux-fsdevel, adilger, david On 05/05/2015 02:20 PM, Christoph Hellwig wrote: > On Tue, May 05, 2015 at 02:12:01PM -0600, Jens Axboe wrote: >> We can't merge the NVMe bits until the proposal is included/finalized. But >> this is a problem. I don't want to add this to the Facebook kernel until we >> know the API is stable, while I have no problem adding experimental NVMe >> changes since those can be easily updated without impacting applications. >> The latter is not true for the user interface. > > They might never be finalized, and even if they are mere mortals might > never get this hardware. The likelihood of not getting streams support is minuscule. The benefits are just too large to ignore. It might not look like the current proposal, but it will get there. It's not like this is just one NVMe member wanting to push this, the only disagreement is whether this is going to be implemented as direct write tagging or through queue pairs. Even outside of that, there are use cases for caching that need not have hardware assist. > Merging infrastructure without any users is a > bad idea in general, and merging infrastructure with no user that > exposes untestable user interface and bloats core data structures is > even worse. I don't think this has any merit at all at this point. There is a user, we are using it. And there's no data structure bloating, both the file and inode additions are filling existing holes. I'll strongly disagree with your statement that it has no merit at all. In fact, the merit is quite clear. -- Jens Axboe ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v2] Support for write stream IDs 2015-05-05 20:31 ` Jens Axboe @ 2015-05-05 20:43 ` Christoph Hellwig 2015-05-05 20:50 ` Christoph Hellwig 1 sibling, 0 replies; 27+ messages in thread From: Christoph Hellwig @ 2015-05-05 20:43 UTC (permalink / raw) To: Jens Axboe; +Cc: Christoph Hellwig, linux-kernel, linux-fsdevel, adilger, david On Tue, May 05, 2015 at 02:31:34PM -0600, Jens Axboe wrote: > Even outside of that, there are use cases for caching that need not have > hardware assist. Could, would. But in the meantime you'd adding dead wood kernel code, and even worse user interfaces. > >Merging infrastructure without any users is a > >bad idea in general, and merging infrastructure with no user that > >exposes untestable user interface and bloats core data structures is > >even worse. I don't think this has any merit at all at this point. > > There is a user, we are using it. And there's no data structure bloating, > both the file and inode additions are filling existing holes. I'll strongly > disagree with your statement that it has no merit at all. In fact, the merit > is quite clear. Make sure everyone an get a nvme card from samsung supproting their hack, and add it to the nvme driver keyed of a PCI ID check and we can start talking about it. We're not going to add hacks only a specific big corporation can use. Btw, the user interfacess really need man page additions and go past linux-man and linux-api even in that case. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v2] Support for write stream IDs 2015-05-05 20:31 ` Jens Axboe 2015-05-05 20:43 ` Christoph Hellwig @ 2015-05-05 20:50 ` Christoph Hellwig 1 sibling, 0 replies; 27+ messages in thread From: Christoph Hellwig @ 2015-05-05 20:50 UTC (permalink / raw) To: Jens Axboe; +Cc: Christoph Hellwig, linux-kernel, linux-fsdevel, adilger, david On Tue, May 05, 2015 at 02:31:34PM -0600, Jens Axboe wrote: > There is a user, we are using it. And there's no data structure bloating, > both the file and inode additions are filling existing holes. FYI, they do increase the size on any 32-bit architecture, which is where it hurts most. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v2] Support for write stream IDs 2015-05-05 20:02 [PATCH v2] Support " Jens Axboe 2015-05-05 20:07 ` Christoph Hellwig @ 2015-05-05 20:51 ` Jeff Moyer 2015-05-05 21:05 ` Jens Axboe 1 sibling, 1 reply; 27+ messages in thread From: Jeff Moyer @ 2015-05-05 20:51 UTC (permalink / raw) To: Jens Axboe; +Cc: linux-kernel, linux-fsdevel, adilger, david Jens Axboe <axboe@fb.com> writes: > Hi, > > Changes since the last posting: > > - Added a specific per-file fadvise setting. POSIX_FADV_STREAMID sets > the inode and file stream ID, POSIX_FADV_STREAMID_FILE sets just the > file stream ID. > > - Addressed review comments. > > I've since run some testing with write streams. Test case was a RocksDB > overwrite benchmark, using 3 billion keys of 400B in size (numbers set > use the full size of the device). WAL/LOG was assigned to stream 1, and > each RocksDB compaction level used a separate stream. With streams > enabled, user write to device writes (write amplification) was at 2.33. > Without streams, the write amplification was 3.05. That is roughly 20% > less written NAND, and the streams test subsequently also had 20% > higher throughput. > > Unless there are any grave concerns here, I'd like to merge this for > 4.2. I have a few concerns. You've added POSIX_FADV_* definitions that do not exist in the SUS/POSIX spec. Do we care? We (poor reviewers) still have no idea what the driver side of this will look like. Do streams need to be opened and closed? Is that going to be handled transparently by the kernel, or exposed to userspace? If in the kernel, where in the kernel? You've also added a user-visible api without cc-ing linux-api. My preference would be to wait for the spec to finalize before pushing in changes that depend on it. Cheers, Jeff ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v2] Support for write stream IDs 2015-05-05 20:51 ` Jeff Moyer @ 2015-05-05 21:05 ` Jens Axboe 2015-05-05 21:39 ` Martin K. Petersen 0 siblings, 1 reply; 27+ messages in thread From: Jens Axboe @ 2015-05-05 21:05 UTC (permalink / raw) To: Jeff Moyer; +Cc: linux-kernel, linux-fsdevel, adilger, david On 05/05/2015 02:51 PM, Jeff Moyer wrote: > Jens Axboe <axboe@fb.com> writes: > >> Hi, >> >> Changes since the last posting: >> >> - Added a specific per-file fadvise setting. POSIX_FADV_STREAMID sets >> the inode and file stream ID, POSIX_FADV_STREAMID_FILE sets just the >> file stream ID. >> >> - Addressed review comments. >> >> I've since run some testing with write streams. Test case was a RocksDB >> overwrite benchmark, using 3 billion keys of 400B in size (numbers set >> use the full size of the device). WAL/LOG was assigned to stream 1, and >> each RocksDB compaction level used a separate stream. With streams >> enabled, user write to device writes (write amplification) was at 2.33. >> Without streams, the write amplification was 3.05. That is roughly 20% >> less written NAND, and the streams test subsequently also had 20% >> higher throughput. >> >> Unless there are any grave concerns here, I'd like to merge this for >> 4.2. > > I have a few concerns. You've added POSIX_FADV_* definitions that do > not exist in the SUS/POSIX spec. Do we care? We (poor reviewers) still > have no idea what the driver side of this will look like. Do streams > need to be opened and closed? Is that going to be handled transparently > by the kernel, or exposed to userspace? If in the kernel, where in the > kernel? You've also added a user-visible api without cc-ing linux-api. Whether this should be fadvise, fcntl, or something else, that's the primary review concern. The driver side depends on the driver! The kernel patches deal only with ensuring that the stream information gets passed down. If the device requires explicit stream open/close actions, then that needs to be handled on the side. There's no reason to include the kernel in that, the kernel doesn't care. > My preference would be to wait for the spec to finalize before pushing > in changes that depend on it. I think you are mixing up the write streams with the NVMe proposal. The two aren't necessarily connected, and the kernel parts don't really care what the NVMe proposal ends up looking like. It's just an interface to assign an ID, and a transport for passing that ID down to a driver. -- Jens Axboe ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v2] Support for write stream IDs 2015-05-05 21:05 ` Jens Axboe @ 2015-05-05 21:39 ` Martin K. Petersen 2015-05-05 21:48 ` Jens Axboe 0 siblings, 1 reply; 27+ messages in thread From: Martin K. Petersen @ 2015-05-05 21:39 UTC (permalink / raw) To: Jens Axboe; +Cc: Jeff Moyer, linux-kernel, linux-fsdevel, adilger, david >>>>> "Jens" == Jens Axboe <axboe@fb.com> writes: Jens> The kernel patches deal only with ensuring that the stream Jens> information gets passed down. If the device requires explicit Jens> stream open/close actions, then that needs to be handled on the Jens> side. Well, I'm deeply concerned about that "on the side" thing. I know you're trying to make a shortcut by only caring about the transport mechanism and deferring the whole programming model piece of the equation. I think the latter is hugely important, though. And if we defer the programming model question then the chances are that we'll be stuck with something lame in the standards. The storage vendors appear to think that a handful of stream ids ought to be enough for anybody. And that IDs are a scarce resource that they'll hand out for a limited time if you ask nicely. I think that model is completely broken and also of limited use for non-NVM types of storage. Sadly it's the same people who are driving the NVMe and SCSI proposals so they are similar. My concern is that by adding just enough plumbing to the kernel to allow the current standards proposals to work then we'll be stuck with them forever. And I think that would be very unfortunate. -- Martin K. Petersen Oracle Linux Engineering ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v2] Support for write stream IDs 2015-05-05 21:39 ` Martin K. Petersen @ 2015-05-05 21:48 ` Jens Axboe 2015-05-05 22:09 ` Martin K. Petersen 0 siblings, 1 reply; 27+ messages in thread From: Jens Axboe @ 2015-05-05 21:48 UTC (permalink / raw) To: Martin K. Petersen Cc: Jeff Moyer, linux-kernel, linux-fsdevel, adilger, david On 05/05/2015 03:39 PM, Martin K. Petersen wrote: >>>>>> "Jens" == Jens Axboe <axboe@fb.com> writes: > > Jens> The kernel patches deal only with ensuring that the stream > Jens> information gets passed down. If the device requires explicit > Jens> stream open/close actions, then that needs to be handled on the > Jens> side. > > Well, I'm deeply concerned about that "on the side" thing. I know you're > trying to make a shortcut by only caring about the transport mechanism > and deferring the whole programming model piece of the equation. I'm not trying to make a shortcut. I deliberately do not want to make ID generation/assignment part of the kernel. There's no reason that can't exist outside of the kernel, in a libstreamid or similar. In fact I already wrote that for the NVMe bits, it could be extended if other types of devices require open/close type actions for streams. > I think the latter is hugely important, though. And if we defer the > programming model question then the chances are that we'll be stuck with > something lame in the standards. > > The storage vendors appear to think that a handful of stream ids ought > to be enough for anybody. And that IDs are a scarce resource that > they'll hand out for a limited time if you ask nicely. I agree, that's an issue. And I just don't understand why that is. It's not like containerized or virtualized setups are esoteric these days. A handful of IDs might be enough for a single application type setup, but not for anything more elaborate. > I think that model is completely broken and also of limited use for > non-NVM types of storage. Sadly it's the same people who are driving the > NVMe and SCSI proposals so they are similar. It's completely broken for NVM as well. Most people need to farm out the use of such devices, as no single app can drive it to its limit anyways. > My concern is that by adding just enough plumbing to the kernel to allow > the current standards proposals to work then we'll be stuck with them > forever. And I think that would be very unfortunate. The current API doesn't have any real limits (it'll work from 1..MAX_UINT), and the transport part handles 255 streams at the moment. The latter can be easily extended, we can just steal a few more bits. Making it 1023 would be a one liner. -- Jens Axboe ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v2] Support for write stream IDs 2015-05-05 21:48 ` Jens Axboe @ 2015-05-05 22:09 ` Martin K. Petersen 2015-05-06 14:26 ` Peter Zijlstra ` (2 more replies) 0 siblings, 3 replies; 27+ messages in thread From: Martin K. Petersen @ 2015-05-05 22:09 UTC (permalink / raw) To: Jens Axboe Cc: Martin K. Petersen, Jeff Moyer, linux-kernel, linux-fsdevel, adilger, david >>>>> "Jens" == Jens Axboe <axboe@fb.com> writes: Jens> I'm not trying to make a shortcut. I deliberately do not want to Jens> make ID generation/assignment part of the kernel. There's no Jens> reason that can't exist outside of the kernel, in a libstreamid or Jens> similar. That just perpetuates the broken model, though. Why wouldn't we want to have stream ids readily available inside the kernel to tag journals, filesystem metadata, data migration, who knows what? Having storage micromanage the stream IDs is a non-starter. And it'll also break things like software RAID, btrfs, LVM, anything that involves multiple devices. ID X on first RAID disk then needs to be mapped to ID Y on the second, etc. The only sensible solution is for the kernel to manage the stream IDs. And for them to be plentiful. The storage device is free to ignore them, do LRU or whatever it pleases to manage them if it has an internal limit on number of open streams, etc. Jens> The current API doesn't have any real limits (it'll work from Jens> 1..MAX_UINT), and the transport part handles 255 streams at the Jens> moment. The latter can be easily extended, we can just steal a few Jens> more bits. Making it 1023 would be a one liner. I'm not so worried about the implementation. I'm more worried about it being conducive to the broken proposal that's on the table. -- Martin K. Petersen Oracle Linux Engineering ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v2] Support for write stream IDs 2015-05-05 22:09 ` Martin K. Petersen @ 2015-05-06 14:26 ` Peter Zijlstra 2015-05-06 17:25 ` Jens Axboe 2015-05-06 16:50 ` Boaz Harrosh 2015-05-06 17:21 ` Jens Axboe 2 siblings, 1 reply; 27+ messages in thread From: Peter Zijlstra @ 2015-05-06 14:26 UTC (permalink / raw) To: Martin K. Petersen Cc: Jens Axboe, Jeff Moyer, linux-kernel, linux-fsdevel, adilger, david On Tue, May 05, 2015 at 06:09:18PM -0400, Martin K. Petersen wrote: > >>>>> "Jens" == Jens Axboe <axboe@fb.com> writes: > > Jens> I'm not trying to make a shortcut. I deliberately do not want to > Jens> make ID generation/assignment part of the kernel. There's no > Jens> reason that can't exist outside of the kernel, in a libstreamid or > Jens> similar. > > That just perpetuates the broken model, though. Why wouldn't we want to > have stream ids readily available inside the kernel to tag journals, > filesystem metadata, data migration, who knows what? > The only sensible solution is for the kernel to manage the stream > IDs. And for them to be plentiful. The storage device is free to ignore > them, do LRU or whatever it pleases to manage them if it has an internal > limit on number of open streams, etc. I agree; one of the primary tasks of an operating system kernel is arbitrage and control of hardware resources. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v2] Support for write stream IDs 2015-05-06 14:26 ` Peter Zijlstra @ 2015-05-06 17:25 ` Jens Axboe 0 siblings, 0 replies; 27+ messages in thread From: Jens Axboe @ 2015-05-06 17:25 UTC (permalink / raw) To: Peter Zijlstra, Martin K. Petersen Cc: Jeff Moyer, linux-kernel, linux-fsdevel, adilger, david On 05/06/2015 08:26 AM, Peter Zijlstra wrote: > On Tue, May 05, 2015 at 06:09:18PM -0400, Martin K. Petersen wrote: >>>>>>> "Jens" == Jens Axboe <axboe@fb.com> writes: >> >> Jens> I'm not trying to make a shortcut. I deliberately do not want to >> Jens> make ID generation/assignment part of the kernel. There's no >> Jens> reason that can't exist outside of the kernel, in a libstreamid or >> Jens> similar. >> >> That just perpetuates the broken model, though. Why wouldn't we want to >> have stream ids readily available inside the kernel to tag journals, >> filesystem metadata, data migration, who knows what? > >> The only sensible solution is for the kernel to manage the stream >> IDs. And for them to be plentiful. The storage device is free to ignore >> them, do LRU or whatever it pleases to manage them if it has an internal >> limit on number of open streams, etc. > > I agree; one of the primary tasks of an operating system kernel is > arbitrage and control of hardware resources. Sure, that's what the kernel does, but the problem here includes that the IDs space and generation is currently being done by the device. So it's not a simple as divvying out an open space of resources. -- Jens Axboe ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v2] Support for write stream IDs 2015-05-05 22:09 ` Martin K. Petersen 2015-05-06 14:26 ` Peter Zijlstra @ 2015-05-06 16:50 ` Boaz Harrosh 2015-05-06 17:21 ` Jens Axboe 2 siblings, 0 replies; 27+ messages in thread From: Boaz Harrosh @ 2015-05-06 16:50 UTC (permalink / raw) To: Martin K. Petersen, Jens Axboe Cc: Jeff Moyer, linux-kernel, linux-fsdevel, adilger, david On 05/06/2015 01:09 AM, Martin K. Petersen wrote: >>>>>> "Jens" == Jens Axboe <axboe@fb.com> writes: <> > > The only sensible solution is for the kernel to manage the stream > IDs. And for them to be plentiful. The storage device is free to ignore > them, do LRU or whatever it pleases to manage them if it has an internal > limit on number of open streams, etc. > Sometimes you do not have control over the "storage device" firmware and/or it is already too late. But in such a case the LLD of the device can do the state management (open/close) and translation (LRU). If there are a lot of broken devices with the same small size that need an LRU and/or state, bunch of LLDs can reuse a common library. But hey yes I'm with Martin here. It sounds like a filesystem thing and not an application thing. >From the application side we already have the O_TMP hint and other fadvise hits, but it is more the filesystem policy of what to do with this. Also current proposed solution for application is kind of a layering violation. I set an Id at the filesystem level API: Open a file, set an ID. But I acquire the ID from the block device under the FS. This will never map well. In fact it calls for only a very hard-coded hardware specific application that can use this. Thanks Boaz ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v2] Support for write stream IDs 2015-05-05 22:09 ` Martin K. Petersen 2015-05-06 14:26 ` Peter Zijlstra 2015-05-06 16:50 ` Boaz Harrosh @ 2015-05-06 17:21 ` Jens Axboe 2015-05-07 19:19 ` Martin K. Petersen 2 siblings, 1 reply; 27+ messages in thread From: Jens Axboe @ 2015-05-06 17:21 UTC (permalink / raw) To: Martin K. Petersen Cc: Jeff Moyer, linux-kernel, linux-fsdevel, adilger, david On 05/05/2015 04:09 PM, Martin K. Petersen wrote: >>>>>> "Jens" == Jens Axboe <axboe@fb.com> writes: > > Jens> I'm not trying to make a shortcut. I deliberately do not want to > Jens> make ID generation/assignment part of the kernel. There's no > Jens> reason that can't exist outside of the kernel, in a libstreamid or > Jens> similar. > > That just perpetuates the broken model, though. Why wouldn't we want to > have stream ids readily available inside the kernel to tag journals, > filesystem metadata, data migration, who knows what? > > Having storage micromanage the stream IDs is a non-starter. And it'll > also break things like software RAID, btrfs, LVM, anything that involves > multiple devices. ID X on first RAID disk then needs to be mapped to ID > Y on the second, etc. > > The only sensible solution is for the kernel to manage the stream > IDs. And for them to be plentiful. The storage device is free to ignore > them, do LRU or whatever it pleases to manage them if it has an internal > limit on number of open streams, etc. OK, that does make some sense. That would mean putting the ID management in the kernel, where devices would register a handler to be a part of this process. That would need to include mapping between user/kernel stream IDx and device stream IDx, since they would not necessarily be the same. This assumes we will have devices that manage their own streams, which seems to be the safe bet (that's what is currently out there). That would work for stacked/btrfs setups too. This wont solve the problem of devices having too few streams. But it'll work regardless, we'll just have to push them separately to do that. It's not an easy problem for them either, resource constraints on the device side could exclude supporting as many streams as we would ideally want. > Jens> The current API doesn't have any real limits (it'll work from > Jens> 1..MAX_UINT), and the transport part handles 255 streams at the > Jens> moment. The latter can be easily extended, we can just steal a few > Jens> more bits. Making it 1023 would be a one liner. > > I'm not so worried about the implementation. I'm more worried about it > being conducive to the broken proposal that's on the table. In some ways I get it, you have to start somewhere. The current proposal is useful for _some_ cases, it's not great for everything. As long as it can be expanded to support as many streams as we would want, then it would work. It's (again) a bit of a chicken and egg problem. We need to make some progress, or the whole thing is going to go away. And I think that'd be a shame, since there's definitely merit to passing these lifetime hints to the device. -- Jens Axboe ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v2] Support for write stream IDs 2015-05-06 17:21 ` Jens Axboe @ 2015-05-07 19:19 ` Martin K. Petersen 2015-05-08 18:48 ` Jens Axboe 0 siblings, 1 reply; 27+ messages in thread From: Martin K. Petersen @ 2015-05-07 19:19 UTC (permalink / raw) To: Jens Axboe Cc: Martin K. Petersen, Jeff Moyer, linux-kernel, linux-fsdevel, adilger, david >>>>> "Jens" == Jens Axboe <axboe@fb.com> writes: Jens> This wont solve the problem of devices having too few streams. But Jens> it'll work regardless, we'll just have to push them separately to Jens> do that. It's not an easy problem for them either, resource Jens> constraints on the device side could exclude supporting as many Jens> streams as we would ideally want. But they already have to manage *every* other resource that way: Read cache, write cache, flash channels, open zones on ZAC/ZBC. If they run out of memory and have to internally close one stream context to open another that's their business. If the concurrent ID count is low, performance their particular widgets is going to suck for some applications and people will avoid them. Boo hoo. I'm super happy the SSD industry (well, the market) came to its senses and abolished all the outrageous demands put on the I/O stack to overcome erase block size and write amplification issues. Now all that's a solved problem and we can move on. Next problem child was the host managed zoned disk madness. Yet another device implementation headache that suddenly requires us to reinvent filesystems and the entire I/O stack. Next in the pipeline is the stream ID stuff. Which once again puts the burden on us to overcome device implementation issues and misunderstands how operating systems work. There are two fundamental problems: - The standards are developed by device vendors with little to no input from the OS vendors - The standards proposals are written, edited, and declared complete before anybody actually tries to implement them That's how we end up with all these lame duck spec extensions that are device implementation-specific and impossible to use generically. I scream as loudly as I can. But I am but one voice in a sea of device vendors. And unless we Linux developers start pushing back in unison we'll end up in a quagmire of epic proportions. Jens> In some ways I get it, you have to start somewhere. The current Jens> proposal is useful for _some_ cases, it's not great for Jens> everything. As long as it can be expanded to support as many Jens> streams as we would want, then it would work. It's (again) a bit Jens> of a chicken and egg problem. We need to make some progress, or Jens> the whole thing is going to go away. And I think that'd be a Jens> shame, since there's definitely merit to passing these lifetime Jens> hints to the device. The problem is that once these things end up in the standards, the CDB fields are gone. And if the clunky intermediate stuff is "good enough" then the motivation to fix things properly goes away. There are many, many reasons why stream IDs are a good thing. Above and beyond what the current proposals want. The notion of tagging is a much better abstraction than bootiness and guessing a percentage for how sequential future accesses might be. It's a simple, clean interface that the device--regardless of media type and implementation--can benefit from. -- Martin K. Petersen Oracle Linux Engineering ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v2] Support for write stream IDs 2015-05-07 19:19 ` Martin K. Petersen @ 2015-05-08 18:48 ` Jens Axboe 2015-05-12 2:50 ` Martin K. Petersen 0 siblings, 1 reply; 27+ messages in thread From: Jens Axboe @ 2015-05-08 18:48 UTC (permalink / raw) To: Martin K. Petersen Cc: Jeff Moyer, linux-kernel, linux-fsdevel, adilger, david On 05/07/2015 01:19 PM, Martin K. Petersen wrote: >>>>>> "Jens" == Jens Axboe <axboe@fb.com> writes: > > Jens> This wont solve the problem of devices having too few streams. But > Jens> it'll work regardless, we'll just have to push them separately to > Jens> do that. It's not an easy problem for them either, resource > Jens> constraints on the device side could exclude supporting as many > Jens> streams as we would ideally want. > > But they already have to manage *every* other resource that way: Read > cache, write cache, flash channels, open zones on ZAC/ZBC. If they run > out of memory and have to internally close one stream context to open > another that's their business. If the concurrent ID count is low, > performance their particular widgets is going to suck for some > applications and people will avoid them. Boo hoo. There are actual technical challenges on the device side that sometimes interferes. Say you currently implemented your flash device as a log based structure. The only way to even support streams is to have more logs you can append to. So perhaps then you can't support streams, boo hoo for them. Maybe you don't have a fixed log, you can write anywhere. But you can only have so many erase blocks open at one time. Not a huge concern, you have to manage that and open/close streams as you need to. That's basic resource management. But even if you do that, erase blocks are not small these days, even for big devices there are only so many of them (we're talking thousands in total, not millions). There's a very real lower upper bound there on what can be supported. It's easy enough being on the device side and punting everything to the OS. Why wouldn't you? Then it's out of your hair. At the same time, on the other side, there's also an OS tendency to whine and want everything, and helpfully all the time. The reality is that we can't demand that devices support thousands of streams. It'd be nice if they did and we didn't have to care at all, but realistically, that is not going to happen nor is it a completely sane demand. And while that may not be perfect, it's still a worthwhile improvement and wont preclude a hopefully rosier future where we have more streams. Lets say we have 8 streams now. We need some sort of policy to multiplex those streams. That's the current challenge. I can add the kernel managed streams, and I'll do that. > I'm super happy the SSD industry (well, the market) came to its senses > and abolished all the outrageous demands put on the I/O stack to > overcome erase block size and write amplification issues. Now all that's > a solved problem and we can move on. I would not say write amplification is a "solved issue", in fact we're attempting to improve it with this :-). But I know what you mean, and yes, that was a sad situation. I don't think it's a fair comparison, though. > Next problem child was the host managed zoned disk madness. Yet another > device implementation headache that suddenly requires us to reinvent > filesystems and the entire I/O stack. > > Next in the pipeline is the stream ID stuff. Which once again puts the > burden on us to overcome device implementation issues and misunderstands > how operating systems work. Again, I don't think that's a fair comparison. Write streams are useful. And adding support for write streams, even in a limited fashion, can be directly extended when/if more stream IDs are available. The only change would be in the management policy. The basic premise of open stream, use stream, close stream - those would be the same. I get that you don't like that we need to manually open and close streams, but honestly, those are useful hints to the device, even if they don't need to do anything about them explicitly. > There are two fundamental problems: > > - The standards are developed by device vendors with little to no input > from the OS vendors > > - The standards proposals are written, edited, and declared complete > before anybody actually tries to implement them > > That's how we end up with all these lame duck spec extensions that are > device implementation-specific and impossible to use generically. The write streams proposal was already approved by t10... > There are many, many reasons why stream IDs are a good thing. Above and > beyond what the current proposals want. The notion of tagging is a much > better abstraction than bootiness and guessing a percentage for how > sequential future accesses might be. It's a simple, clean interface that > the device--regardless of media type and implementation--can benefit > from. Exactly, which is why the streams is the first hinting mechanism that I actually think can work. The previous stuff has been utter crap, and I've always ignored it for exactly that reason. If we want/need policy on top of the streams, we can implement that independently. -- Jens Axboe ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v2] Support for write stream IDs 2015-05-08 18:48 ` Jens Axboe @ 2015-05-12 2:50 ` Martin K. Petersen 0 siblings, 0 replies; 27+ messages in thread From: Martin K. Petersen @ 2015-05-12 2:50 UTC (permalink / raw) To: Jens Axboe Cc: Martin K. Petersen, Jeff Moyer, linux-kernel, linux-fsdevel, adilger, david >>>>> "Jens" == Jens Axboe <axboe@fb.com> writes: Jens, Jens> There are actual technical challenges on the device side that Jens> sometimes interferes. [...] Right now we use the same protocol to speak to USB keys and million dollar storage arrays. That's because the protocol was designed to be abstract and completely device agnostic. What's happening with flash devices and SMR is that all of a sudden device implementation challenges are being addressed by putting them in the protocol and punting them to the OS. That's an obvious and cost-saving approach for a device vendor to take. But the world would be a different place if we were still dealing with MFM, RLL, C/H/S addressing and other implementation-specific horrors of the past. And if that approach had continued we would explicitly have to deal with erase blocks on USB sticks and manually drive RAID logic inside disk arrays. But thankfully, with a few exceptions, we didn't go there. My beef with the current stream ID stuff and ZAC/ZBC is that those are steps in the wrong direction in that they are both exclusively focused on addressing implementation challenges specific to certain kinds of devices. The notion of letting the OS tag things as belonging together or being independent is a useful concept that benefits *any* kind of device. Those tags can easily be mapped to resource streams in a flash device or a particular zone cache segment on an SMR drive or in an array. I would just like the tag to be completely arbitrary so we can manage it on behalf of all applications and devices. That puts the burden on the device to manage the OS tag to internal resource mapping but I think that's a small price to pay to have a concept that works for all classes of devices, software RAID, etc. This does not in any way preclude the device communicating "I'd prefer if you only kept 8 streams going/8 zones open" like we do with all the other device characteristics. My gripe is that the programming model is being forcefully changed so we now have to get a permit before submitting an I/O. And very aggressively clean up since the permits are being envisioned as super scarce. Jens> The reality is that we can't demand that devices support thousands Jens> of streams. Why not? It's just a number. Tracking a large number of independent streams hasn't been a problem for storage arrays. Nobody says a 32-bit ID requires you to concurrently track bazillions of streams. Pick a reasonable number of live contexts given your device's actual resources. Jens> The write streams proposal was already approved by t10... Nope. It's still being discussed. -- Martin K. Petersen Oracle Linux Engineering ^ permalink raw reply [flat|nested] 27+ messages in thread
end of thread, other threads:[~2015-05-12 2:50 UTC | newest] Thread overview: 27+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2015-04-18 20:03 [PATCH v2] Support for write stream IDs Jens Axboe 2015-04-18 20:03 ` [PATCH 1/7] block: add support for carrying a stream ID in a bio Jens Axboe 2015-04-18 20:03 ` [PATCH 2/7] Add support for per-file/inode stream ID Jens Axboe 2015-04-18 20:03 ` [PATCH 3/7] direct-io: add support for write stream IDs Jens Axboe 2015-04-18 20:03 ` [PATCH 4/7] Add stream ID support for buffered mpage/__block_write_full_page() Jens Axboe 2015-04-18 20:03 ` [PATCH 5/7] btrfs: add support for write stream IDs Jens Axboe 2015-04-18 20:03 ` [PATCH 6/7] xfs: add support for buffered writeback stream ID Jens Axboe 2015-04-18 20:03 ` [PATCH 7/7] ext4: add support for write stream IDs Jens Axboe -- strict thread matches above, loose matches on Subject: below -- 2015-05-05 20:02 [PATCH v2] Support " Jens Axboe 2015-05-05 20:07 ` Christoph Hellwig 2015-05-05 20:12 ` Jens Axboe 2015-05-05 20:20 ` Christoph Hellwig 2015-05-05 20:31 ` Jens Axboe 2015-05-05 20:43 ` Christoph Hellwig 2015-05-05 20:50 ` Christoph Hellwig 2015-05-05 20:51 ` Jeff Moyer 2015-05-05 21:05 ` Jens Axboe 2015-05-05 21:39 ` Martin K. Petersen 2015-05-05 21:48 ` Jens Axboe 2015-05-05 22:09 ` Martin K. Petersen 2015-05-06 14:26 ` Peter Zijlstra 2015-05-06 17:25 ` Jens Axboe 2015-05-06 16:50 ` Boaz Harrosh 2015-05-06 17:21 ` Jens Axboe 2015-05-07 19:19 ` Martin K. Petersen 2015-05-08 18:48 ` Jens Axboe 2015-05-12 2:50 ` Martin K. Petersen
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).