* [for 4.5] fix regression for large NVMe user command
@ 2016-03-02 17:07 Christoph Hellwig
2016-03-02 17:07 ` [PATCH 1/5] nvme: also set BLK_MQ_F_SHOULD_MERGE for the admin queue Christoph Hellwig
` (4 more replies)
0 siblings, 5 replies; 22+ messages in thread
From: Christoph Hellwig @ 2016-03-02 17:07 UTC (permalink / raw)
Hi Jens, hi Keith,
Jeff reported and issue to me where my change of the NVMe userspace
passthrough ioctls to the generic block code caused a regression for a
firmward dump tool that uses 1MB+ vendor specific userspace command
to dump a large buffer.
While (re)implementing support for multiple bios in blk_rq_map_user
I also ran into lots of existing NVMe bugs that limit I/O size either
in general or specific to some workloads:
(1) we never set BLK_MQ_F_SHOULD_MERGE for admin commands. This doesn't
really affect us here, but we should be consistant with the I/O queue.
(2) we never set any queue limits for the admin queue, just leaving
the defaults in place. Besides causing low I/O limits this also means
that flags like the virt boundary weren't set and we might pass on
incorrectly formed SGLs to the driver for admin passthrough.
(3) because the max_segments field in the queue limits structure is an
unsigned short we get an integer truncation that would cause all
NVMe controller that don't set a low MDTS value to actually just get
a single segment per request.
(4) last but not least we were applying the low FS request limits to all
driver private request types, which doesn't make sense.
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 1/5] nvme: also set BLK_MQ_F_SHOULD_MERGE for the admin queue
2016-03-02 17:07 [for 4.5] fix regression for large NVMe user command Christoph Hellwig
@ 2016-03-02 17:07 ` Christoph Hellwig
2016-03-02 17:35 ` Sagi Grimberg
2016-03-02 20:01 ` [PATCH 1/5] nvme: also set BLK_MQ_F_SHOULD_MERGE for the admin queue Jens Axboe
2016-03-02 17:07 ` [PATCH 2/5] nvme: set queue limits " Christoph Hellwig
` (3 subsequent siblings)
4 siblings, 2 replies; 22+ messages in thread
From: Christoph Hellwig @ 2016-03-02 17:07 UTC (permalink / raw)
There is no reason to treat the admin queue different in terms of
request merging.
Signed-off-by: Christoph Hellwig <hch at lst.de>
Reported-by: Jeff Lien <Jeff.Lien at hgst.com>
Tested-by: Jeff Lien <Jeff.Lien at hgst.com>
---
drivers/nvme/host/pci.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 680f578..19be56a 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1289,6 +1289,7 @@ static int nvme_alloc_admin_tags(struct nvme_dev *dev)
dev->admin_tagset.timeout = ADMIN_TIMEOUT;
dev->admin_tagset.numa_node = dev_to_node(dev->dev);
dev->admin_tagset.cmd_size = nvme_cmd_size(dev);
+ dev->admin_tagset.flags = BLK_MQ_F_SHOULD_MERGE;
dev->admin_tagset.driver_data = dev;
if (blk_mq_alloc_tag_set(&dev->admin_tagset))
--
2.1.4
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [PATCH 2/5] nvme: set queue limits for the admin queue
2016-03-02 17:07 [for 4.5] fix regression for large NVMe user command Christoph Hellwig
2016-03-02 17:07 ` [PATCH 1/5] nvme: also set BLK_MQ_F_SHOULD_MERGE for the admin queue Christoph Hellwig
@ 2016-03-02 17:07 ` Christoph Hellwig
2016-03-02 19:10 ` Keith Busch
2016-03-02 17:07 ` [PATCH 3/5] nvme: fix max_segments integer truncation Christoph Hellwig
` (2 subsequent siblings)
4 siblings, 1 reply; 22+ messages in thread
From: Christoph Hellwig @ 2016-03-02 17:07 UTC (permalink / raw)
Factor out a helper to set all the device specific queue limits and apply
them to the admin queue in addition to the I/O queues. Without this the
command size on the admin queue is arbitrarily low, and the missing
other limitations are just minefields waiting for victims.
Signed-off-by: Christoph Hellwig <hch at lst.de>
Reported-by: Jeff Lien <Jeff.Lien at hgst.com>
Tested-by: Jeff Lien <Jeff.Lien at hgst.com>
---
drivers/nvme/host/core.c | 29 +++++++++++++++++++----------
1 file changed, 19 insertions(+), 10 deletions(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 470d4f3..cfee6ac 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -840,6 +840,21 @@ int nvme_shutdown_ctrl(struct nvme_ctrl *ctrl)
return ret;
}
+static void nvme_set_queue_limits(struct nvme_ctrl *ctrl,
+ struct request_queue *q)
+{
+ if (ctrl->max_hw_sectors) {
+ blk_queue_max_hw_sectors(q, ctrl->max_hw_sectors);
+ blk_queue_max_segments(q,
+ (ctrl->max_hw_sectors / (ctrl->page_size >> 9)) + 1);
+ }
+ if (ctrl->stripe_size)
+ blk_queue_chunk_sectors(q, ctrl->stripe_size >> 9);
+ if (ctrl->vwc & NVME_CTRL_VWC_PRESENT)
+ blk_queue_flush(q, REQ_FLUSH | REQ_FUA);
+ blk_queue_virt_boundary(q, ctrl->page_size - 1);
+}
+
/*
* Initialize the cached copies of the Identify data and various controller
* register in our nvme_ctrl structure. This should be called as soon as
@@ -897,6 +912,8 @@ int nvme_init_identify(struct nvme_ctrl *ctrl)
}
}
+ nvme_set_queue_limits(ctrl, ctrl->admin_q);
+
kfree(id);
return 0;
}
@@ -1147,17 +1164,9 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
ns->disk = disk;
ns->lba_shift = 9; /* set to a default value for 512 until disk is validated */
+
blk_queue_logical_block_size(ns->queue, 1 << ns->lba_shift);
- if (ctrl->max_hw_sectors) {
- blk_queue_max_hw_sectors(ns->queue, ctrl->max_hw_sectors);
- blk_queue_max_segments(ns->queue,
- (ctrl->max_hw_sectors / (ctrl->page_size >> 9)) + 1);
- }
- if (ctrl->stripe_size)
- blk_queue_chunk_sectors(ns->queue, ctrl->stripe_size >> 9);
- if (ctrl->vwc & NVME_CTRL_VWC_PRESENT)
- blk_queue_flush(ns->queue, REQ_FLUSH | REQ_FUA);
- blk_queue_virt_boundary(ns->queue, ctrl->page_size - 1);
+ nvme_set_queue_limits(ctrl, ns->queue);
disk->major = nvme_major;
disk->first_minor = 0;
--
2.1.4
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [PATCH 3/5] nvme: fix max_segments integer truncation
2016-03-02 17:07 [for 4.5] fix regression for large NVMe user command Christoph Hellwig
2016-03-02 17:07 ` [PATCH 1/5] nvme: also set BLK_MQ_F_SHOULD_MERGE for the admin queue Christoph Hellwig
2016-03-02 17:07 ` [PATCH 2/5] nvme: set queue limits " Christoph Hellwig
@ 2016-03-02 17:07 ` Christoph Hellwig
2016-03-02 18:27 ` Keith Busch
2016-03-02 17:07 ` [PATCH 4/5] block: fix blk_rq_get_max_sectors for driver private requests Christoph Hellwig
2016-03-02 17:07 ` [PATCH 5/5] block: support large requests in blk_rq_map_user_iov Christoph Hellwig
4 siblings, 1 reply; 22+ messages in thread
From: Christoph Hellwig @ 2016-03-02 17:07 UTC (permalink / raw)
The block layer uses an unsigned short for max_segments. The way we
calculate the value for NVMe tends to generate very large 32-bit values,
which after integer truncation may lead to a zero value instead of
the desired outcome.
Signed-off-by: Christoph Hellwig <hch at lst.de>
Reported-by: Jeff Lien <Jeff.Lien at hgst.com>
Tested-by: Jeff Lien <Jeff.Lien at hgst.com>
---
drivers/nvme/host/core.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index cfee6ac..c30cb10 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -844,9 +844,11 @@ static void nvme_set_queue_limits(struct nvme_ctrl *ctrl,
struct request_queue *q)
{
if (ctrl->max_hw_sectors) {
+ u32 max_segments =
+ (ctrl->max_hw_sectors / (ctrl->page_size >> 9)) + 1;
+
blk_queue_max_hw_sectors(q, ctrl->max_hw_sectors);
- blk_queue_max_segments(q,
- (ctrl->max_hw_sectors / (ctrl->page_size >> 9)) + 1);
+ blk_queue_max_segments(q, min_t(u32, max_segments, USHRT_MAX));
}
if (ctrl->stripe_size)
blk_queue_chunk_sectors(q, ctrl->stripe_size >> 9);
--
2.1.4
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [PATCH 4/5] block: fix blk_rq_get_max_sectors for driver private requests
2016-03-02 17:07 [for 4.5] fix regression for large NVMe user command Christoph Hellwig
` (2 preceding siblings ...)
2016-03-02 17:07 ` [PATCH 3/5] nvme: fix max_segments integer truncation Christoph Hellwig
@ 2016-03-02 17:07 ` Christoph Hellwig
2016-03-02 19:17 ` Keith Busch
2016-03-02 17:07 ` [PATCH 5/5] block: support large requests in blk_rq_map_user_iov Christoph Hellwig
4 siblings, 1 reply; 22+ messages in thread
From: Christoph Hellwig @ 2016-03-02 17:07 UTC (permalink / raw)
Driver private request types should not get the artifical cap for the
FS requests. This is important to use the full device capabilities
for internal command or NVMe pass through commands.
Signed-off-by: Christoph Hellwig <hch at lst.de>
Reported-by: Jeff Lien <Jeff.Lien at hgst.com>
Tested-by: Jeff Lien <Jeff.Lien at hgst.com>
---
include/linux/blkdev.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index d7f6bca..f831dc0 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -895,7 +895,7 @@ static inline unsigned int blk_rq_get_max_sectors(struct request *rq)
{
struct request_queue *q = rq->q;
- if (unlikely(rq->cmd_type == REQ_TYPE_BLOCK_PC))
+ if (unlikely(rq->cmd_type >= REQ_TYPE_BLOCK_PC))
return q->limits.max_hw_sectors;
if (!q->limits.chunk_sectors || (rq->cmd_flags & REQ_DISCARD))
--
2.1.4
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [PATCH 5/5] block: support large requests in blk_rq_map_user_iov
2016-03-02 17:07 [for 4.5] fix regression for large NVMe user command Christoph Hellwig
` (3 preceding siblings ...)
2016-03-02 17:07 ` [PATCH 4/5] block: fix blk_rq_get_max_sectors for driver private requests Christoph Hellwig
@ 2016-03-02 17:07 ` Christoph Hellwig
2016-03-02 19:34 ` Keith Busch
4 siblings, 1 reply; 22+ messages in thread
From: Christoph Hellwig @ 2016-03-02 17:07 UTC (permalink / raw)
This patch adds support for larger requests in blk_rq_map_user_iov by
allowing it to build multiple bios for a request. This functionality
used to exist for the non-vectored blk_rq_map_user in the past, and
this patch reuses the existing functionality for it on the unmap side,
which stuck around. Thanks to the iov_iter API supporting multiple
bios is fairly trivial, as we can just iterate the iov until we've
consumed the whole iov_iter.
Signed-off-by: Christoph Hellwig <hch at lst.de>
Reported-by: Jeff Lien <Jeff.Lien at hgst.com>
Tested-by: Jeff Lien <Jeff.Lien at hgst.com>
---
block/blk-map.c | 91 ++++++++++++++++++++++++++++++++++++++-------------------
1 file changed, 61 insertions(+), 30 deletions(-)
diff --git a/block/blk-map.c b/block/blk-map.c
index f565e11..a54f054 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -57,6 +57,49 @@ static int __blk_rq_unmap_user(struct bio *bio)
return ret;
}
+static int __blk_rq_map_user_iov(struct request *rq,
+ struct rq_map_data *map_data, struct iov_iter *iter,
+ gfp_t gfp_mask, bool copy)
+{
+ struct request_queue *q = rq->q;
+ struct bio *bio, *orig_bio;
+ int ret;
+
+ if (copy)
+ bio = bio_copy_user_iov(q, map_data, iter, gfp_mask);
+ else
+ bio = bio_map_user_iov(q, iter, gfp_mask);
+
+ if (IS_ERR(bio))
+ return PTR_ERR(bio);
+
+ if (map_data && map_data->null_mapped)
+ bio_set_flag(bio, BIO_NULL_MAPPED);
+
+ iov_iter_advance(iter, bio->bi_iter.bi_size);
+ if (map_data)
+ map_data->offset += bio->bi_iter.bi_size;
+
+ orig_bio = bio;
+ blk_queue_bounce(q, &bio);
+
+ /*
+ * We link the bounce buffer in and could have to traverse it
+ * later so we have to get a ref to prevent it from being freed
+ */
+ bio_get(bio);
+
+ ret = blk_rq_append_bio(q, rq, bio);
+ if (ret) {
+ bio_endio(bio);
+ __blk_rq_unmap_user(orig_bio);
+ bio_put(bio);
+ return ret;
+ }
+
+ return 0;
+}
+
/**
* blk_rq_map_user_iov - map user data to a request, for REQ_TYPE_BLOCK_PC usage
* @q: request queue where request should be inserted
@@ -82,10 +125,11 @@ int blk_rq_map_user_iov(struct request_queue *q, struct request *rq,
struct rq_map_data *map_data,
const struct iov_iter *iter, gfp_t gfp_mask)
{
- struct bio *bio;
- int unaligned = 0;
- struct iov_iter i;
struct iovec iov, prv = {.iov_base = NULL, .iov_len = 0};
+ bool copy = (q->dma_pad_mask & iter->count) || map_data;
+ struct bio *bio = NULL;
+ struct iov_iter i;
+ int ret;
if (!iter || !iter->count)
return -EINVAL;
@@ -101,42 +145,29 @@ int blk_rq_map_user_iov(struct request_queue *q, struct request *rq,
*/
if ((uaddr & queue_dma_alignment(q)) ||
iovec_gap_to_prv(q, &prv, &iov))
- unaligned = 1;
+ copy = true;
prv.iov_base = iov.iov_base;
prv.iov_len = iov.iov_len;
}
- if (unaligned || (q->dma_pad_mask & iter->count) || map_data)
- bio = bio_copy_user_iov(q, map_data, iter, gfp_mask);
- else
- bio = bio_map_user_iov(q, iter, gfp_mask);
-
- if (IS_ERR(bio))
- return PTR_ERR(bio);
-
- if (map_data && map_data->null_mapped)
- bio_set_flag(bio, BIO_NULL_MAPPED);
-
- if (bio->bi_iter.bi_size != iter->count) {
- /*
- * Grab an extra reference to this bio, as bio_unmap_user()
- * expects to be able to drop it twice as it happens on the
- * normal IO completion path
- */
- bio_get(bio);
- bio_endio(bio);
- __blk_rq_unmap_user(bio);
- return -EINVAL;
- }
+ i = *iter;
+ do {
+ ret =__blk_rq_map_user_iov(rq, map_data, &i, gfp_mask, copy);
+ if (ret)
+ goto unmap_rq;
+ if (!bio)
+ bio = rq->bio;
+ } while (iov_iter_count(&i));
if (!bio_flagged(bio, BIO_USER_MAPPED))
rq->cmd_flags |= REQ_COPY_USER;
-
- blk_queue_bounce(q, &bio);
- bio_get(bio);
- blk_rq_bio_prep(q, rq, bio);
return 0;
+
+unmap_rq:
+ __blk_rq_unmap_user(bio);
+ rq->bio = NULL;
+ return -EINVAL;
}
EXPORT_SYMBOL(blk_rq_map_user_iov);
--
2.1.4
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [PATCH 1/5] nvme: also set BLK_MQ_F_SHOULD_MERGE for the admin queue
2016-03-02 17:07 ` [PATCH 1/5] nvme: also set BLK_MQ_F_SHOULD_MERGE for the admin queue Christoph Hellwig
@ 2016-03-02 17:35 ` Sagi Grimberg
2016-03-02 18:15 ` Keith Busch
2016-03-02 21:17 ` Christoph Hellwig
2016-03-02 20:01 ` [PATCH 1/5] nvme: also set BLK_MQ_F_SHOULD_MERGE for the admin queue Jens Axboe
1 sibling, 2 replies; 22+ messages in thread
From: Sagi Grimberg @ 2016-03-02 17:35 UTC (permalink / raw)
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 680f578..19be56a 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -1289,6 +1289,7 @@ static int nvme_alloc_admin_tags(struct nvme_dev *dev)
> dev->admin_tagset.timeout = ADMIN_TIMEOUT;
> dev->admin_tagset.numa_node = dev_to_node(dev->dev);
> dev->admin_tagset.cmd_size = nvme_cmd_size(dev);
> + dev->admin_tagset.flags = BLK_MQ_F_SHOULD_MERGE;
I still don't understand what's the point?
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 1/5] nvme: also set BLK_MQ_F_SHOULD_MERGE for the admin queue
2016-03-02 17:35 ` Sagi Grimberg
@ 2016-03-02 18:15 ` Keith Busch
2016-03-02 21:17 ` Christoph Hellwig
1 sibling, 0 replies; 22+ messages in thread
From: Keith Busch @ 2016-03-02 18:15 UTC (permalink / raw)
On Wed, Mar 02, 2016@07:35:56PM +0200, Sagi Grimberg wrote:
>
> >diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> >index 680f578..19be56a 100644
> >--- a/drivers/nvme/host/pci.c
> >+++ b/drivers/nvme/host/pci.c
> >@@ -1289,6 +1289,7 @@ static int nvme_alloc_admin_tags(struct nvme_dev *dev)
> > dev->admin_tagset.timeout = ADMIN_TIMEOUT;
> > dev->admin_tagset.numa_node = dev_to_node(dev->dev);
> > dev->admin_tagset.cmd_size = nvme_cmd_size(dev);
> >+ dev->admin_tagset.flags = BLK_MQ_F_SHOULD_MERGE;
>
> I still don't understand what's the point?
I agree, this looks like a no-op. The admin request queue isn't for
block data, and the admin queue only uses cmd_type REQ_TYPE_DRV_PRIV. I
believe only requests of REQ_TYPE_FS type are mergable.
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 3/5] nvme: fix max_segments integer truncation
2016-03-02 17:07 ` [PATCH 3/5] nvme: fix max_segments integer truncation Christoph Hellwig
@ 2016-03-02 18:27 ` Keith Busch
0 siblings, 0 replies; 22+ messages in thread
From: Keith Busch @ 2016-03-02 18:27 UTC (permalink / raw)
On Wed, Mar 02, 2016@06:07:12PM +0100, Christoph Hellwig wrote:
> The block layer uses an unsigned short for max_segments. The way we
> calculate the value for NVMe tends to generate very large 32-bit values,
> which after integer truncation may lead to a zero value instead of
> the desired outcome.
>
> Signed-off-by: Christoph Hellwig <hch at lst.de>
> Reported-by: Jeff Lien <Jeff.Lien at hgst.com>
> Tested-by: Jeff Lien <Jeff.Lien at hgst.com>
Looks good.
Reviewed-by: Keith Busch <keith.busch at intel.com>
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 2/5] nvme: set queue limits for the admin queue
2016-03-02 17:07 ` [PATCH 2/5] nvme: set queue limits " Christoph Hellwig
@ 2016-03-02 19:10 ` Keith Busch
0 siblings, 0 replies; 22+ messages in thread
From: Keith Busch @ 2016-03-02 19:10 UTC (permalink / raw)
On Wed, Mar 02, 2016@06:07:11PM +0100, Christoph Hellwig wrote:
> Factor out a helper to set all the device specific queue limits and apply
> them to the admin queue in addition to the I/O queues. Without this the
> command size on the admin queue is arbitrarily low, and the missing
> other limitations are just minefields waiting for victims.
>
> Signed-off-by: Christoph Hellwig <hch at lst.de>
> Reported-by: Jeff Lien <Jeff.Lien at hgst.com>
> Tested-by: Jeff Lien <Jeff.Lien at hgst.com>
Looks fine. The chunk size and flush settings don't hurt, but are
unnecessary for an admin queue.
Reviewed-by: Keith Busch <keith.busch at intel.com>
> +static void nvme_set_queue_limits(struct nvme_ctrl *ctrl,
> + struct request_queue *q)
> +{
> + if (ctrl->max_hw_sectors) {
> + blk_queue_max_hw_sectors(q, ctrl->max_hw_sectors);
> + blk_queue_max_segments(q,
> + (ctrl->max_hw_sectors / (ctrl->page_size >> 9)) + 1);
> + }
> + if (ctrl->stripe_size)
> + blk_queue_chunk_sectors(q, ctrl->stripe_size >> 9);
> + if (ctrl->vwc & NVME_CTRL_VWC_PRESENT)
> + blk_queue_flush(q, REQ_FLUSH | REQ_FUA);
> + blk_queue_virt_boundary(q, ctrl->page_size - 1);
> +}
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 4/5] block: fix blk_rq_get_max_sectors for driver private requests
2016-03-02 17:07 ` [PATCH 4/5] block: fix blk_rq_get_max_sectors for driver private requests Christoph Hellwig
@ 2016-03-02 19:17 ` Keith Busch
2016-03-02 21:18 ` Christoph Hellwig
0 siblings, 1 reply; 22+ messages in thread
From: Keith Busch @ 2016-03-02 19:17 UTC (permalink / raw)
On Wed, Mar 02, 2016@06:07:13PM +0100, Christoph Hellwig wrote:
> Driver private request types should not get the artifical cap for the
> FS requests. This is important to use the full device capabilities
> for internal command or NVMe pass through commands.
>
> Signed-off-by: Christoph Hellwig <hch at lst.de>
> Reported-by: Jeff Lien <Jeff.Lien at hgst.com>
> Tested-by: Jeff Lien <Jeff.Lien at hgst.com>
> ---
Looks good if we can depend on the enum order. Maybe a more explicit
check for the one type that does support checking chunk sectors instead:
if (unlikely(rq->cmd_type != REQ_TYPE_FS))
But it looks fine as-is, too.
Reviewed-by: Keith Busch <keith.busch at intel.com>
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 5/5] block: support large requests in blk_rq_map_user_iov
2016-03-02 17:07 ` [PATCH 5/5] block: support large requests in blk_rq_map_user_iov Christoph Hellwig
@ 2016-03-02 19:34 ` Keith Busch
2016-03-02 21:20 ` Christoph Hellwig
0 siblings, 1 reply; 22+ messages in thread
From: Keith Busch @ 2016-03-02 19:34 UTC (permalink / raw)
On Wed, Mar 02, 2016@06:07:14PM +0100, Christoph Hellwig wrote:
> This patch adds support for larger requests in blk_rq_map_user_iov by
> allowing it to build multiple bios for a request. This functionality
> used to exist for the non-vectored blk_rq_map_user in the past, and
> this patch reuses the existing functionality for it on the unmap side,
> which stuck around. Thanks to the iov_iter API supporting multiple
> bios is fairly trivial, as we can just iterate the iov until we've
> consumed the whole iov_iter.
>
> Signed-off-by: Christoph Hellwig <hch at lst.de>
> Reported-by: Jeff Lien <Jeff.Lien at hgst.com>
> Tested-by: Jeff Lien <Jeff.Lien at hgst.com>
This looks good too, though I think patch 2/5 in this series on its
own should have fixed the transfer issues for NVMe. AFAICT, the only
reason the iterator couldn't be fully copied into a single bio is if
blk_add_pc_page already hit a queue limit, and blk_rq_append_bio would
fail for the same reason.
Reviewed-by: Keith Busch <keith.busch at intel.com>
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 1/5] nvme: also set BLK_MQ_F_SHOULD_MERGE for the admin queue
2016-03-02 17:07 ` [PATCH 1/5] nvme: also set BLK_MQ_F_SHOULD_MERGE for the admin queue Christoph Hellwig
2016-03-02 17:35 ` Sagi Grimberg
@ 2016-03-02 20:01 ` Jens Axboe
1 sibling, 0 replies; 22+ messages in thread
From: Jens Axboe @ 2016-03-02 20:01 UTC (permalink / raw)
On 03/02/2016 10:07 AM, Christoph Hellwig wrote:
> There is no reason to treat the admin queue different in terms of
> request merging.
>
> Signed-off-by: Christoph Hellwig <hch at lst.de>
> Reported-by: Jeff Lien <Jeff.Lien at hgst.com>
> Tested-by: Jeff Lien <Jeff.Lien at hgst.com>
> ---
> drivers/nvme/host/pci.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 680f578..19be56a 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -1289,6 +1289,7 @@ static int nvme_alloc_admin_tags(struct nvme_dev *dev)
> dev->admin_tagset.timeout = ADMIN_TIMEOUT;
> dev->admin_tagset.numa_node = dev_to_node(dev->dev);
> dev->admin_tagset.cmd_size = nvme_cmd_size(dev);
> + dev->admin_tagset.flags = BLK_MQ_F_SHOULD_MERGE;
> dev->admin_tagset.driver_data = dev;
>
> if (blk_mq_alloc_tag_set(&dev->admin_tagset))
Agree with Keith and others that his doesn't make a lot of sense. It's
not that it'll break anything, but it won't change anything either.
I've applied 2-5/5 to for-linus. I guess that's one more day or not
shipping to Linus...
--
Jens Axboe
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 1/5] nvme: also set BLK_MQ_F_SHOULD_MERGE for the admin queue
2016-03-02 17:35 ` Sagi Grimberg
2016-03-02 18:15 ` Keith Busch
@ 2016-03-02 21:17 ` Christoph Hellwig
2016-03-02 23:55 ` Ming Lei
1 sibling, 1 reply; 22+ messages in thread
From: Christoph Hellwig @ 2016-03-02 21:17 UTC (permalink / raw)
On Wed, Mar 02, 2016@07:35:56PM +0200, Sagi Grimberg wrote:
>> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
>> index 680f578..19be56a 100644
>> --- a/drivers/nvme/host/pci.c
>> +++ b/drivers/nvme/host/pci.c
>> @@ -1289,6 +1289,7 @@ static int nvme_alloc_admin_tags(struct nvme_dev *dev)
>> dev->admin_tagset.timeout = ADMIN_TIMEOUT;
>> dev->admin_tagset.numa_node = dev_to_node(dev->dev);
>> dev->admin_tagset.cmd_size = nvme_cmd_size(dev);
>> + dev->admin_tagset.flags = BLK_MQ_F_SHOULD_MERGE;
>
> I still don't understand what's the point?
still? :)
But yes, we're not going to hit the merge case for the passthrough
command indeed. Except for keeping the flags in sync, which sounds
like a good ?dea in general there is not reall need for this one.
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 4/5] block: fix blk_rq_get_max_sectors for driver private requests
2016-03-02 19:17 ` Keith Busch
@ 2016-03-02 21:18 ` Christoph Hellwig
2016-03-02 22:16 ` Jens Axboe
0 siblings, 1 reply; 22+ messages in thread
From: Christoph Hellwig @ 2016-03-02 21:18 UTC (permalink / raw)
On Wed, Mar 02, 2016@07:17:05PM +0000, Keith Busch wrote:
> check for the one type that does support checking chunk sectors instead:
>
> if (unlikely(rq->cmd_type != REQ_TYPE_FS))
>
> But it looks fine as-is, too.
Yes, that's probably better in the long run. Jens, can you fold the
suggestion from Keith into the patch?
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 5/5] block: support large requests in blk_rq_map_user_iov
2016-03-02 19:34 ` Keith Busch
@ 2016-03-02 21:20 ` Christoph Hellwig
0 siblings, 0 replies; 22+ messages in thread
From: Christoph Hellwig @ 2016-03-02 21:20 UTC (permalink / raw)
On Wed, Mar 02, 2016@07:34:22PM +0000, Keith Busch wrote:
> This looks good too, though I think patch 2/5 in this series on its
> own should have fixed the transfer issues for NVMe. AFAICT, the only
> reason the iterator couldn't be fully copied into a single bio is if
> blk_add_pc_page already hit a queue limit, and blk_rq_append_bio would
> fail for the same reason.
For requests <= 1MB that's true. But the firmware dump is using larger
transfers, and unless we can merge enough fragments we're still going
to need multiple bios.
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 4/5] block: fix blk_rq_get_max_sectors for driver private requests
2016-03-02 21:18 ` Christoph Hellwig
@ 2016-03-02 22:16 ` Jens Axboe
0 siblings, 0 replies; 22+ messages in thread
From: Jens Axboe @ 2016-03-02 22:16 UTC (permalink / raw)
On 03/02/2016 02:18 PM, Christoph Hellwig wrote:
> On Wed, Mar 02, 2016@07:17:05PM +0000, Keith Busch wrote:
>> check for the one type that does support checking chunk sectors instead:
>>
>> if (unlikely(rq->cmd_type != REQ_TYPE_FS))
>>
>> But it looks fine as-is, too.
>
> Yes, that's probably better in the long run. Jens, can you fold the
> suggestion from Keith into the patch?
Too late to fold, but I fixed it separately.
--
Jens Axboe
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 1/5] nvme: also set BLK_MQ_F_SHOULD_MERGE for the admin queue
2016-03-02 21:17 ` Christoph Hellwig
@ 2016-03-02 23:55 ` Ming Lei
[not found] ` <20160303084609.GA14051@lst.de>
0 siblings, 1 reply; 22+ messages in thread
From: Ming Lei @ 2016-03-02 23:55 UTC (permalink / raw)
On Thu, Mar 3, 2016@5:17 AM, Christoph Hellwig <hch@lst.de> wrote:
> On Wed, Mar 02, 2016@07:35:56PM +0200, Sagi Grimberg wrote:
>>> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
>>> index 680f578..19be56a 100644
>>> --- a/drivers/nvme/host/pci.c
>>> +++ b/drivers/nvme/host/pci.c
>>> @@ -1289,6 +1289,7 @@ static int nvme_alloc_admin_tags(struct nvme_dev *dev)
>>> dev->admin_tagset.timeout = ADMIN_TIMEOUT;
>>> dev->admin_tagset.numa_node = dev_to_node(dev->dev);
>>> dev->admin_tagset.cmd_size = nvme_cmd_size(dev);
>>> + dev->admin_tagset.flags = BLK_MQ_F_SHOULD_MERGE;
>>
>> I still don't understand what's the point?
>
> still? :)
>
> But yes, we're not going to hit the merge case for the passthrough
> command indeed. Except for keeping the flags in sync, which sounds
> like a good ?dea in general there is not reall need for this one.
NO_SG_MERGE has been bypassed after bio splitting is introduced,
and looks it should have been removed.
--
Ming Lei
^ permalink raw reply [flat|nested] 22+ messages in thread
* blk-mq merge flags
[not found] ` <20160303084609.GA14051@lst.de>
@ 2016-03-03 11:24 ` Ming Lei
2016-03-03 11:59 ` Christoph Hellwig
2016-03-15 16:08 ` Christoph Hellwig
1 sibling, 1 reply; 22+ messages in thread
From: Ming Lei @ 2016-03-03 11:24 UTC (permalink / raw)
On Thu, Mar 3, 2016@4:46 PM, Christoph Hellwig <hch@lst.de> wrote:
> On Thu, Mar 03, 2016@07:55:15AM +0800, Ming Lei wrote:
>> > command indeed. Except for keeping the flags in sync, which sounds
>> > like a good ?dea in general there is not reall need for this one.
>>
>> NO_SG_MERGE has been bypassed after bio splitting is introduced,
>> and looks it should have been removed.
>
> Partially at least. Anyway, let's start a discussion on blk-mq merge
> flags.
>
> First I really hate the way how any merges are opt-in. I'd be much
> happier with an opt-out to get the sane behavior by default. Every
> drivers set BLK_MQ_F_SHOULD_MERGE, and while only a few drivers set
> BLK_MQ_F_SG_MERGE I suspect all should as well. Especially as the
> SG merging is cheaper and more useful than the bio mergig in many
> cases.
After bio splitting is introduced, SG merging becomes much cheaper, and
I can't see the difference between SG_MERGE and NO_SG_MERGE when
doing test over null_blk.
Once multipage bvecs is supported, SG merging should be always because
the bvec stored in bio->bi_io_vec can be thought as sort of segment.
Also from hardware view, it is often more efficient to handle less segments.
Last time Keith mentioned the point too about NVMe.
That is why I think we might consider to remove the flag.
--
Ming Lei
^ permalink raw reply [flat|nested] 22+ messages in thread
* blk-mq merge flags
2016-03-03 11:24 ` blk-mq merge flags Ming Lei
@ 2016-03-03 11:59 ` Christoph Hellwig
0 siblings, 0 replies; 22+ messages in thread
From: Christoph Hellwig @ 2016-03-03 11:59 UTC (permalink / raw)
On Thu, Mar 03, 2016@07:24:10PM +0800, Ming Lei wrote:
> After bio splitting is introduced, SG merging becomes much cheaper, and
> I can't see the difference between SG_MERGE and NO_SG_MERGE when
> doing test over null_blk.
Ok.
> Once multipage bvecs is supported, SG merging should be always because
> the bvec stored in bio->bi_io_vec can be thought as sort of segment.
>
> Also from hardware view, it is often more efficient to handle less segments.
> Last time Keith mentioned the point too about NVMe.
>
> That is why I think we might consider to remove the flag.
I hate these sorts of flags, but we have to weight this against the
reason they were introduced. I think the rational was for workloads
that do small (4k or less) I/Os and really care about latency.
But I'd be much happier if we'd only allow this to be turned off at
runtime through sysfs, and not requiring the drivers to opt for
sane behavior at compile time.
I also agree about the multipage biovecs - if we build large iovecs
directly in bio_add_page we get S/G merging for a two comparisms and
a branch, so it'll be essentially for free. But for that we need
multipage biovecs first.
^ permalink raw reply [flat|nested] 22+ messages in thread
* blk-mq merge flags
[not found] ` <20160303084609.GA14051@lst.de>
2016-03-03 11:24 ` blk-mq merge flags Ming Lei
@ 2016-03-15 16:08 ` Christoph Hellwig
2016-03-15 22:13 ` Jens Axboe
1 sibling, 1 reply; 22+ messages in thread
From: Christoph Hellwig @ 2016-03-15 16:08 UTC (permalink / raw)
Jens, Keith -
any comments? I think the lack of BLK_MQ_F_SG_MERGE for nvme
is rather harmful at the moment, as I get much better request
packing with it. I'd rather fix this in the block layer, but
if that's not ok I can also send a pure NVMe patch.
^ permalink raw reply [flat|nested] 22+ messages in thread
* blk-mq merge flags
2016-03-15 16:08 ` Christoph Hellwig
@ 2016-03-15 22:13 ` Jens Axboe
0 siblings, 0 replies; 22+ messages in thread
From: Jens Axboe @ 2016-03-15 22:13 UTC (permalink / raw)
On 03/15/2016 09:08 AM, Christoph Hellwig wrote:
> Jens, Keith -
>
> any comments? I think the lack of BLK_MQ_F_SG_MERGE for nvme
> is rather harmful at the moment, as I get much better request
> packing with it. I'd rather fix this in the block layer, but
> if that's not ok I can also send a pure NVMe patch.
We can default to merging, and have drivers opt out, if they need to.
NVMe is probably the only one that doesn't set the flag currently, so
the net effect of the two different approaches would be the same. We'll
have a slightly higher CPU overhead without a perf win for workloads
that don't get merging, but overall it's always a win if we can merge
commands. So a better default, I think, even for nvme.
--
Jens Axboe
^ permalink raw reply [flat|nested] 22+ messages in thread
end of thread, other threads:[~2016-03-15 22:13 UTC | newest]
Thread overview: 22+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-03-02 17:07 [for 4.5] fix regression for large NVMe user command Christoph Hellwig
2016-03-02 17:07 ` [PATCH 1/5] nvme: also set BLK_MQ_F_SHOULD_MERGE for the admin queue Christoph Hellwig
2016-03-02 17:35 ` Sagi Grimberg
2016-03-02 18:15 ` Keith Busch
2016-03-02 21:17 ` Christoph Hellwig
2016-03-02 23:55 ` Ming Lei
[not found] ` <20160303084609.GA14051@lst.de>
2016-03-03 11:24 ` blk-mq merge flags Ming Lei
2016-03-03 11:59 ` Christoph Hellwig
2016-03-15 16:08 ` Christoph Hellwig
2016-03-15 22:13 ` Jens Axboe
2016-03-02 20:01 ` [PATCH 1/5] nvme: also set BLK_MQ_F_SHOULD_MERGE for the admin queue Jens Axboe
2016-03-02 17:07 ` [PATCH 2/5] nvme: set queue limits " Christoph Hellwig
2016-03-02 19:10 ` Keith Busch
2016-03-02 17:07 ` [PATCH 3/5] nvme: fix max_segments integer truncation Christoph Hellwig
2016-03-02 18:27 ` Keith Busch
2016-03-02 17:07 ` [PATCH 4/5] block: fix blk_rq_get_max_sectors for driver private requests Christoph Hellwig
2016-03-02 19:17 ` Keith Busch
2016-03-02 21:18 ` Christoph Hellwig
2016-03-02 22:16 ` Jens Axboe
2016-03-02 17:07 ` [PATCH 5/5] block: support large requests in blk_rq_map_user_iov Christoph Hellwig
2016-03-02 19:34 ` Keith Busch
2016-03-02 21:20 ` Christoph Hellwig
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.