linux-block.vger.kernel.org archive mirror
* [PATCH 0/3] Performance enhancements
@ 2017-12-21 20:46 Keith Busch
  2017-12-21 20:46 ` [PATCH 1/3] nvme/pci: Start request after doorbell ring Keith Busch
                   ` (2 more replies)
  0 siblings, 3 replies; 34+ messages in thread
From: Keith Busch @ 2017-12-21 20:46 UTC (permalink / raw)
  To: linux-nvme, linux-block, Christoph Hellwig, Jens Axboe,
	Sagi Grimberg
  Cc: Keith Busch

A few micro-optimizations for IO polling and NVMe. I'm really working
to close the performance gap with userspace drivers, and this gets me
halfway there on latency. The fastest hardware I could get measured
roundtrip read latency at 5usec with this series, down from 5.7usec
previously.

Note that with NVMe, you really need to crank up interrupt coalescing to
see the completion polling benefit.

Test pre-setup:

  echo performance | tee /sys/devices/system/cpu/cpufreq/policy*/scaling_governor
  echo 0 > /sys/block/nvme0n1/queue/iostats
  echo -1 > /sys/block/nvme0n1/queue/io_poll_delay
  nvme set-feature /dev/nvme0 -f 8 -v 0x4ff

fio profile:

  [global]
  ioengine=pvsync2
  rw=randread
  norandommap
  direct=1
  bs=4k
  hipri

  [hi-pri]
  filename=/dev/nvme0n1
  cpus_allowed=2

Keith Busch (3):
  nvme/pci: Start request after doorbell ring
  nvme/pci: Remove cq_vector check in IO path
  block: Polling completion performance optimization

 drivers/nvme/host/pci.c | 14 +++-----------
 fs/block_dev.c          |  5 ++++-
 2 files changed, 7 insertions(+), 12 deletions(-)

-- 
2.13.6

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 1/3] nvme/pci: Start request after doorbell ring
  2017-12-21 20:46 [PATCH 0/3] Performance enhancements Keith Busch
@ 2017-12-21 20:46 ` Keith Busch
  2017-12-21 20:49   ` Jens Axboe
                     ` (2 more replies)
  2017-12-21 20:46 ` [PATCH 2/3] nvme/pci: Remove cq_vector check in IO path Keith Busch
  2017-12-21 20:46 ` [PATCH 3/3] block: Polling completion performance optimization Keith Busch
  2 siblings, 3 replies; 34+ messages in thread
From: Keith Busch @ 2017-12-21 20:46 UTC (permalink / raw)
  To: linux-nvme, linux-block, Christoph Hellwig, Jens Axboe,
	Sagi Grimberg
  Cc: Keith Busch

This is a performance optimization that allows the hardware to work on
a command in parallel with the kernel's stats and timeout tracking.

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 drivers/nvme/host/pci.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index f5800c3c9082..df5550ce0531 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -886,8 +886,6 @@ static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
 			goto out_cleanup_iod;
 	}
 
-	blk_mq_start_request(req);
-
 	spin_lock_irq(&nvmeq->q_lock);
 	if (unlikely(nvmeq->cq_vector < 0)) {
 		ret = BLK_STS_IOERR;
@@ -895,6 +893,7 @@ static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
 		goto out_cleanup_iod;
 	}
 	__nvme_submit_cmd(nvmeq, &cmnd);
+	blk_mq_start_request(req);
 	nvme_process_cq(nvmeq);
 	spin_unlock_irq(&nvmeq->q_lock);
 	return BLK_STS_OK;
-- 
2.13.6


* [PATCH 2/3] nvme/pci: Remove cq_vector check in IO path
  2017-12-21 20:46 [PATCH 0/3] Performance enhancements Keith Busch
  2017-12-21 20:46 ` [PATCH 1/3] nvme/pci: Start request after doorbell ring Keith Busch
@ 2017-12-21 20:46 ` Keith Busch
  2017-12-21 20:54   ` Jens Axboe
  2017-12-25 10:10   ` Sagi Grimberg
  2017-12-21 20:46 ` [PATCH 3/3] block: Polling completion performance optimization Keith Busch
  2 siblings, 2 replies; 34+ messages in thread
From: Keith Busch @ 2017-12-21 20:46 UTC (permalink / raw)
  To: linux-nvme, linux-block, Christoph Hellwig, Jens Axboe,
	Sagi Grimberg
  Cc: Keith Busch

This is a micro-optimization removing an unnecessary check for a disabled
queue. We no longer need this check because blk-mq provides the ability
to quiesce queues, which nvme uses, and the doorbell registers are never
unmapped as long as requests are active.

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 drivers/nvme/host/pci.c | 11 ++---------
 1 file changed, 2 insertions(+), 9 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index df5550ce0531..0be5124a3a49 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -887,11 +887,6 @@ static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
 	}
 
 	spin_lock_irq(&nvmeq->q_lock);
-	if (unlikely(nvmeq->cq_vector < 0)) {
-		ret = BLK_STS_IOERR;
-		spin_unlock_irq(&nvmeq->q_lock);
-		goto out_cleanup_iod;
-	}
 	__nvme_submit_cmd(nvmeq, &cmnd);
 	blk_mq_start_request(req);
 	nvme_process_cq(nvmeq);
@@ -923,11 +918,9 @@ static inline void nvme_ring_cq_doorbell(struct nvme_queue *nvmeq)
 {
 	u16 head = nvmeq->cq_head;
 
-	if (likely(nvmeq->cq_vector >= 0)) {
-		if (nvme_dbbuf_update_and_check_event(head, nvmeq->dbbuf_cq_db,
+	if (nvme_dbbuf_update_and_check_event(head, nvmeq->dbbuf_cq_db,
 						      nvmeq->dbbuf_cq_ei))
-			writel(head, nvmeq->q_db + nvmeq->dev->db_stride);
-	}
+		writel(head, nvmeq->q_db + nvmeq->dev->db_stride);
 }
 
 static inline void nvme_handle_cqe(struct nvme_queue *nvmeq,
-- 
2.13.6


* [PATCH 3/3] block: Polling completion performance optimization
  2017-12-21 20:46 [PATCH 0/3] Performance enhancements Keith Busch
  2017-12-21 20:46 ` [PATCH 1/3] nvme/pci: Start request after doorbell ring Keith Busch
  2017-12-21 20:46 ` [PATCH 2/3] nvme/pci: Remove cq_vector check in IO path Keith Busch
@ 2017-12-21 20:46 ` Keith Busch
  2017-12-21 20:56   ` Scott Bauer
                     ` (2 more replies)
  2 siblings, 3 replies; 34+ messages in thread
From: Keith Busch @ 2017-12-21 20:46 UTC (permalink / raw)
  To: linux-nvme, linux-block, Christoph Hellwig, Jens Axboe,
	Sagi Grimberg
  Cc: Keith Busch

When a request completion is polled, the completion task wakes itself
up. This is unnecessary, as the task can just set itself back to
running.

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 fs/block_dev.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 4a181fcb5175..ce67ffe73430 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -181,7 +181,10 @@ static void blkdev_bio_end_io_simple(struct bio *bio)
 	struct task_struct *waiter = bio->bi_private;
 
 	WRITE_ONCE(bio->bi_private, NULL);
-	wake_up_process(waiter);
+	if (current != waiter)
+		wake_up_process(waiter);
+	else
+		__set_current_state(TASK_RUNNING);
 }
 
 static ssize_t
-- 
2.13.6


* Re: [PATCH 1/3] nvme/pci: Start request after doorbell ring
  2017-12-21 20:46 ` [PATCH 1/3] nvme/pci: Start request after doorbell ring Keith Busch
@ 2017-12-21 20:49   ` Jens Axboe
  2017-12-21 20:53     ` Jens Axboe
  2017-12-25 10:11   ` Sagi Grimberg
  2017-12-29  9:44   ` Christoph Hellwig
  2 siblings, 1 reply; 34+ messages in thread
From: Jens Axboe @ 2017-12-21 20:49 UTC (permalink / raw)
  To: Keith Busch, linux-nvme, linux-block, Christoph Hellwig,
	Sagi Grimberg

On 12/21/17 1:46 PM, Keith Busch wrote:
> This is a performance optimization that allows the hardware to work on
> a command in parallel with the kernel's stats and timeout tracking.
> 
> Signed-off-by: Keith Busch <keith.busch@intel.com>
> ---
>  drivers/nvme/host/pci.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index f5800c3c9082..df5550ce0531 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -886,8 +886,6 @@ static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
>  			goto out_cleanup_iod;
>  	}
>  
> -	blk_mq_start_request(req);
> -
>  	spin_lock_irq(&nvmeq->q_lock);
>  	if (unlikely(nvmeq->cq_vector < 0)) {
>  		ret = BLK_STS_IOERR;
> @@ -895,6 +893,7 @@ static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
>  		goto out_cleanup_iod;
>  	}
>  	__nvme_submit_cmd(nvmeq, &cmnd);
> +	blk_mq_start_request(req);
>  	nvme_process_cq(nvmeq);
>  	spin_unlock_irq(&nvmeq->q_lock);
>  	return BLK_STS_OK;

I guess this will work since we hold the q_lock, but you should probably
include a comment to that effect, since it's important that we can't find
the completion before the request has been started.

Actually, you'd need to reorder this after #2 (I'm assuming; it hasn't
shown up yet, but I'm guessing it kills the cq check after submission)
since if we have multiple CPUs on the same hardware queue, one of
the other CPUs could find a completion event between your
__nvme_submit_cmd() and blk_mq_start_request() call. Very short window,
but it does exist.

-- 
Jens Axboe


* Re: [PATCH 1/3] nvme/pci: Start request after doorbell ring
  2017-12-21 20:49   ` Jens Axboe
@ 2017-12-21 20:53     ` Jens Axboe
  2017-12-21 21:02       ` Keith Busch
  0 siblings, 1 reply; 34+ messages in thread
From: Jens Axboe @ 2017-12-21 20:53 UTC (permalink / raw)
  To: Keith Busch, linux-nvme, linux-block, Christoph Hellwig,
	Sagi Grimberg

On 12/21/17 1:49 PM, Jens Axboe wrote:
> On 12/21/17 1:46 PM, Keith Busch wrote:
>> This is a performance optimization that allows the hardware to work on
>> a command in parallel with the kernel's stats and timeout tracking.
>>
>> Signed-off-by: Keith Busch <keith.busch@intel.com>
>> ---
>>  drivers/nvme/host/pci.c | 3 +--
>>  1 file changed, 1 insertion(+), 2 deletions(-)
>>
>> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
>> index f5800c3c9082..df5550ce0531 100644
>> --- a/drivers/nvme/host/pci.c
>> +++ b/drivers/nvme/host/pci.c
>> @@ -886,8 +886,6 @@ static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
>>  			goto out_cleanup_iod;
>>  	}
>>  
>> -	blk_mq_start_request(req);
>> -
>>  	spin_lock_irq(&nvmeq->q_lock);
>>  	if (unlikely(nvmeq->cq_vector < 0)) {
>>  		ret = BLK_STS_IOERR;
>> @@ -895,6 +893,7 @@ static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
>>  		goto out_cleanup_iod;
>>  	}
>>  	__nvme_submit_cmd(nvmeq, &cmnd);
>> +	blk_mq_start_request(req);
>>  	nvme_process_cq(nvmeq);
>>  	spin_unlock_irq(&nvmeq->q_lock);
>>  	return BLK_STS_OK;
> 
> I guess this will work since we hold the q_lock, but you should probably
> include a comment to that effect, since it's important that we can't find
> the completion before the request has been started.
> 
> Actually, you'd need to reorder this after #2 (I'm assuming; it hasn't
> shown up yet, but I'm guessing it kills the cq check after submission)
> since if we have multiple CPUs on the same hardware queue, one of
> the other CPUs could find a completion event between your
> __nvme_submit_cmd() and blk_mq_start_request() call. Very short window,
> but it does exist.

Turns out that wasn't what patch 2 was. And the code is right there
above as well, and under the q_lock, so I guess that race doesn't
exist.

But that does bring up the question of whether we should always be doing
the nvme_process_cq(nvmeq) after IO submission. For direct/hipri IO,
maybe it's better to make the submission path faster and skip it?

-- 
Jens Axboe


* Re: [PATCH 2/3] nvme/pci: Remove cq_vector check in IO path
  2017-12-21 20:46 ` [PATCH 2/3] nvme/pci: Remove cq_vector check in IO path Keith Busch
@ 2017-12-21 20:54   ` Jens Axboe
  2017-12-25 10:10   ` Sagi Grimberg
  1 sibling, 0 replies; 34+ messages in thread
From: Jens Axboe @ 2017-12-21 20:54 UTC (permalink / raw)
  To: Keith Busch, linux-nvme, linux-block, Christoph Hellwig,
	Sagi Grimberg

On 12/21/17 1:46 PM, Keith Busch wrote:
> This is a micro-optimization removing an unnecessary check for a disabled
> queue. We no longer need this check because blk-mq provides the ability
> to quiesce queues, which nvme uses, and the doorbell registers are never
> unmapped as long as requests are active.

Reviewed-by: Jens Axboe <axboe@kernel.dk>

-- 
Jens Axboe


* Re: [PATCH 3/3] block: Polling completion performance optimization
  2017-12-21 20:46 ` [PATCH 3/3] block: Polling completion performance optimization Keith Busch
@ 2017-12-21 20:56   ` Scott Bauer
  2017-12-21 21:00     ` Jens Axboe
  2017-12-21 20:57   ` Jens Axboe
  2017-12-29  9:51   ` Christoph Hellwig
  2 siblings, 1 reply; 34+ messages in thread
From: Scott Bauer @ 2017-12-21 20:56 UTC (permalink / raw)
  To: Keith Busch, linux-nvme, linux-block, Christoph Hellwig,
	Jens Axboe, Sagi Grimberg



On 12/21/2017 01:46 PM, Keith Busch wrote:
> When a request completion is polled, the completion task wakes itself
> up. This is unnecessary, as the task can just set itself back to
> running.
> 
> Signed-off-by: Keith Busch <keith.busch@intel.com>
> ---
>  fs/block_dev.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 4a181fcb5175..ce67ffe73430 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -181,7 +181,10 @@ static void blkdev_bio_end_io_simple(struct bio *bio)
>  	struct task_struct *waiter = bio->bi_private;
>  
>  	WRITE_ONCE(bio->bi_private, NULL);
> -	wake_up_process(waiter);
> +	if (current != waiter)
> +		wake_up_process(waiter);
> +	else
> +		__set_current_state(TASK_RUNNING);

Do we actually need to set this to TASK_RUNNING? If we get here we're
already running, right?

Everywhere I see uses of __set_current_state(TASK_RUNNING), it's after
we've done a set_current_state(TASK_INTERRUPTIBLE).


* Re: [PATCH 3/3] block: Polling completion performance optimization
  2017-12-21 20:46 ` [PATCH 3/3] block: Polling completion performance optimization Keith Busch
  2017-12-21 20:56   ` Scott Bauer
@ 2017-12-21 20:57   ` Jens Axboe
  2017-12-29  9:51   ` Christoph Hellwig
  2 siblings, 0 replies; 34+ messages in thread
From: Jens Axboe @ 2017-12-21 20:57 UTC (permalink / raw)
  To: Keith Busch, linux-nvme, linux-block, Christoph Hellwig,
	Sagi Grimberg

On 12/21/17 1:46 PM, Keith Busch wrote:
> When a request completion is polled, the completion task wakes itself
> up. This is unnecessary, as the task can just set itself back to
> running.

Looks good to me, I can take it for 4.16 in the block tree.

-- 
Jens Axboe


* Re: [PATCH 3/3] block: Polling completion performance optimization
  2017-12-21 20:56   ` Scott Bauer
@ 2017-12-21 21:00     ` Jens Axboe
  2017-12-21 21:34       ` Keith Busch
  0 siblings, 1 reply; 34+ messages in thread
From: Jens Axboe @ 2017-12-21 21:00 UTC (permalink / raw)
  To: Scott Bauer, Keith Busch, linux-nvme, linux-block,
	Christoph Hellwig, Sagi Grimberg

On 12/21/17 1:56 PM, Scott Bauer wrote:
> 
> 
> On 12/21/2017 01:46 PM, Keith Busch wrote:
>> When a request completion is polled, the completion task wakes itself
>> up. This is unnecessary, as the task can just set itself back to
>> running.
>>
>> Signed-off-by: Keith Busch <keith.busch@intel.com>
>> ---
>>  fs/block_dev.c | 5 ++++-
>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/block_dev.c b/fs/block_dev.c
>> index 4a181fcb5175..ce67ffe73430 100644
>> --- a/fs/block_dev.c
>> +++ b/fs/block_dev.c
>> @@ -181,7 +181,10 @@ static void blkdev_bio_end_io_simple(struct bio *bio)
>>  	struct task_struct *waiter = bio->bi_private;
>>  
>>  	WRITE_ONCE(bio->bi_private, NULL);
>> -	wake_up_process(waiter);
>> +	if (current != waiter)
>> +		wake_up_process(waiter);
>> +	else
>> +		__set_current_state(TASK_RUNNING);
> 
> Do we actually need to set this to TASK_RUNNING? If we get here we're already running, right?
> 
> Everywhere I see uses of __set_current_state(TASK_RUNNING) it's after we've done a set_current_state(TASK_INTERRUPTIBLE).

We'd only be TASK_RUNNING if the IRQ got to it first. And that's something that
should be removed as well - I suspect that'd be a much bigger win, getting rid
of the IRQ trigger for polled IO, than most of the micro optimizations. For
Keith's testing, looks like he reduced the cost by turning on coalescing, but
it'd be cheaper (and better) to not have to rely on that.

In any case, the __set_current_state() is very cheap. Given that and the
above, I'd rather keep that.

-- 
Jens Axboe


* Re: [PATCH 1/3] nvme/pci: Start request after doorbell ring
  2017-12-21 21:02       ` Keith Busch
@ 2017-12-21 21:01         ` Jens Axboe
  2018-01-03 20:21           ` Keith Busch
  2017-12-25 10:12         ` Sagi Grimberg
  1 sibling, 1 reply; 34+ messages in thread
From: Jens Axboe @ 2017-12-21 21:01 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-nvme, linux-block, Christoph Hellwig, Sagi Grimberg

On 12/21/17 2:02 PM, Keith Busch wrote:
> On Thu, Dec 21, 2017 at 01:53:44PM -0700, Jens Axboe wrote:
>> Turns out that wasn't what patch 2 was. And the code is right there
>> above as well, and under the q_lock, so I guess that race doesn't
>> exist.
>>
>> But that does bring up the question of whether we should always be doing
>> the nvme_process_cq(nvmeq) after IO submission. For direct/hipri IO,
>> maybe it's better to make the submission path faster and skip it?
> 
> Yes, I am okay to remove the opportunistic nvme_process_cq in the
> submission path. Even under deeply queued IO, I've not seen this provide
> any measurable benefit.

I haven't either, but curious if others had. It's mostly just extra
overhead, I haven't seen it provide a latency reduction of any kind.

-- 
Jens Axboe


* Re: [PATCH 1/3] nvme/pci: Start request after doorbell ring
  2017-12-21 20:53     ` Jens Axboe
@ 2017-12-21 21:02       ` Keith Busch
  2017-12-21 21:01         ` Jens Axboe
  2017-12-25 10:12         ` Sagi Grimberg
  0 siblings, 2 replies; 34+ messages in thread
From: Keith Busch @ 2017-12-21 21:02 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-nvme, linux-block, Christoph Hellwig, Sagi Grimberg

On Thu, Dec 21, 2017 at 01:53:44PM -0700, Jens Axboe wrote:
> Turns out that wasn't what patch 2 was. And the code is right there
> above as well, and under the q_lock, so I guess that race doesn't
> exist.
> 
> But that does bring up the fact if we should always be doing the
> nvme_process_cq(nvmeq) after IO submission. For direct/hipri IO,
> maybe it's better to make the submission path faster and skip it?

Yes, I am okay to remove the opportunistic nvme_process_cq in the
submission path. Even under deeply queued IO, I've not seen this provide
any measurable benefit.


* Re: [PATCH 3/3] block: Polling completion performance optimization
  2017-12-21 21:00     ` Jens Axboe
@ 2017-12-21 21:34       ` Keith Busch
  2017-12-21 22:17         ` Jens Axboe
  2017-12-29  9:50         ` Christoph Hellwig
  0 siblings, 2 replies; 34+ messages in thread
From: Keith Busch @ 2017-12-21 21:34 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Scott Bauer, linux-nvme, linux-block, Christoph Hellwig,
	Sagi Grimberg

On Thu, Dec 21, 2017 at 02:00:04PM -0700, Jens Axboe wrote:
> On 12/21/17 1:56 PM, Scott Bauer wrote:
> > On 12/21/2017 01:46 PM, Keith Busch wrote:
> >> @@ -181,7 +181,10 @@ static void blkdev_bio_end_io_simple(struct bio *bio)
> >>  	struct task_struct *waiter = bio->bi_private;
> >>  
> >>  	WRITE_ONCE(bio->bi_private, NULL);
> >> -	wake_up_process(waiter);
> >> +	if (current != waiter)
> >> +		wake_up_process(waiter);
> >> +	else
> >> +		__set_current_state(TASK_RUNNING);
> > 
> > Do we actually need to set this to TASK_RUNNING? If we get here we're already running, right?
> > 
> > Everywhere I see uses of __set_current_state(TASK_RUNNING) it's after we've done a set_current_state(TASK_INTERRUPTIBLE).
> 
> We'd only be TASK_RUNNING if the IRQ got to it first. And that's something that
> should be removed as well - I suspect that'd be a much bigger win, getting rid
> of the IRQ trigger for polled IO, than most of the micro optimizations. For
> Keith's testing, looks like he reduced the cost by turning on coalescing, but
> it'd be cheaper (and better) to not have to rely on that.

It would be nice, but the driver doesn't know a request's completion
is going to be polled. Even if it did, we don't have a spec-defined
way to tell the controller not to send an interrupt with this command's
completion, which would be negated anyway if any interrupt-driven IO
is mixed in the same queue. We could possibly create a special queue
with interrupts disabled for this purpose if we can pass the HIPRI hint
through the request.


* Re: [PATCH 3/3] block: Polling completion performance optimization
  2017-12-21 21:34       ` Keith Busch
@ 2017-12-21 22:17         ` Jens Axboe
  2017-12-21 23:10           ` Keith Busch
  2017-12-29  9:50         ` Christoph Hellwig
  1 sibling, 1 reply; 34+ messages in thread
From: Jens Axboe @ 2017-12-21 22:17 UTC (permalink / raw)
  To: Keith Busch
  Cc: Scott Bauer, linux-nvme, linux-block, Christoph Hellwig,
	Sagi Grimberg

On 12/21/17 2:34 PM, Keith Busch wrote:
> On Thu, Dec 21, 2017 at 02:00:04PM -0700, Jens Axboe wrote:
>> On 12/21/17 1:56 PM, Scott Bauer wrote:
>>> On 12/21/2017 01:46 PM, Keith Busch wrote:
>>>> @@ -181,7 +181,10 @@ static void blkdev_bio_end_io_simple(struct bio *bio)
>>>>  	struct task_struct *waiter = bio->bi_private;
>>>>  
>>>>  	WRITE_ONCE(bio->bi_private, NULL);
>>>> -	wake_up_process(waiter);
>>>> +	if (current != waiter)
>>>> +		wake_up_process(waiter);
>>>> +	else
>>>> +		__set_current_state(TASK_RUNNING);
>>>
>>> Do we actually need to set this to TASK_RUNNING? If we get here we're already running, right?
>>>
>>> Everywhere I see uses of __set_current_state(TASK_RUNNING) it's after we've done a set_current_state(TASK_INTERRUPTIBLE).
>>
>> We'd only be TASK_RUNNING if the IRQ got to it first. And that's something that
>> should be removed as well - I suspect that'd be a much bigger win, getting rid
>> of the IRQ trigger for polled IO, than most of the micro optimizations. For
>> Keith's testing, looks like he reduced the cost by turning on coalescing, but
>> it'd be cheaper (and better) to not have to rely on that.
> 
> It would be nice, but the driver doesn't know a request's completion
> is going to be polled.

That's trivially solvable though, since the information is available
at submission time.

> Even if it did, we don't have a spec-defined
> way to tell the controller not to send an interrupt with this command's
> completion, which would be negated anyway if any interrupt-driven IO
> is mixed in the same queue. We could possibly create a special queue
> with interrupts disabled for this purpose if we can pass the HIPRI hint
> through the request.

There's no way to do it per IO, right. But you can create an sq/cq pair
without interrupts enabled. This would also allow you to scale better
with multiple users of polling, a case where we currently don't
perform as well as spdk, for instance.

-- 
Jens Axboe


* Re: [PATCH 3/3] block: Polling completion performance optimization
  2017-12-21 22:17         ` Jens Axboe
@ 2017-12-21 23:10           ` Keith Busch
  2017-12-22 15:40             ` Jens Axboe
  0 siblings, 1 reply; 34+ messages in thread
From: Keith Busch @ 2017-12-21 23:10 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Scott Bauer, linux-nvme, linux-block, Christoph Hellwig,
	Sagi Grimberg

On Thu, Dec 21, 2017 at 03:17:41PM -0700, Jens Axboe wrote:
> On 12/21/17 2:34 PM, Keith Busch wrote:
> > It would be nice, but the driver doesn't know a request's completion
> > is going to be polled.
> 
> That's trivially solvable though, since the information is available
> at submission time.
> 
> > Even if it did, we don't have a spec-defined
> > way to tell the controller not to send an interrupt with this command's
> > completion, which would be negated anyway if any interrupt-driven IO
> > is mixed in the same queue. We could possibly create a special queue
> > with interrupts disabled for this purpose if we can pass the HIPRI hint
> > through the request.
> 
> There's no way to do it per IO, right. But you can create an sq/cq pair
> without interrupts enabled. This would also allow you to scale better
> with multiple users of polling, a case where we currently don't
> perform as well as spdk, for instance.

Would you be open to having blk-mq provide special hi-pri hardware contexts
for all these requests to come through? Maybe one per NUMA node? If not,
I don't think we have enough unused bits in the NVMe command id to stash
the hctx id needed to extract the original request.


* Re: [PATCH 3/3] block: Polling completion performance optimization
  2017-12-21 23:10           ` Keith Busch
@ 2017-12-22 15:40             ` Jens Axboe
  0 siblings, 0 replies; 34+ messages in thread
From: Jens Axboe @ 2017-12-22 15:40 UTC (permalink / raw)
  To: Keith Busch
  Cc: Scott Bauer, linux-nvme, linux-block, Christoph Hellwig,
	Sagi Grimberg

On 12/21/17 4:10 PM, Keith Busch wrote:
> On Thu, Dec 21, 2017 at 03:17:41PM -0700, Jens Axboe wrote:
>> On 12/21/17 2:34 PM, Keith Busch wrote:
>>> It would be nice, but the driver doesn't know a request's completion
>>> is going to be polled.
>>
>> That's trivially solvable though, since the information is available
>> at submission time.
>>
>>> Even if it did, we don't have a spec-defined
>>> way to tell the controller not to send an interrupt with this command's
>>> completion, which would be negated anyway if any interrupt-driven IO
>>> is mixed in the same queue. We could possibly create a special queue
>>> with interrupts disabled for this purpose if we can pass the HIPRI hint
>>> through the request.
>>
>> There's no way to do it per IO, right. But you can create an sq/cq pair
>> without interrupts enabled. This would also allow you to scale better
>> with multiple users of polling, a case where we currently don't
>> perform as well as spdk, for instance.
> 
> Would you be open to having blk-mq provide special hi-pri hardware contexts
> for all these requests to come through? Maybe one per NUMA node? If not,
> I don't think we have enough unused bits in the NVMe command id to stash
> the hctx id needed to extract the original request.

Yeah, in fact I think we HAVE to do it this way. I've been thinking about
this for a while, and ideally I'd really like blk-mq to support multiple
queue "pools". It's basically just a mapping thing. Right now you hand
blk-mq all your queues, and the mappings are defined for one set of
queues. It'd be nifty to support multiple sets, so we could do things
like "reserve X for polling", for example, and just have the mappings
magically work. blk_mq_map_queue() then just needs to take the bio
or request (or just cmd flags) to be able to decide what set the
request belongs to, making the mapping a function of {cpu,type}.

I originally played with this in the context of isolating writes on
a single queue, to reduce the amount of resources they can grab. And
it'd work nicely on this as well.

Completions could be configurable to where the submitter would do it
(like now, best for sync single thread), or to where you have a/more
kernel threads doing it (spdk'ish, best for high qd / thread count).

-- 
Jens Axboe


* Re: [PATCH 2/3] nvme/pci: Remove cq_vector check in IO path
  2017-12-21 20:46 ` [PATCH 2/3] nvme/pci: Remove cq_vector check in IO path Keith Busch
  2017-12-21 20:54   ` Jens Axboe
@ 2017-12-25 10:10   ` Sagi Grimberg
  2017-12-27 21:01     ` Sagi Grimberg
  1 sibling, 1 reply; 34+ messages in thread
From: Sagi Grimberg @ 2017-12-25 10:10 UTC (permalink / raw)
  To: Keith Busch, linux-nvme, linux-block, Christoph Hellwig,
	Jens Axboe

Awesome!

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>


* Re: [PATCH 1/3] nvme/pci: Start request after doorbell ring
  2017-12-21 20:46 ` [PATCH 1/3] nvme/pci: Start request after doorbell ring Keith Busch
  2017-12-21 20:49   ` Jens Axboe
@ 2017-12-25 10:11   ` Sagi Grimberg
  2017-12-26 20:35     ` Keith Busch
  2017-12-29  9:44   ` Christoph Hellwig
  2 siblings, 1 reply; 34+ messages in thread
From: Sagi Grimberg @ 2017-12-25 10:11 UTC (permalink / raw)
  To: Keith Busch, linux-nvme, linux-block, Christoph Hellwig,
	Jens Axboe


> This is a performance optimization that allows the hardware to work on
> a command in parallel with the kernel's stats and timeout tracking.

Curious to know what this buys us?


* Re: [PATCH 1/3] nvme/pci: Start request after doorbell ring
  2017-12-21 21:02       ` Keith Busch
  2017-12-21 21:01         ` Jens Axboe
@ 2017-12-25 10:12         ` Sagi Grimberg
  2017-12-29  9:44           ` Christoph Hellwig
  1 sibling, 1 reply; 34+ messages in thread
From: Sagi Grimberg @ 2017-12-25 10:12 UTC (permalink / raw)
  To: Keith Busch, Jens Axboe; +Cc: linux-nvme, linux-block, Christoph Hellwig


>> But that does bring up the question of whether we should always be doing
>> the nvme_process_cq(nvmeq) after IO submission. For direct/hipri IO,
>> maybe it's better to make the submission path faster and skip it?
> 
> Yes, I am okay to remove the opportunistic nvme_process_cq in the
> submission path. Even under deeply queued IO, I've not seen this provide
> any measurable benefit.

+100 for removing it. Never really understood why it gets us anything...


* Re: [PATCH 1/3] nvme/pci: Start request after doorbell ring
  2017-12-25 10:11   ` Sagi Grimberg
@ 2017-12-26 20:35     ` Keith Busch
  2017-12-27  9:02       ` Sagi Grimberg
  0 siblings, 1 reply; 34+ messages in thread
From: Keith Busch @ 2017-12-26 20:35 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: linux-nvme, linux-block, Christoph Hellwig, Jens Axboe

On Mon, Dec 25, 2017 at 12:11:47PM +0200, Sagi Grimberg wrote:
> > This is a performance optimization that allows the hardware to work on
> > a command in parallel with the kernel's stats and timeout tracking.
> 
> Curious to know what this buys us?

blk_mq_start_request doesn't do anything to make the command ready to
submit to hardware, so this is pure software overhead, somewhere between
200-300 nanoseconds as far as I can measure (YMMV). We can post the command
to hardware before taking on this software overhead so the hardware gets
to fetch the command that much sooner.

One thing to remember is we need to ensure the driver can't complete the
request before starting it. The driver's existing locking does ensure
that with this patch, so it's just something to keep in mind if anything
should ever change in the future.


* Re: [PATCH 1/3] nvme/pci: Start request after doorbell ring
  2017-12-26 20:35     ` Keith Busch
@ 2017-12-27  9:02       ` Sagi Grimberg
  0 siblings, 0 replies; 34+ messages in thread
From: Sagi Grimberg @ 2017-12-27  9:02 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-nvme, linux-block, Christoph Hellwig, Jens Axboe


>> Curious to know what this buys us?
> 
> blk_mq_start_request doesn't do anything to make the command ready to
> submit to hardware, so this is pure software overhead, somewhere between
> 200-300 nanoseconds as far as I can measure (YMMV). We can post the command
> to hardware before taking on this software overhead so the hardware gets
> to fetch the command that much sooner.
> 
> One thing to remember is we need to ensure the driver can't complete the
> request before starting it. The driver's existing locking does ensure
> that with this patch, so it's just something to keep in mind if anything
> should ever change in the future.

Needs to be well documented.

This means we won't be able to split the submission and completion locks
now... So it does present a tradeoff.

* Re: [PATCH 2/3] nvme/pci: Remove cq_vector check in IO path
  2017-12-25 10:10   ` Sagi Grimberg
@ 2017-12-27 21:01     ` Sagi Grimberg
  2017-12-29  9:48       ` Christoph Hellwig
  0 siblings, 1 reply; 34+ messages in thread
From: Sagi Grimberg @ 2017-12-27 21:01 UTC (permalink / raw)
  To: Keith Busch, linux-nvme, linux-block, Christoph Hellwig,
	Jens Axboe


> Awesome!
> 
> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

Wait, don't we also unquiesce the queues in nvme_dev_disable?

Doesn't that rely on the queues being suspended so that queue_rq
will fail there?

* Re: [PATCH 1/3] nvme/pci: Start request after doorbell ring
  2017-12-21 20:46 ` [PATCH 1/3] nvme/pci: Start request after doorbell ring Keith Busch
  2017-12-21 20:49   ` Jens Axboe
  2017-12-25 10:11   ` Sagi Grimberg
@ 2017-12-29  9:44   ` Christoph Hellwig
  2 siblings, 0 replies; 34+ messages in thread
From: Christoph Hellwig @ 2017-12-29  9:44 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-nvme, linux-block, Christoph Hellwig, Jens Axboe,
	Sagi Grimberg

Looks fine, although I agree a comment on the q_lock exclusion would
be useful.

* Re: [PATCH 1/3] nvme/pci: Start request after doorbell ring
  2017-12-25 10:12         ` Sagi Grimberg
@ 2017-12-29  9:44           ` Christoph Hellwig
  0 siblings, 0 replies; 34+ messages in thread
From: Christoph Hellwig @ 2017-12-29  9:44 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Keith Busch, Jens Axboe, linux-nvme, linux-block,
	Christoph Hellwig

On Mon, Dec 25, 2017 at 12:12:52PM +0200, Sagi Grimberg wrote:
>
>>> But that does bring up the fact if we should always be doing the
>>> nvme_process_cq(nvmeq) after IO submission. For direct/hipri IO,
>>> maybe it's better to make the submission path faster and skip it?
>>
>> Yes, I am okay with removing the opportunistic nvme_process_cq in the
>> submission path. Even under deeply queued IO, I've not seen this provide
>> any measurable benefit.
>
> +100 for removing it. Never really understood what it gets us anything...

Yes, we should get rid of it.

* Re: [PATCH 2/3] nvme/pci: Remove cq_vector check in IO path
  2017-12-27 21:01     ` Sagi Grimberg
@ 2017-12-29  9:48       ` Christoph Hellwig
  2017-12-29 15:39         ` Keith Busch
  0 siblings, 1 reply; 34+ messages in thread
From: Christoph Hellwig @ 2017-12-29  9:48 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Keith Busch, linux-nvme, linux-block, Christoph Hellwig,
	Jens Axboe

On Wed, Dec 27, 2017 at 11:01:48PM +0200, Sagi Grimberg wrote:
>
>> Awesome!
>>
>> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
>
> Wait, don't we also unquiesce the queues in nvme_dev_disable?
>
> Doesn't that rely on the queues being suspended so that queue_rq
> will fail there?

We don't seem to have any other check as far as I can tell.

* Re: [PATCH 3/3] block: Polling completion performance optimization
  2017-12-21 21:34       ` Keith Busch
  2017-12-21 22:17         ` Jens Axboe
@ 2017-12-29  9:50         ` Christoph Hellwig
  2017-12-29 15:51           ` Keith Busch
  2017-12-31 12:48           ` Sagi Grimberg
  1 sibling, 2 replies; 34+ messages in thread
From: Christoph Hellwig @ 2017-12-29  9:50 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, Scott Bauer, linux-nvme, linux-block,
	Christoph Hellwig, Sagi Grimberg

On Thu, Dec 21, 2017 at 02:34:00PM -0700, Keith Busch wrote:
> It would be nice, but the driver doesn't know a request's completion
> is going to be polled.

We can trivially set a REQ_POLL bit.  In fact my initial patch set had
those at the insistence of Jens, but then I removed it because we had
no users for it.

> Even if it did, we don't have a spec-defined
> way to tell the controller not to send an interrupt with this command's
> completion, which would be negated anyway if any interrupt-driven IO
> is mixed into the same queue.

Time to add such a flag to the spec then..

> We could possibly create a special queue
> with interrupts disabled for this purpose if we can pass the HIPRI hint
> through the request.

Eww.  That means we double our resource usage both in the host and
the device, which isn't going to be pretty.

* Re: [PATCH 3/3] block: Polling completion performance optimization
  2017-12-21 20:46 ` [PATCH 3/3] block: Polling completion performance optimization Keith Busch
  2017-12-21 20:56   ` Scott Bauer
  2017-12-21 20:57   ` Jens Axboe
@ 2017-12-29  9:51   ` Christoph Hellwig
  2 siblings, 0 replies; 34+ messages in thread
From: Christoph Hellwig @ 2017-12-29  9:51 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-nvme, linux-block, Christoph Hellwig, Jens Axboe,
	Sagi Grimberg

This looks fine, but we also need the same in blkdev_bio_end_io and
iomap_dio_bio_end_io.  Probably best to add a little helper.

* Re: [PATCH 2/3] nvme/pci: Remove cq_vector check in IO path
  2017-12-29  9:48       ` Christoph Hellwig
@ 2017-12-29 15:39         ` Keith Busch
  2017-12-31 12:30           ` Sagi Grimberg
  0 siblings, 1 reply; 34+ messages in thread
From: Keith Busch @ 2017-12-29 15:39 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Sagi Grimberg, linux-nvme, linux-block, Jens Axboe

On Fri, Dec 29, 2017 at 10:48:14AM +0100, Christoph Hellwig wrote:
> On Wed, Dec 27, 2017 at 11:01:48PM +0200, Sagi Grimberg wrote:
> >
> >> Awesome!
> >>
> >> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
> >
> > Wait, don't we also unquiesce the queues in nvme_dev_disable?
> >
> > Doesn't that rely on the queues being suspended so that queue_rq
> > will fail there?
> 
> We don't seem to have any other check as far as I can tell.

Ok, we currently do need this check on the submission side to flush
entered requests on invalidated hw contexts. We didn't have a way to
back out entered requests before, so that's why this request-killing
check still exists. We can steal bios to back out requests now, so I
think we should do that instead of failing them.

I'll rework the series a bit to make that possible, plus add the
removal of the submission side nvme_process_cq, and resend.

* Re: [PATCH 3/3] block: Polling completion performance optimization
  2017-12-29  9:50         ` Christoph Hellwig
@ 2017-12-29 15:51           ` Keith Busch
  2017-12-31 12:48           ` Sagi Grimberg
  1 sibling, 0 replies; 34+ messages in thread
From: Keith Busch @ 2017-12-29 15:51 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Scott Bauer, linux-nvme, linux-block, Sagi Grimberg

On Fri, Dec 29, 2017 at 10:50:21AM +0100, Christoph Hellwig wrote:
> > We could possibly create a special queue
> > with interrupts disabled for this purpose if we can pass the HIPRI hint
> > through the request.
> 
> Eww.  That means we double our resource usage both in the host and
> the device, which isn't going to be pretty.

Yes, this would create some trouble for our simple policy for divvying
queue resources. We wouldn't be able to use the PCI IRQ affinity for
these queues' CPU maps, for example.

Maybe if we can figure out a good policy for WRR queues, that might
present a similar infrastructure for supporting polled queues as well.

* Re: [PATCH 2/3] nvme/pci: Remove cq_vector check in IO path
  2017-12-29 15:39         ` Keith Busch
@ 2017-12-31 12:30           ` Sagi Grimberg
  2018-01-02 16:50             ` Keith Busch
  0 siblings, 1 reply; 34+ messages in thread
From: Sagi Grimberg @ 2017-12-31 12:30 UTC (permalink / raw)
  To: Keith Busch, Christoph Hellwig; +Cc: linux-nvme, linux-block, Jens Axboe


>>>> Awesome!
>>>>
>>>> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
>>>
>>> Wait, don't we also unquiesce the queues in nvme_dev_disable?
>>>
>>> Doesn't that rely on the queues being suspended so that queue_rq
>>> will fail there?
>>
>> We don't seem to have any other check as far as I can tell.
> 
> Ok, we currently do need this check on the submission side to flush
> entered requests on invalidated hw contexts. We didn't have a way to
> back out entered requests before, so that's why this request-killing
> check still exists. We can steal bios to back out requests now, so I
> think we should do that instead of failing them.
> 
> I'll rework the series a bit to make that possible, plus add the
> removal of the submission side nvme_process_cq, and resend.

Not sure if stealing bios from requests is a better design. Note that
we do exactly this in the other transports (nvme_[rdma|loop|fc]_is_ready).

I think it would be better to stick to coherent behavior across
the nvme subsystem. If this conditional is really buying us a
measurable performance gain, I think we should apply it to the
other transports as well (although in fabrics we're a bit
different because we have a dedicated connect that enters .queue_rq).

* Re: [PATCH 3/3] block: Polling completion performance optimization
  2017-12-29  9:50         ` Christoph Hellwig
  2017-12-29 15:51           ` Keith Busch
@ 2017-12-31 12:48           ` Sagi Grimberg
  1 sibling, 0 replies; 34+ messages in thread
From: Sagi Grimberg @ 2017-12-31 12:48 UTC (permalink / raw)
  To: Christoph Hellwig, Keith Busch
  Cc: Jens Axboe, Scott Bauer, linux-nvme, linux-block



On 12/29/2017 11:50 AM, Christoph Hellwig wrote:
> On Thu, Dec 21, 2017 at 02:34:00PM -0700, Keith Busch wrote:
>> It would be nice, but the driver doesn't know a request's completion
>> is going to be polled.
> 
> We can trivially set a REQ_POLL bit.  In fact my initial patch set had
> those at the insistence of Jens, but then I removed it because we had
> no users for it.
> 
>> Even if it did, we don't have a spec-defined
>> way to tell the controller not to send an interrupt with this command's
>> completion, which would be negated anyway if any interrupt-driven IO
>> is mixed into the same queue.
> 
> Time to add such a flag to the spec then..

That would be very useful. Ideally we could also hook it into libaio
to submit without triggering an interrupt and have io_getevents poll
the underlying bdev (blk_poll), similar to how the net stack implements
low-latency sockets [1].

Having the ability to suppress interrupts and poll for completions would
be very useful for file servers or targets living in userspace.

[1] https://lwn.net/Articles/551284/

* Re: [PATCH 2/3] nvme/pci: Remove cq_vector check in IO path
  2017-12-31 12:30           ` Sagi Grimberg
@ 2018-01-02 16:50             ` Keith Busch
  0 siblings, 0 replies; 34+ messages in thread
From: Keith Busch @ 2018-01-02 16:50 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: Christoph Hellwig, linux-nvme, linux-block, Jens Axboe

On Sun, Dec 31, 2017 at 02:30:09PM +0200, Sagi Grimberg wrote:
> Not sure if stealing bios from requests is a better design. Note that
> we do exactly this in other transport (nvme_[rdma|loop|fc]_is_ready).

Well, we're currently failing requests that might succeed if we could
back them out for re-entry. While such scenarios are uncommon, I think
we can handle them better than ending them in failure.
 
> I think it would be better to stick to a coherent behavior across
> the nvme subsystem. If this condition statement is really something
> that is buying us measurable performance gain, I think we should apply
> it for other transports as well (although in fabrics we're a bit
> different because we have a dedicated connect that enters .queue_rq)

We should be able to remove all the state checks in the IO path because
the handlers putting the controller in an IO-incapable state should be
able to quiesce the queues before transitioning to that state.

This is not a significant gain compared to the other two patches in this
series, but the sum of all the little things become meaningful.

We'd need to remove this check anyway if we're considering exclusively
polled queues since they wouldn't have a CQ vector.

* Re: [PATCH 1/3] nvme/pci: Start request after doorbell ring
  2017-12-21 21:01         ` Jens Axboe
@ 2018-01-03 20:21           ` Keith Busch
  2018-01-23  0:16             ` Keith Busch
  0 siblings, 1 reply; 34+ messages in thread
From: Keith Busch @ 2018-01-03 20:21 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-nvme, linux-block, Christoph Hellwig, Sagi Grimberg

On Thu, Dec 21, 2017 at 02:01:44PM -0700, Jens Axboe wrote:
> I haven't either, but curious if others had. It's mostly just extra
> overhead, I haven't seen it provide a latency reduction of any kind.

I've removed the submission-side poll in a local build, and amazingly I
am observing a not-insignificant increase in latency without it on some
workloads with certain hardware. I will have to delay recommending/posting
removal of this pending further investigation.

* Re: [PATCH 1/3] nvme/pci: Start request after doorbell ring
  2018-01-03 20:21           ` Keith Busch
@ 2018-01-23  0:16             ` Keith Busch
  0 siblings, 0 replies; 34+ messages in thread
From: Keith Busch @ 2018-01-23  0:16 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-nvme, linux-block, Christoph Hellwig, Sagi Grimberg

On Wed, Jan 03, 2018 at 01:21:05PM -0700, Keith Busch wrote:
> 
> I've removed the submission-side poll in a local build, and amazingly I
> am observing a not-insignificant increase in latency without it on some
> workloads with certain hardware. I will have to delay recommending/posting
> removal of this pending further investigation.

This took longer to narrow down than I hoped. I had been getting very wild
results, and there were two things to blame: a kernel irq issue[*] and a
platform BIOS causing a lot of CPU thermal frequency throttling.

Now that's sorted out, I am back to tuning this, and the opportunistic
nvme submission-side polling once again does not produce a measurable
difference. My world is sane again. Will try to send an update for
consideration this week after some more testing.

 * https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit?id=a0c9259dc4

end of thread, other threads:[~2018-01-23  0:16 UTC | newest]

Thread overview: 34+ messages
2017-12-21 20:46 [PATCH 0/3] Performance enhancements Keith Busch
2017-12-21 20:46 ` [PATCH 1/3] nvme/pci: Start request after doorbell ring Keith Busch
2017-12-21 20:49   ` Jens Axboe
2017-12-21 20:53     ` Jens Axboe
2017-12-21 21:02       ` Keith Busch
2017-12-21 21:01         ` Jens Axboe
2018-01-03 20:21           ` Keith Busch
2018-01-23  0:16             ` Keith Busch
2017-12-25 10:12         ` Sagi Grimberg
2017-12-29  9:44           ` Christoph Hellwig
2017-12-25 10:11   ` Sagi Grimberg
2017-12-26 20:35     ` Keith Busch
2017-12-27  9:02       ` Sagi Grimberg
2017-12-29  9:44   ` Christoph Hellwig
2017-12-21 20:46 ` [PATCH 2/3] nvme/pci: Remove cq_vector check in IO path Keith Busch
2017-12-21 20:54   ` Jens Axboe
2017-12-25 10:10   ` Sagi Grimberg
2017-12-27 21:01     ` Sagi Grimberg
2017-12-29  9:48       ` Christoph Hellwig
2017-12-29 15:39         ` Keith Busch
2017-12-31 12:30           ` Sagi Grimberg
2018-01-02 16:50             ` Keith Busch
2017-12-21 20:46 ` [PATCH 3/3] block: Polling completion performance optimization Keith Busch
2017-12-21 20:56   ` Scott Bauer
2017-12-21 21:00     ` Jens Axboe
2017-12-21 21:34       ` Keith Busch
2017-12-21 22:17         ` Jens Axboe
2017-12-21 23:10           ` Keith Busch
2017-12-22 15:40             ` Jens Axboe
2017-12-29  9:50         ` Christoph Hellwig
2017-12-29 15:51           ` Keith Busch
2017-12-31 12:48           ` Sagi Grimberg
2017-12-21 20:57   ` Jens Axboe
2017-12-29  9:51   ` Christoph Hellwig
