[PATCH] NVMe: Remove superfluous cqe

Linux-NVME Archive on lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH] NVMe: Remove superfluous cqe_seen
@ 2014-05-21 18:14 Sam Bradshaw
  2014-05-21 23:17 ` Matthew Wilcox
  0 siblings, 1 reply; 5+ messages in thread
From: Sam Bradshaw @ 2014-05-21 18:14 UTC (permalink / raw)


cqe_seen is redundant with the return value from nvme_process_cq().
Remove it.

Signed-off-by: Sam Bradshaw <sbradshaw at micron.com>
---
diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
index cd8a8bc..c2287ac 100644
--- a/drivers/block/nvme-core.c
+++ b/drivers/block/nvme-core.c
@@ -102,7 +102,6 @@ struct nvme_queue {
 	u16 cq_head;
 	u16 qid;
 	u8 cq_phase;
-	u8 cqe_seen;
 	u8 q_suspended;
 	cpumask_var_t cpu_mask;
 	struct async_cmd_info cmdinfo;
@@ -790,7 +789,6 @@ static int nvme_process_cq(struct nvme_queue *nvmeq)
 	nvmeq->cq_head = head;
 	nvmeq->cq_phase = phase;
 
-	nvmeq->cqe_seen = 1;
 	return 1;
 }
 
@@ -826,8 +824,7 @@ static irqreturn_t nvme_irq(int irq, void *data)
 	struct nvme_queue *nvmeq = data;
 	spin_lock(&nvmeq->q_lock);
 	nvme_process_cq(nvmeq);
-	result = nvmeq->cqe_seen ? IRQ_HANDLED : IRQ_NONE;
-	nvmeq->cqe_seen = 0;
+	result = nvme_process_cq(nvmeq) ? IRQ_HANDLED : IRQ_NONE;
 	spin_unlock(&nvmeq->q_lock);
 	return result;
 }

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* [PATCH] NVMe: Remove superfluous cqe_seen
  2014-05-21 18:14 [PATCH] NVMe: Remove superfluous cqe_seen Sam Bradshaw
@ 2014-05-21 23:17 ` Matthew Wilcox
  2014-05-22  0:10   ` Sam Bradshaw (sbradshaw)
  0 siblings, 1 reply; 5+ messages in thread
From: Matthew Wilcox @ 2014-05-21 23:17 UTC (permalink / raw)

On Wed, May 21, 2014@11:14:25AM -0700, Sam Bradshaw wrote:
> cqe_seen is redundant with the return value from nvme_process_cq().
> Remove it.

Ah, but it isn't.  Look at commit e9539f47525.  The purpose of cqe_seen
is to be 'sticky', that we *have* indeed processed something on this CQ
since the last interrupt, and so if we get an interrupt, the device may
have legitimately interrupted us.

A more interesting patch might be to implement some kind of autotuning of
the interrupt coalescing -- if we're seeing too many times tha the queue
has no work on it, then we should increase the time or queue depth used.
I'm not sure what event would cause us to *reduce* the time or queue
depth, so I fear we'd end up set to a suboptimal value over time.

You could argue that the return value from nvme_process_cq() is now
never used, and that would be correct, but I have a patch in the works
that will use it again, so I don't think it's worth taking out.

Was there a performance problem that this solves, or did you just notice
it, and send a cleanup?

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH] NVMe: Remove superfluous cqe_seen
  2014-05-21 23:17 ` Matthew Wilcox
@ 2014-05-22  0:10   ` Sam Bradshaw (sbradshaw)
  2014-05-23 14:55     ` Matthew Wilcox
  2014-06-19 16:59     ` Matthew Wilcox
  0 siblings, 2 replies; 5+ messages in thread
From: Sam Bradshaw (sbradshaw) @ 2014-05-22  0:10 UTC (permalink / raw)




> -----Original Message-----
> From: Matthew Wilcox [mailto:willy at linux.intel.com]
> Sent: Wednesday, May 21, 2014 4:17 PM
> To: Sam Bradshaw (sbradshaw)
> Cc: linux-nvme at lists.infradead.org
> Subject: Re: [PATCH] NVMe: Remove superfluous cqe_seen
> 
> On Wed, May 21, 2014@11:14:25AM -0700, Sam Bradshaw wrote:
> > cqe_seen is redundant with the return value from nvme_process_cq().
> > Remove it.
> 
> Ah, but it isn't.  Look at commit e9539f47525.  The purpose of cqe_seen
> is to be 'sticky', that we *have* indeed processed something on this CQ
> since the last interrupt, and so if we get an interrupt, the device may
> have legitimately interrupted us.
> 
> A more interesting patch might be to implement some kind of autotuning
> of
> the interrupt coalescing -- if we're seeing too many times tha the
> queue
> has no work on it, then we should increase the time or queue depth
> used.
> I'm not sure what event would cause us to *reduce* the time or queue
> depth, so I fear we'd end up set to a suboptimal value over time.
> 
> You could argue that the return value from nvme_process_cq() is now
> never used, and that would be correct, but I have a patch in the works
> that will use it again, so I don't think it's worth taking out.
> 
> Was there a performance problem that this solves, or did you just
> notice
> it, and send a cleanup?

Performance problem, though not very easily measured.  At very high iops
rates, most if not all cqe's are processed via nvme_process_cq() in 
make_request(), leaving nvme_irq() with no work to do.  Nevertheless, it
always writes cqe_seen, which invalidates a very hot cacheline.  This
is somewhat exacerbated when IO submissions originate on a remote node
relative to the cpu handling the irq.

Perhaps a simpler patch where cqe_seen is only written if set? 

-Sam

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH] NVMe: Remove superfluous cqe_seen
  2014-05-22  0:10   ` Sam Bradshaw (sbradshaw)
@ 2014-05-23 14:55     ` Matthew Wilcox
  2014-06-19 16:59     ` Matthew Wilcox
  1 sibling, 0 replies; 5+ messages in thread
From: Matthew Wilcox @ 2014-05-23 14:55 UTC (permalink / raw)


On Thu, May 22, 2014@12:10:19AM +0000, Sam Bradshaw (sbradshaw) wrote:
> Performance problem, though not very easily measured.  At very high iops
> rates, most if not all cqe's are processed via nvme_process_cq() in 
> make_request(), leaving nvme_irq() with no work to do.  Nevertheless, it
> always writes cqe_seen, which invalidates a very hot cacheline.  This
> is somewhat exacerbated when IO submissions originate on a remote node
> relative to the cpu handling the irq.
> 
> Perhaps a simpler patch where cqe_seen is only written if set? 

Sure, we could do that, although in the scenario you're talking about,
cqe_seen would almost always get written.  I think your workload would
perhaps benefit from tuning the interrupt coalescing settings.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH] NVMe: Remove superfluous cqe_seen
  2014-05-22  0:10   ` Sam Bradshaw (sbradshaw)
  2014-05-23 14:55     ` Matthew Wilcox
@ 2014-06-19 16:59     ` Matthew Wilcox
  1 sibling, 0 replies; 5+ messages in thread
From: Matthew Wilcox @ 2014-06-19 16:59 UTC (permalink / raw)

On Thu, May 22, 2014@12:10:19AM +0000, Sam Bradshaw (sbradshaw) wrote:
> Performance problem, though not very easily measured.  At very high iops
> rates, most if not all cqe's are processed via nvme_process_cq() in 
> make_request(), leaving nvme_irq() with no work to do.  Nevertheless, it
> always writes cqe_seen, which invalidates a very hot cacheline.  This
> is somewhat exacerbated when IO submissions originate on a remote node
> relative to the cpu handling the irq.

I was thinking "Hey, we should move cqe_seen to a different cacheline".
So I looked at the cacheline assignments for the different variables,
and cqe_seen is on the same cacheline as cq_head and cq_phase, so that
cacheline is already being dirtied.  Indeed, it's in the same Dword as
cq_phase, so I'd be amazed if the CPU didn't coalesce the two writes.
That might be a more fruitful patch ... rearrange nvme_queue to put
cq_head, cq_phase and cqe_seen in the same Dword, and expect the CPU to
optimise the three assignments into a single Dword store.

I'll let you try it out since you have the setup to benchmark it.  Right now,
this is the layout I see:

        /* --- cacheline 3 boundary (192 bytes) --- */
        u32 *                      q_db;                 /*   192     8 */
        u16                        q_depth;              /*   200     2 */
        u16                        cq_vector;            /*   202     2 */
        u16                        sq_head;              /*   204     2 */
        u16                        sq_tail;              /*   206     2 */
        u16                        cq_head;              /*   208     2 */
        u16                        qid;                  /*   210     2 */
        u8                         cq_phase;             /*   212     1 */
        u8                         cqe_seen;             /*   213     1 */
        u8                         q_suspended;          /*   214     1 */

I notice a 4-byte hole after q_lock, so moving cq_head, cq_phase and
cqe_seen into that space would probably be a good idea (since that
cacheline is definitely dirty).  I really haven't tried to optimise the
frequently-updated parts of the data structure into the same cacheline,
and it should really help your bizarre setup :-).

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2014-06-19 16:59 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-05-21 18:14 [PATCH] NVMe: Remove superfluous cqe_seen Sam Bradshaw
2014-05-21 23:17 ` Matthew Wilcox
2014-05-22  0:10   ` Sam Bradshaw (sbradshaw)
2014-05-23 14:55     ` Matthew Wilcox
2014-06-19 16:59     ` Matthew Wilcox

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox