* [PATCH] Remove redundant writes to uncached sqe memory
@ 2014-05-09 20:44 Sam Bradshaw
  2014-05-09 21:14 ` Matias Bjørling
  2014-05-11  2:52 ` Matthew Wilcox
  0 siblings, 2 replies; 5+ messages in thread
From: Sam Bradshaw @ 2014-05-09 20:44 UTC (permalink / raw)


The memset to clear the SQE in nvme_submit_iod() is made partially 
redundant by subsequent writes.  This patch explicitly clears each 
SQE structure member in ascending order, eliminating the need for 
the memset.  With this change, our perf runs show ~1.5% less time
spent in the IO submission path and minor reduced q lock contention.

Signed-off-by: Sam Bradshaw <sbradshaw at micron.com>
---
diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
index cd8a8bc..a9bdcbd 100644
--- a/drivers/block/nvme-core.c
+++ b/drivers/block/nvme-core.c
@@ -655,11 +655,12 @@ static int nvme_submit_iod(struct nvme_queue *nvmeq, struct nvme_iod *iod)
 		dsmgmt |= NVME_RW_DSM_FREQ_PREFETCH;
 
 	cmnd = &nvmeq->sq_cmds[nvmeq->sq_tail];
-	memset(cmnd, 0, sizeof(*cmnd));
 
 	cmnd->rw.opcode = bio_data_dir(bio) ? nvme_cmd_write : nvme_cmd_read;
+	cmnd->rw.flags = 0;
 	cmnd->rw.command_id = cmdid;
 	cmnd->rw.nsid = cpu_to_le32(ns->ns_id);
+	cmnd->rw.rsvd2 = 0;
 	cmnd->rw.prp1 = cpu_to_le64(sg_dma_address(iod->sg));
 	cmnd->rw.prp2 = cpu_to_le64(iod->first_dma);
 	cmnd->rw.slba = cpu_to_le64(nvme_block_nr(ns, bio->bi_iter.bi_sector));
@@ -667,6 +668,9 @@ static int nvme_submit_iod(struct nvme_queue *nvmeq, struct nvme_iod *iod)
 		cpu_to_le16((bio->bi_iter.bi_size >> ns->lba_shift) - 1);
 	cmnd->rw.control = cpu_to_le16(control);
 	cmnd->rw.dsmgmt = cpu_to_le32(dsmgmt);
+	cmnd->rw.reftag = 0;
+	cmnd->rw.apptag = 0;
+	cmnd->rw.appmask = 0;
 
 	if (++nvmeq->sq_tail == nvmeq->q_depth)
 		nvmeq->sq_tail = 0;


* [PATCH] Remove redundant writes to uncached sqe memory
  2014-05-09 20:44 [PATCH] Remove redundant writes to uncached sqe memory Sam Bradshaw
@ 2014-05-09 21:14 ` Matias Bjørling
  2014-05-11  2:54   ` Matthew Wilcox
  2014-05-11  2:52 ` Matthew Wilcox
  1 sibling, 1 reply; 5+ messages in thread
From: Matias Bjørling @ 2014-05-09 21:14 UTC (permalink / raw)


On Fri, May 9, 2014 at 10:44 PM, Sam Bradshaw <sbradshaw@micron.com> wrote:
> The memset to clear the SQE in nvme_submit_iod() is made partially
> redundant by subsequent writes.  This patch explicitly clears each
> SQE structure member in ascending order, eliminating the need for
> the memset.  With this change, our perf runs show ~1.5% less time
> spent in the IO submission path and minor reduced q lock contention.
>
> Signed-off-by: Sam Bradshaw <sbradshaw at micron.com>
> ---
> diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
> index cd8a8bc..a9bdcbd 100644
> --- a/drivers/block/nvme-core.c
> +++ b/drivers/block/nvme-core.c
> @@ -655,11 +655,12 @@ static int nvme_submit_iod(struct nvme_queue *nvmeq, struct nvme_iod *iod)
>                 dsmgmt |= NVME_RW_DSM_FREQ_PREFETCH;
>
>         cmnd = &nvmeq->sq_cmds[nvmeq->sq_tail];
> -       memset(cmnd, 0, sizeof(*cmnd));
>
>         cmnd->rw.opcode = bio_data_dir(bio) ? nvme_cmd_write : nvme_cmd_read;
> +       cmnd->rw.flags = 0;
>         cmnd->rw.command_id = cmdid;
>         cmnd->rw.nsid = cpu_to_le32(ns->ns_id);
> +       cmnd->rw.rsvd2 = 0;
>         cmnd->rw.prp1 = cpu_to_le64(sg_dma_address(iod->sg));
>         cmnd->rw.prp2 = cpu_to_le64(iod->first_dma);
>         cmnd->rw.slba = cpu_to_le64(nvme_block_nr(ns, bio->bi_iter.bi_sector));
> @@ -667,6 +668,9 @@ static int nvme_submit_iod(struct nvme_queue *nvmeq, struct nvme_iod *iod)
>                 cpu_to_le16((bio->bi_iter.bi_size >> ns->lba_shift) - 1);
>         cmnd->rw.control = cpu_to_le16(control);
>         cmnd->rw.dsmgmt = cpu_to_le32(dsmgmt);
> +       cmnd->rw.reftag = 0;
> +       cmnd->rw.apptag = 0;
> +       cmnd->rw.appmask = 0;
>
>         if (++nvmeq->sq_tail == nvmeq->q_depth)
>                 nvmeq->sq_tail = 0;
>

I think a description above the declaration of struct nvme_rw_command
would be helpful. E.g. "If you modify this structure, please change
nvme_submit_iod accordingly to take care of initialization."


* [PATCH] Remove redundant writes to uncached sqe memory
  2014-05-09 20:44 [PATCH] Remove redundant writes to uncached sqe memory Sam Bradshaw
  2014-05-09 21:14 ` Matias Bjørling
@ 2014-05-11  2:52 ` Matthew Wilcox
  2014-05-12 19:41   ` Sam Bradshaw (sbradshaw)
  1 sibling, 1 reply; 5+ messages in thread
From: Matthew Wilcox @ 2014-05-11  2:52 UTC (permalink / raw)


On Fri, May 09, 2014 at 01:44:47PM -0700, Sam Bradshaw wrote:
> The memset to clear the SQE in nvme_submit_iod() is made partially 
> redundant by subsequent writes.  This patch explicitly clears each 
> SQE structure member in ascending order, eliminating the need for 
> the memset.  With this change, our perf runs show ~1.5% less time
> spent in the IO submission path and minor reduced q lock contention.

I'm shocked!  I thought that zeroing the cacheline first would be better
performing than storing into parts of the cacheline.  But I can't argue
with your numbers.  I think your patch is missing a store to the metadata
element though; care to rerun your test with that added?


* [PATCH] Remove redundant writes to uncached sqe memory
  2014-05-09 21:14 ` Matias Bjørling
@ 2014-05-11  2:54   ` Matthew Wilcox
  0 siblings, 0 replies; 5+ messages in thread
From: Matthew Wilcox @ 2014-05-11  2:54 UTC (permalink / raw)


On Fri, May 09, 2014 at 11:14:46PM +0200, Matias Bjørling wrote:
> I think a description above the declaration of struct nvme_rw_command
> would be helpful. E.g. "If you modify this structure, please change
> nvme_submit_iod accordingly to take care of initialization."

I don't think that's necessary.  There's no hidden gaps in the definition
of nvme_rw_command, so the explicit initialisation of all the fields will
write all of the bytes in the command.


* [PATCH] Remove redundant writes to uncached sqe memory
  2014-05-11  2:52 ` Matthew Wilcox
@ 2014-05-12 19:41   ` Sam Bradshaw (sbradshaw)
  0 siblings, 0 replies; 5+ messages in thread
From: Sam Bradshaw (sbradshaw) @ 2014-05-12 19:41 UTC (permalink / raw)




> -----Original Message-----
> From: Matthew Wilcox [mailto:willy at linux.intel.com]
> Sent: Saturday, May 10, 2014 7:53 PM
> To: Sam Bradshaw (sbradshaw)
> Cc: linux-nvme at lists.infradead.org
> Subject: Re: [PATCH] Remove redundant writes to uncached sqe memory
> 
> On Fri, May 09, 2014 at 01:44:47PM -0700, Sam Bradshaw wrote:
> > The memset to clear the SQE in nvme_submit_iod() is made partially
> > redundant by subsequent writes.  This patch explicitly clears each
> > SQE structure member in ascending order, eliminating the need for
> > the memset.  With this change, our perf runs show ~1.5% less time
> > spent in the IO submission path and minor reduced q lock contention.
> 
> I'm shocked!  I thought that zeroing the cacheline first would be better
> performing than storing into parts of the cacheline.  But I can't argue
> with your numbers.  I think your patch is missing a store to the metadata
> element though; care to rerun your test with that added?

Yes, thanks for pointing out the missing store to ->metadata.

Without memset:

+  35.35%  fio  nvme_process_cq 
+  10.85%  fio  nvme_make_request 
+  10.64%  fio  nvme_map_bio      
+  10.49%  fio  free_cmdid        
+   9.65%  fio  alloc_cmdid       
+   9.52%  fio  bio_completion    
+   5.18%  fio  nvme_submit_iod   
+   2.38%  fio  nvme_submit_bio_queue
+   2.26%  fio  nvme_alloc_iod       
+   2.17%  fio  nvme_setup_prps      
+   0.90%  fio  nvme_free_iod        
+   0.61%  fio  nvme_irq                                      

With memset:

+  36.24%  fio  nvme_process_cq    
+  11.04%  fio  nvme_make_request  
+  10.76%  fio  nvme_map_bio       
+   9.48%  fio  free_cmdid         
+   9.24%  fio  alloc_cmdid        
+   8.51%  fio  bio_completion     
+   6.33%  fio  nvme_submit_iod    
+   2.38%  fio  nvme_submit_bio_queue
+   2.30%  fio  nvme_alloc_iod       
+   2.18%  fio  nvme_setup_prps      
+   0.91%  fio  nvme_free_iod        
+   0.62%  fio  nvme_irq            

The numbers pretty consistently show nvme_submit_iod() taking more
cpu time with the memset.  But, to be fair, there is run-to-run
variation of +/- ~1% in cpu time for any of the bigger offenders.

The test setup is a bit odd for several (deliberate) reasons:

1) we have fewer queue pairs supported by hardware than cpu cores.
2) fio io submission threads are pegged to cores on nodes other than
the nodes owning the SQ/CQ physical memory, causing the SQE update
to cross a QPI link.
3) io submission threads concurrently run on more than one remote 
core and contend for the SQ lock.

If you're not so keen on these data as justification, I could rerun
the experiment on platforms with differing coherency models.  It may
take a bit of time, though.

-Sam


 
