* [PATCH] Remove redundant writes to uncached sqe memory
@ 2014-05-09 20:44 Sam Bradshaw
2014-05-09 21:14 ` Matias Bjørling
2014-05-11 2:52 ` Matthew Wilcox
0 siblings, 2 replies; 5+ messages in thread
From: Sam Bradshaw @ 2014-05-09 20:44 UTC (permalink / raw)
The memset to clear the SQE in nvme_submit_iod() is made partially
redundant by subsequent writes. This patch explicitly clears each
SQE structure member in ascending order, eliminating the need for
the memset. With this change, our perf runs show ~1.5% less time
spent in the IO submission path and minor reduced q lock contention.
Signed-off-by: Sam Bradshaw <sbradshaw at micron.com>
---
diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
index cd8a8bc..a9bdcbd 100644
--- a/drivers/block/nvme-core.c
+++ b/drivers/block/nvme-core.c
@@ -655,11 +655,12 @@ static int nvme_submit_iod(struct nvme_queue *nvmeq, struct nvme_iod *iod)
dsmgmt |= NVME_RW_DSM_FREQ_PREFETCH;
cmnd = &nvmeq->sq_cmds[nvmeq->sq_tail];
- memset(cmnd, 0, sizeof(*cmnd));
cmnd->rw.opcode = bio_data_dir(bio) ? nvme_cmd_write : nvme_cmd_read;
+ cmnd->rw.flags = 0;
cmnd->rw.command_id = cmdid;
cmnd->rw.nsid = cpu_to_le32(ns->ns_id);
+ cmnd->rw.rsvd2 = 0;
cmnd->rw.prp1 = cpu_to_le64(sg_dma_address(iod->sg));
cmnd->rw.prp2 = cpu_to_le64(iod->first_dma);
cmnd->rw.slba = cpu_to_le64(nvme_block_nr(ns, bio->bi_iter.bi_sector));
@@ -667,6 +668,9 @@ static int nvme_submit_iod(struct nvme_queue *nvmeq, struct nvme_iod *iod)
cpu_to_le16((bio->bi_iter.bi_size >> ns->lba_shift) - 1);
cmnd->rw.control = cpu_to_le16(control);
cmnd->rw.dsmgmt = cpu_to_le32(dsmgmt);
+ cmnd->rw.reftag = 0;
+ cmnd->rw.apptag = 0;
+ cmnd->rw.appmask = 0;
if (++nvmeq->sq_tail == nvmeq->q_depth)
nvmeq->sq_tail = 0;
* [PATCH] Remove redundant writes to uncached sqe memory
2014-05-09 20:44 [PATCH] Remove redundant writes to uncached sqe memory Sam Bradshaw
@ 2014-05-09 21:14 ` Matias Bjørling
2014-05-11 2:54 ` Matthew Wilcox
2014-05-11 2:52 ` Matthew Wilcox
1 sibling, 1 reply; 5+ messages in thread
From: Matias Bjørling @ 2014-05-09 21:14 UTC (permalink / raw)
On Fri, May 9, 2014 at 10:44 PM, Sam Bradshaw <sbradshaw@micron.com> wrote:
> The memset to clear the SQE in nvme_submit_iod() is made partially
> redundant by subsequent writes. This patch explicitly clears each
> SQE structure member in ascending order, eliminating the need for
> the memset. With this change, our perf runs show ~1.5% less time
> spent in the IO submission path and minor reduced q lock contention.
>
> Signed-off-by: Sam Bradshaw <sbradshaw at micron.com>
> ---
> diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
> index cd8a8bc..a9bdcbd 100644
> --- a/drivers/block/nvme-core.c
> +++ b/drivers/block/nvme-core.c
> @@ -655,11 +655,12 @@ static int nvme_submit_iod(struct nvme_queue *nvmeq, struct nvme_iod *iod)
> dsmgmt |= NVME_RW_DSM_FREQ_PREFETCH;
>
> cmnd = &nvmeq->sq_cmds[nvmeq->sq_tail];
> - memset(cmnd, 0, sizeof(*cmnd));
>
> cmnd->rw.opcode = bio_data_dir(bio) ? nvme_cmd_write : nvme_cmd_read;
> + cmnd->rw.flags = 0;
> cmnd->rw.command_id = cmdid;
> cmnd->rw.nsid = cpu_to_le32(ns->ns_id);
> + cmnd->rw.rsvd2 = 0;
> cmnd->rw.prp1 = cpu_to_le64(sg_dma_address(iod->sg));
> cmnd->rw.prp2 = cpu_to_le64(iod->first_dma);
> cmnd->rw.slba = cpu_to_le64(nvme_block_nr(ns, bio->bi_iter.bi_sector));
> @@ -667,6 +668,9 @@ static int nvme_submit_iod(struct nvme_queue *nvmeq, struct nvme_iod *iod)
> cpu_to_le16((bio->bi_iter.bi_size >> ns->lba_shift) - 1);
> cmnd->rw.control = cpu_to_le16(control);
> cmnd->rw.dsmgmt = cpu_to_le32(dsmgmt);
> + cmnd->rw.reftag = 0;
> + cmnd->rw.apptag = 0;
> + cmnd->rw.appmask = 0;
>
> if (++nvmeq->sq_tail == nvmeq->q_depth)
> nvmeq->sq_tail = 0;
>
I think a description above the declaration of struct nvme_rw_command
would be helpful. E.g. "If you modify this structure, please change
nvme_submit_iod accordingly to take care of initialization."
* [PATCH] Remove redundant writes to uncached sqe memory
2014-05-09 20:44 [PATCH] Remove redundant writes to uncached sqe memory Sam Bradshaw
2014-05-09 21:14 ` Matias Bjørling
@ 2014-05-11 2:52 ` Matthew Wilcox
2014-05-12 19:41 ` Sam Bradshaw (sbradshaw)
1 sibling, 1 reply; 5+ messages in thread
From: Matthew Wilcox @ 2014-05-11 2:52 UTC (permalink / raw)
On Fri, May 09, 2014 at 01:44:47PM -0700, Sam Bradshaw wrote:
> The memset to clear the SQE in nvme_submit_iod() is made partially
> redundant by subsequent writes. This patch explicitly clears each
> SQE structure member in ascending order, eliminating the need for
> the memset. With this change, our perf runs show ~1.5% less time
> spent in the IO submission path and minor reduced q lock contention.
I'm shocked! I thought that zeroing the cacheline first would be better
performing than storing into parts of the cacheline. But I can't argue
with your numbers. I think your patch is missing a store to the metadata
element though; care to rerun your test with that added?
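To illustrate why the missing store matters once the memset is gone, here is a minimal userspace sketch. It assumes a 64-byte layout for the read/write SQE written with plain stdint types; the struct, the field values and the helper names (rw_sqe, fill_fields, stale_bytes) are hypothetical and not taken from the driver. Endian conversion is left out for brevity. A reused queue slot is simulated by poisoning it with 0xFF before the field-by-field fill.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed userspace mirror of the read/write SQE (hypothetical, 64 bytes). */
struct rw_sqe {
	uint8_t  opcode;
	uint8_t  flags;
	uint16_t command_id;
	uint32_t nsid;
	uint64_t rsvd2;
	uint64_t metadata;
	uint64_t prp1;
	uint64_t prp2;
	uint64_t slba;
	uint16_t length;
	uint16_t control;
	uint32_t dsmgmt;
	uint32_t reftag;
	uint16_t apptag;
	uint16_t appmask;
};

/*
 * Store every member in ascending order, with no memset beforehand.
 * Passing with_metadata == 0 reproduces the omission discussed above.
 * The values are arbitrary placeholders.
 */
static void fill_fields(struct rw_sqe *c, int with_metadata)
{
	c->opcode = 0x02;
	c->flags = 0;
	c->command_id = 7;
	c->nsid = 1;
	c->rsvd2 = 0;
	if (with_metadata)
		c->metadata = 0;
	c->prp1 = 0x1000;
	c->prp2 = 0x2000;
	c->slba = 42;
	c->length = 7;
	c->control = 0;
	c->dsmgmt = 0;
	c->reftag = 0;
	c->apptag = 0;
	c->appmask = 0;
}

/*
 * Count the bytes that differ between a memset-then-fill reference entry
 * and a poisoned (previously used) slot that is only filled field by field.
 */
static size_t stale_bytes(int with_metadata)
{
	struct rw_sqe ref, slot;
	const unsigned char *a = (const unsigned char *)&ref;
	const unsigned char *b = (const unsigned char *)&slot;
	size_t i, n = 0;

	memset(&ref, 0, sizeof(ref));
	fill_fields(&ref, 1);

	memset(&slot, 0xFF, sizeof(slot));	/* simulate leftover contents */
	fill_fields(&slot, with_metadata);

	for (i = 0; i < sizeof(ref); i++)
		if (a[i] != b[i])
			n++;
	return n;
}
```

Under these assumptions, stale_bytes(1) reports no differences, while stale_bytes(0) reports the eight bytes of the metadata member still holding whatever the slot contained before.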
* [PATCH] Remove redundant writes to uncached sqe memory
2014-05-09 21:14 ` Matias Bjørling
@ 2014-05-11 2:54 ` Matthew Wilcox
0 siblings, 0 replies; 5+ messages in thread
From: Matthew Wilcox @ 2014-05-11 2:54 UTC (permalink / raw)
On Fri, May 09, 2014 at 11:14:46PM +0200, Matias Bjørling wrote:
> I think a description above the declaration of struct nvme_rw_command
> would be helpful. E.g. "If you modify this structure, please change
> nvme_submit_iod accordingly to take care of initialization."
I don't think that's necessary. There are no hidden gaps in the definition
of nvme_rw_command, so the explicit initialisation of all the fields will
write all of the bytes in the command.
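The "no hidden gaps" property can also be checked at compile time. The sketch below is a hypothetical userspace mirror of the command layout (the struct name rw_sqe_layout and the two macros are made up for illustration, and the field list is an assumption rather than a copy of the kernel header). It asserts that each member starts exactly where the previous one ends and that the whole entry is 64 bytes, so a build would fail if a later change introduced padding.

```c
#include <assert.h>	/* static_assert (C11) */
#include <stddef.h>	/* offsetof */
#include <stdint.h>

/* Assumed mirror of the read/write command layout (hypothetical). */
struct rw_sqe_layout {
	uint8_t  opcode;
	uint8_t  flags;
	uint16_t command_id;
	uint32_t nsid;
	uint64_t rsvd2;
	uint64_t metadata;
	uint64_t prp1;
	uint64_t prp2;
	uint64_t slba;
	uint16_t length;
	uint16_t control;
	uint32_t dsmgmt;
	uint32_t reftag;
	uint16_t apptag;
	uint16_t appmask;
};

/* Offset of the first byte past member f. */
#define FIELD_END(f) \
	(offsetof(struct rw_sqe_layout, f) + sizeof(((struct rw_sqe_layout *)0)->f))

/* Fail the build if there is padding between two consecutive members. */
#define ASSERT_ADJACENT(prev, next) \
	static_assert(FIELD_END(prev) == offsetof(struct rw_sqe_layout, next), \
		      "gap between " #prev " and " #next)

static_assert(offsetof(struct rw_sqe_layout, opcode) == 0, "leading gap");
ASSERT_ADJACENT(opcode, flags);
ASSERT_ADJACENT(flags, command_id);
ASSERT_ADJACENT(command_id, nsid);
ASSERT_ADJACENT(nsid, rsvd2);
ASSERT_ADJACENT(rsvd2, metadata);
ASSERT_ADJACENT(metadata, prp1);
ASSERT_ADJACENT(prp1, prp2);
ASSERT_ADJACENT(prp2, slba);
ASSERT_ADJACENT(slba, length);
ASSERT_ADJACENT(length, control);
ASSERT_ADJACENT(control, dsmgmt);
ASSERT_ADJACENT(dsmgmt, reftag);
ASSERT_ADJACENT(reftag, apptag);
ASSERT_ADJACENT(apptag, appmask);

/* No trailing padding either: the members add up to the full entry. */
static_assert(FIELD_END(appmask) == sizeof(struct rw_sqe_layout), "trailing gap");
static_assert(sizeof(struct rw_sqe_layout) == 64, "SQE must be 64 bytes");
```

A check like this only covers padding. It would not catch a newly added member that is left uninitialised in the submission path, which is the case the suggested comment was aimed at.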
* [PATCH] Remove redundant writes to uncached sqe memory
2014-05-11 2:52 ` Matthew Wilcox
@ 2014-05-12 19:41 ` Sam Bradshaw (sbradshaw)
0 siblings, 0 replies; 5+ messages in thread
From: Sam Bradshaw (sbradshaw) @ 2014-05-12 19:41 UTC (permalink / raw)
> -----Original Message-----
> From: Matthew Wilcox [mailto:willy at linux.intel.com]
> Sent: Saturday, May 10, 2014 7:53 PM
> To: Sam Bradshaw (sbradshaw)
> Cc: linux-nvme at lists.infradead.org
> Subject: Re: [PATCH] Remove redundant writes to uncached sqe memory
>
> On Fri, May 09, 2014 at 01:44:47PM -0700, Sam Bradshaw wrote:
> > The memset to clear the SQE in nvme_submit_iod() is made partially
> > redundant by subsequent writes. This patch explicitly clears each
> > SQE structure member in ascending order, eliminating the need for
> > the memset. With this change, our perf runs show ~1.5% less time
> > spent in the IO submission path and minor reduced q lock contention.
>
> I'm shocked! I thought that zeroing the cacheline first would be better
> performing than storing into parts of the cacheline. But I can't argue
> with your numbers. I think your patch is missing a store to the metadata
> element though; care to rerun your test with that added?
Yes, thanks for pointing out the missing store to ->metadata.
Without memset:
+ 35.35% fio nvme_process_cq
+ 10.85% fio nvme_make_request
+ 10.64% fio nvme_map_bio
+ 10.49% fio free_cmdid
+ 9.65% fio alloc_cmdid
+ 9.52% fio bio_completion
+ 5.18% fio nvme_submit_iod
+ 2.38% fio nvme_submit_bio_queue
+ 2.26% fio nvme_alloc_iod
+ 2.17% fio nvme_setup_prps
+ 0.90% fio nvme_free_iod
+ 0.61% fio nvme_irq
With memset:
+ 36.24% fio nvme_process_cq
+ 11.04% fio nvme_make_request
+ 10.76% fio nvme_map_bio
+ 9.48% fio free_cmdid
+ 9.24% fio alloc_cmdid
+ 8.51% fio bio_completion
+ 6.33% fio nvme_submit_iod
+ 2.38% fio nvme_submit_bio_queue
+ 2.30% fio nvme_alloc_iod
+ 2.18% fio nvme_setup_prps
+ 0.91% fio nvme_free_iod
+ 0.62% fio nvme_irq
The numbers pretty consistently show nvme_submit_iod() taking more
CPU time with the memset. But, to be fair, there is run-to-run
variation of about +/- 1% in CPU time for any of the bigger offenders.
The test setup is a bit odd for several (deliberate) reasons:
1) we have fewer queue pairs supported by hardware than cpu cores.
2) fio io submission threads are pegged to cores on nodes other than
the nodes owning the SQ/CQ physical memory, causing the SQE update
to cross a QPI link.
3) io submission threads concurrently run on more than one remote
core and contend for the SQ lock.
If you're not so keen on these data as justification, I could rerun
the experiment on platforms with differing coherency models. It may
take a bit of time, though.
-Sam