[PATCH] NVMe: Reduce divide operations

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH] NVMe: Reduce divide operations
@ 2014-11-20 22:13 Sam Bradshaw
  2014-11-20 23:41 ` Keith Busch
  2014-11-21 13:50 ` Matthew Wilcox
  0 siblings, 2 replies; 7+ messages in thread
From: Sam Bradshaw @ 2014-11-20 22:13 UTC (permalink / raw)


There are several expensive divide operations in the submit and 
completion paths that can be converted to less expensive arithmetic 
and logical operations.  Profiling shows significant drops in time 
spent in nvme_alloc_iod() under common workloads as a result of this 
change.

Patch is against Jens' for-3.19/drivers branch.

Signed-off-by: Sam Bradshaw <sbradshaw at micron.com>
---
diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
index 9310fe5..a5e2ebc 100644
--- a/drivers/block/nvme-core.c
+++ b/drivers/block/nvme-core.c
@@ -360,8 +360,14 @@ static __le64 **iod_list(struct nvme_iod *iod)
  */
 static int nvme_npages(unsigned size, struct nvme_dev *dev)
 {
-	unsigned nprps = DIV_ROUND_UP(size + dev->page_size, dev->page_size);
-	return DIV_ROUND_UP(8 * nprps, dev->page_size - 8);
+	unsigned page_size = (1 << dev->page_shift);
+	unsigned nprps = (size >> dev->page_shift) + 1;
+
+	if (size & (page_size - 1))
+		nprps++;
+	if ((nprps << 3) < (page_size - 8))
+		return 1;
+	return DIV_ROUND_UP(nprps << 3, page_size - 8);
 }
 
 static struct nvme_iod *
@@ -384,7 +390,7 @@ nvme_alloc_iod(unsigned nseg, unsigned nbytes, struct nvme_dev *dev, gfp_t gfp)
 
 void nvme_free_iod(struct nvme_dev *dev, struct nvme_iod *iod)
 {
-	const int last_prp = dev->page_size / 8 - 1;
+	const int last_prp = (1 << (dev->page_shift - 3)) - 1;
 	int i;
 	__le64 **list = iod_list(iod);
 	dma_addr_t prp_dma = iod->first_dma;
@@ -459,7 +465,7 @@ int nvme_setup_prps(struct nvme_dev *dev, struct nvme_iod *iod, int total_len,
 	__le64 **list = iod_list(iod);
 	dma_addr_t prp_dma;
 	int nprps, i;
-	u32 page_size = dev->page_size;
+	u32 page_size = 1 << dev->page_shift;
 
 	length -= (page_size - offset);
 	if (length <= 0)
@@ -1416,7 +1422,7 @@ static int nvme_configure_admin_queue(struct nvme_dev *dev)
 	aqa = nvmeq->q_depth - 1;
 	aqa |= aqa << 16;
 
-	dev->page_size = 1 << page_shift;
+	dev->page_shift = page_shift;
 
 	dev->ctrl_config = NVME_CC_CSS_NVM;
 	dev->ctrl_config |= (page_shift - 12) << NVME_CC_MPS_SHIFT;
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index 258945f..f504fb8 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -100,9 +100,9 @@ struct nvme_dev {
 	char firmware_rev[8];
 	u32 max_hw_sectors;
 	u32 stripe_size;
-	u32 page_size;
 	u16 oncs;
 	u16 abort_limit;
+	u8 page_shift;
 	u8 event_limit;
 	u8 vwc;
 	u8 initialized;

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH] NVMe: Reduce divide operations
  2014-11-20 22:13 [PATCH] NVMe: Reduce divide operations Sam Bradshaw
@ 2014-11-20 23:41 ` Keith Busch
  2014-11-21 17:36   ` Sam Bradshaw (sbradshaw)
  2014-11-21 13:50 ` Matthew Wilcox
  1 sibling, 1 reply; 7+ messages in thread
From: Keith Busch @ 2014-11-20 23:41 UTC (permalink / raw)


On Thu, 20 Nov 2014, Sam Bradshaw wrote:
> There are several expensive divide operations in the submit and
> completion paths that can be converted to less expensive arithmetic
> and logical operations.  Profiling shows significant drops in time
> spent in nvme_alloc_iod() under common workloads as a result of this
> change.

Very cool. I didn't see a difference on my processor's TSC when I added
even more divides to the IO path for mismatched page size support,
but I was afraid it'd have higher cost elsewhere. Thanks for the patch.

> Patch is against Jens' for-3.19/drivers branch.
>
> Signed-off-by: Sam Bradshaw <sbradshaw at micron.com>
> ---
> diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
> index 9310fe5..a5e2ebc 100644
> --- a/drivers/block/nvme-core.c
> +++ b/drivers/block/nvme-core.c
> @@ -360,8 +360,14 @@ static __le64 **iod_list(struct nvme_iod *iod)
>  */
> static int nvme_npages(unsigned size, struct nvme_dev *dev)
> {
> -	unsigned nprps = DIV_ROUND_UP(size + dev->page_size, dev->page_size);
> -	return DIV_ROUND_UP(8 * nprps, dev->page_size - 8);
> +	unsigned page_size = (1 << dev->page_shift);
> +	unsigned nprps = (size >> dev->page_shift) + 1;
> +
> +	if (size & (page_size - 1))
> +		nprps++;
> +	if ((nprps << 3) < (page_size - 8))
> +		return 1;

You actually don't need to subtract 8 here. That's for reserving a place
for chaining a PRP list, but we don't need to reserve a place if all
the entries fit in page.

> +	return DIV_ROUND_UP(nprps << 3, page_size - 8);

Can we get rid of this divide too?!

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH] NVMe: Reduce divide operations
  2014-11-20 22:13 [PATCH] NVMe: Reduce divide operations Sam Bradshaw
  2014-11-20 23:41 ` Keith Busch
@ 2014-11-21 13:50 ` Matthew Wilcox
  2014-11-21 15:52   ` Keith Busch
  2014-11-21 17:38   ` Sam Bradshaw (sbradshaw)
  1 sibling, 2 replies; 7+ messages in thread
From: Matthew Wilcox @ 2014-11-21 13:50 UTC (permalink / raw)

On Thu, Nov 20, 2014@02:13:24PM -0800, Sam Bradshaw wrote:
> There are several expensive divide operations in the submit and 
> completion paths that can be converted to less expensive arithmetic 
> and logical operations.  Profiling shows significant drops in time 
> spent in nvme_alloc_iod() under common workloads as a result of this 
> change.

OK ... but I think you've taken this a bit far.

>  static int nvme_npages(unsigned size, struct nvme_dev *dev)
>  {
> -	unsigned nprps = DIV_ROUND_UP(size + dev->page_size, dev->page_size);
> -	return DIV_ROUND_UP(8 * nprps, dev->page_size - 8);
> +	unsigned page_size = (1 << dev->page_shift);
> +	unsigned nprps = (size >> dev->page_shift) + 1;
> +
> +	if (size & (page_size - 1))
> +		nprps++;
> +	if ((nprps << 3) < (page_size - 8))
> +		return 1;
> +	return DIV_ROUND_UP(nprps << 3, page_size - 8);
>  }

I don't think there's a compiler in the world that doesn't optimise
'x * 8' into 'x << 3' if the latter is cheaper.

Also, I think your nprps is now slightly larger than it used to be.
Shouldn't it look like this?

	unsigned nprps = size >> dev->page_shift;
	if (size & (page_size - 1))
		nprps++;

Let's imagine we're sending a misaligned 16k I/O.  We need 5 PRP entries,
but the first one goes in the command, so we need to allocate enough
space for 4 entries.  16k / 4k is 4, but your code says to add 1.

We can actually do slightly better than the original code in the case
where we're sending a 2MB I/O.  The code in the driver today thinks we
need a second page, but the last entry on the PRP page will be a PRP
Entry, not a PRP List Entry.  So I think this is the right calculation:

 static int nvme_npages(unsigned size, struct nvme_dev *dev)
 {
-	unsigned nprps = DIV_ROUND_UP(size + dev->page_size, dev->page_size);
-	return DIV_ROUND_UP(8 * nprps, dev->page_size - 8);
+	unsigned page_size = (1 << dev->page_shift);
+	unsigned nprps = size >> dev->page_shift;
+
+	if (size & (page_size - 1))
+		nprps++;
+	if ((nprps * 8) <= page_size)
+		return 1;
+	return DIV_ROUND_UP(nprps * 8, page_size - 8);
 }

(it still overestimates for I/Os on 2MB boundaries that are larger than
2MB, but I'm OK with that.  If somebody wants to come up with a neater
calculation, feel free).

We could further optimise it by knowing that we need 0 pages if the I/O
is <= 8k in size:

+	unsigned nprps,	page_size = (1 << dev->page_shift);
+
+	if (size <= page_size * 2)
+		return 0;
+	nprps = size >> dev->page_shift;
+	if (size & (page_size - 1))
...

but I'm not sure that it's worth saving those 8 bytes.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH] NVMe: Reduce divide operations
  2014-11-21 13:50 ` Matthew Wilcox
@ 2014-11-21 15:52   ` Keith Busch
  2014-11-21 17:14     ` Matthew Wilcox
  2014-11-21 17:38   ` Sam Bradshaw (sbradshaw)
  1 sibling, 1 reply; 7+ messages in thread
From: Keith Busch @ 2014-11-21 15:52 UTC (permalink / raw)


On Fri, 21 Nov 2014, Matthew Wilcox wrote:
> We could further optimise it by knowing that we need 0 pages if the I/O
> is <= 8k in size:

I think that'll break if the first page has a non-zero offset, so
we'd have to allow the overestimate if we want to keep this function
simple. We could micro-optimize to return 0 if it's <= 4k, though.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH] NVMe: Reduce divide operations
  2014-11-21 15:52   ` Keith Busch
@ 2014-11-21 17:14     ` Matthew Wilcox
  0 siblings, 0 replies; 7+ messages in thread
From: Matthew Wilcox @ 2014-11-21 17:14 UTC (permalink / raw)


On Fri, Nov 21, 2014@03:52:08PM +0000, Keith Busch wrote:
> On Fri, 21 Nov 2014, Matthew Wilcox wrote:
> >We could further optimise it by knowing that we need 0 pages if the I/O
> >is <= 8k in size:
> 
> I think that'll break if the first page has a non-zero offset, so
> we'd have to allow the overestimate if we want to keep this function
> simple. We could micro-optimize to return 0 if it's <= 4k, though.

Yes, sorry, you're right.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH] NVMe: Reduce divide operations
  2014-11-20 23:41 ` Keith Busch
@ 2014-11-21 17:36   ` Sam Bradshaw (sbradshaw)
  0 siblings, 0 replies; 7+ messages in thread
From: Sam Bradshaw (sbradshaw) @ 2014-11-21 17:36 UTC (permalink / raw)




> > Patch is against Jens' for-3.19/drivers branch.
> >
> > Signed-off-by: Sam Bradshaw <sbradshaw at micron.com>
> > ---
> > diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
> > index 9310fe5..a5e2ebc 100644
> > --- a/drivers/block/nvme-core.c
> > +++ b/drivers/block/nvme-core.c
> > @@ -360,8 +360,14 @@ static __le64 **iod_list(struct nvme_iod *iod)
> > */ static int nvme_npages(unsigned size, struct nvme_dev *dev) {
> > -	unsigned nprps = DIV_ROUND_UP(size + dev->page_size, dev-
> >page_size);
> > -	return DIV_ROUND_UP(8 * nprps, dev->page_size - 8);
> > +	unsigned page_size = (1 << dev->page_shift);
> > +	unsigned nprps = (size >> dev->page_shift) + 1;
> > +
> > +	if (size & (page_size - 1))
> > +		nprps++;
> > +	if ((nprps << 3) < (page_size - 8))
> > +		return 1;
> 
> You actually don't need to subtract 8 here. That's for reserving a
> place for chaining a PRP list, but we don't need to reserve a place if
> all the entries fit in page.

Yes, you're right.  Missed that.

> > +	return DIV_ROUND_UP(nprps << 3, page_size - 8);
> 
> Can we get rid of this divide too?!

I was thinking of another comparison for ((nprps << 3) < (page_size * 2)) 
would eliminate the div for 99% of IO sizes.  I see Matthew already had 
the same thought.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH] NVMe: Reduce divide operations
  2014-11-21 13:50 ` Matthew Wilcox
  2014-11-21 15:52   ` Keith Busch
@ 2014-11-21 17:38   ` Sam Bradshaw (sbradshaw)
  1 sibling, 0 replies; 7+ messages in thread
From: Sam Bradshaw (sbradshaw) @ 2014-11-21 17:38 UTC (permalink / raw)



> I don't think there's a compiler in the world that doesn't optimise 'x
> * 8' into 'x << 3' if the latter is cheaper.
> 
> Also, I think your nprps is now slightly larger than it used to be.
> Shouldn't it look like this?
> 
> 	unsigned nprps = size >> dev->page_shift;
> 	if (size & (page_size - 1))
> 		nprps++;
> 
> Let's imagine we're sending a misaligned 16k I/O.  We need 5 PRP
> entries, but the first one goes in the command, so we need to allocate
> enough space for 4 entries.  16k / 4k is 4, but your code says to add 1.
> 
> We can actually do slightly better than the original code in the case
> where we're sending a 2MB I/O.  The code in the driver today thinks we
> need a second page, but the last entry on the PRP page will be a PRP
> Entry, not a PRP List Entry.  So I think this is the right calculation:
> 
>  static int nvme_npages(unsigned size, struct nvme_dev *dev)  {
> -	unsigned nprps = DIV_ROUND_UP(size + dev->page_size, dev-
> >page_size);
> -	return DIV_ROUND_UP(8 * nprps, dev->page_size - 8);
> +	unsigned page_size = (1 << dev->page_shift);
> +	unsigned nprps = size >> dev->page_shift;
> +
> +	if (size & (page_size - 1))
> +		nprps++;
> +	if ((nprps * 8) <= page_size)
> +		return 1;
> +	return DIV_ROUND_UP(nprps * 8, page_size - 8);
>  }
> 
> (it still overestimates for I/Os on 2MB boundaries that are larger than
> 2MB, but I'm OK with that.  If somebody wants to come up with a neater
> calculation, feel free).

Seems reasonable to me.

> We could further optimise it by knowing that we need 0 pages if the I/O
> is <= 8k in size:
> 
> +	unsigned nprps,	page_size = (1 << dev->page_shift);
> +
> +	if (size <= page_size * 2)
> +		return 0;
> +	nprps = size >> dev->page_shift;
> +	if (size & (page_size - 1))
> ...
> 
> but I'm not sure that it's worth saving those 8 bytes.

Agreed.  I had originally thought of eliminating nvme_npages()
altogether in favor of a worst case padding for the iod but decided
to stick with the overall theme of saving bytes here and there.

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2014-11-21 17:38 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-11-20 22:13 [PATCH] NVMe: Reduce divide operations Sam Bradshaw
2014-11-20 23:41 ` Keith Busch
2014-11-21 17:36   ` Sam Bradshaw (sbradshaw)
2014-11-21 13:50 ` Matthew Wilcox
2014-11-21 15:52   ` Keith Busch
2014-11-21 17:14     ` Matthew Wilcox
2014-11-21 17:38   ` Sam Bradshaw (sbradshaw)

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.