[PATCH] nvme: use __GFP_NOWARN for iod allocation

Linux-NVME Archive on lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH] nvme: use __GFP_NOWARN for iod allocation
@ 2018-06-20 19:10 Jens Axboe
  2018-06-20 19:16 ` Jens Axboe
  0 siblings, 1 reply; 33+ messages in thread
From: Jens Axboe @ 2018-06-20 19:10 UTC (permalink / raw)


The iod can get quite large, requiring higher order allocations that
we know can fail. Don't warn about them, as that just tends to flod
the log with information we don't care about.

Signed-off-by: Jens Axboe <axboe at kernel.dk>

---

This begs the question is we should limit the size in general. The
command overhead is low enough that I think we should default to
something sane that doesn't require _any_ > 0 order allocations.

Our default 1280kb will require 10248 bytes of iod, which is an
order 2 allocation.... That's not helping tail latencies in
data centers, where memory is always full and fragmented.

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index fc33804662e7..1c97d6356484 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -480,7 +480,7 @@ static blk_status_t nvme_init_iod(struct request *rq, struct nvme_dev *dev)
 		size_t alloc_size = nvme_pci_iod_alloc_size(dev, size, nseg,
 				iod->use_sgl);
 
-		iod->sg = kmalloc(alloc_size, GFP_ATOMIC);
+		iod->sg = kmalloc(alloc_size, GFP_ATOMIC | __GFP_NOWARN);
 		if (!iod->sg)
 			return BLK_STS_RESOURCE;
 	} else {

-- 
Jens Axboe

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH] nvme: use __GFP_NOWARN for iod allocation
  2018-06-20 19:10 [PATCH] nvme: use __GFP_NOWARN for iod allocation Jens Axboe
@ 2018-06-20 19:16 ` Jens Axboe
  2018-06-20 20:09   ` Jens Axboe
  0 siblings, 1 reply; 33+ messages in thread
From: Jens Axboe @ 2018-06-20 19:16 UTC (permalink / raw)


On 6/20/18 1:10 PM, Jens Axboe wrote:
> This begs the question is we should limit the size in general. The
> command overhead is low enough that I think we should default to
> something sane that doesn't require _any_ > 0 order allocations.
> 
> Our default 1280kb will require 10248 bytes of iod, which is an
> order 2 allocation.... That's not helping tail latencies in
> data centers, where memory is always full and fragmented.

In terms of sizing, defaulting to 256kb might not be a bad idea.
That'd be 2056 bytes for a normal config. Alternatively we could
go 1280/4 == 320k, which would be 2568 bytes.

In any case, I think it's something that's worth doing.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH] nvme: use __GFP_NOWARN for iod allocation
  2018-06-20 19:16 ` Jens Axboe
@ 2018-06-20 20:09   ` Jens Axboe
  2018-06-20 20:43     ` Keith Busch
                       ` (2 more replies)
  0 siblings, 3 replies; 33+ messages in thread
From: Jens Axboe @ 2018-06-20 20:09 UTC (permalink / raw)


On 6/20/18 1:16 PM, Jens Axboe wrote:
> On 6/20/18 1:10 PM, Jens Axboe wrote:
>> This begs the question is we should limit the size in general. The
>> command overhead is low enough that I think we should default to
>> something sane that doesn't require _any_ > 0 order allocations.
>>
>> Our default 1280kb will require 10248 bytes of iod, which is an
>> order 2 allocation.... That's not helping tail latencies in
>> data centers, where memory is always full and fragmented.
> 
> In terms of sizing, defaulting to 256kb might not be a bad idea.
> That'd be 2056 bytes for a normal config. Alternatively we could
> go 1280/4 == 320k, which would be 2568 bytes.
> 
> In any case, I think it's something that's worth doing.

Talking more to myself - higher order allocations needed to make
progress isn't a viable approach. There may _never_ be any available,
which means we are stuck since this isn't backed by a mempool.

How about something like the below. Default to a size that will
never require > PAGE_SIZE allocation, and back that with a single
entry mempool. Even if we don't wait on it, if we keep retrying
we'll always be able to allocate memory and make forward progress.
And allocating a single page is a hell of a lot more likely than
a 2nd order allocation.


diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 21710a7460c8..fe301423d06c 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -37,6 +37,12 @@
 
 #define NVME_MINORS		(1U << MINORBITS)
 
+/*
+ * This can be higher, but we need to ensure that any command doesn't
+ * require an iod->sg allocation that needs more than a page of aux data.
+ */
+#define NVME_MAX_SECTORS	512
+
 unsigned int admin_timeout = 60;
 module_param(admin_timeout, uint, 0644);
 MODULE_PARM_DESC(admin_timeout, "timeout in seconds for admin commands");
@@ -2378,7 +2384,7 @@ int nvme_init_identify(struct nvme_ctrl *ctrl)
 	if (id->mdts)
 		max_hw_sectors = 1 << (id->mdts + page_shift - 9);
 	else
-		max_hw_sectors = UINT_MAX;
+		max_hw_sectors = NVME_MAX_SECTORS;
 	ctrl->max_hw_sectors =
 		min_not_zero(ctrl->max_hw_sectors, max_hw_sectors);
 
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index fc33804662e7..603998e7d7a6 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -100,6 +100,8 @@ struct nvme_dev {
 	struct nvme_ctrl ctrl;
 	struct completion ioq_wait;
 
+	mempool_t *iod_mempool;
+
 	/* shadow doorbell buffer support: */
 	u32 *dbbuf_dbs;
 	dma_addr_t dbbuf_dbs_dma_addr;
@@ -477,10 +479,10 @@ static blk_status_t nvme_init_iod(struct request *rq, struct nvme_dev *dev)
 	iod->use_sgl = nvme_pci_use_sgls(dev, rq);
 
 	if (nseg > NVME_INT_PAGES || size > NVME_INT_BYTES(dev)) {
-		size_t alloc_size = nvme_pci_iod_alloc_size(dev, size, nseg,
-				iod->use_sgl);
-
-		iod->sg = kmalloc(alloc_size, GFP_ATOMIC);
+		BUG_ON(nvme_pci_iod_alloc_size(dev, size,
+						blk_rq_nr_phys_segments(rq),
+						iod->use_sgl) > PAGE_SIZE);
+		iod->sg = mempool_alloc(dev->iod_mempool, GFP_ATOMIC);
 		if (!iod->sg)
 			return BLK_STS_RESOURCE;
 	} else {
@@ -526,7 +528,7 @@ static void nvme_free_iod(struct nvme_dev *dev, struct request *req)
 	}
 
 	if (iod->sg != iod->inline_sg)
-		kfree(iod->sg);
+		mempool_free(iod->sg, dev->iod_mempool);
 }
 
 #ifdef CONFIG_BLK_DEV_INTEGRITY
@@ -2280,6 +2282,7 @@ static void nvme_pci_free_ctrl(struct nvme_ctrl *ctrl)
 		blk_put_queue(dev->ctrl.admin_q);
 	kfree(dev->queues);
 	free_opal_dev(dev->ctrl.opal_dev);
+	mempool_destroy(dev->iod_mempool);
 	kfree(dev);
 }
 
@@ -2546,6 +2549,14 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	if (result)
 		goto release_pools;
 
+	dev->iod_mempool = mempool_create_node(1, mempool_alloc_pages,
+						mempool_free_pages, (void *) 0,
+						GFP_ATOMIC, node);
+	if (!dev->iod_mempool) {
+		result = -ENOMEM;
+		goto release_pools;
+	}
+
 	dev_info(dev->ctrl.device, "pci function %s\n", dev_name(&pdev->dev));
 
 	nvme_get_ctrl(&dev->ctrl);

-- 
Jens Axboe

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH] nvme: use __GFP_NOWARN for iod allocation
  2018-06-20 20:09   ` Jens Axboe
@ 2018-06-20 20:43     ` Keith Busch
  2018-06-20 20:54       ` Jens Axboe
  2018-06-20 20:51     ` Jens Axboe
  2018-06-21  7:09     ` Christoph Hellwig
  2 siblings, 1 reply; 33+ messages in thread
From: Keith Busch @ 2018-06-20 20:43 UTC (permalink / raw)


On Wed, Jun 20, 2018@02:09:03PM -0600, Jens Axboe wrote:
> On 6/20/18 1:16 PM, Jens Axboe wrote:
> > On 6/20/18 1:10 PM, Jens Axboe wrote:
> >> This begs the question is we should limit the size in general. The
> >> command overhead is low enough that I think we should default to
> >> something sane that doesn't require _any_ > 0 order allocations.
> >>
> >> Our default 1280kb will require 10248 bytes of iod, which is an
> >> order 2 allocation.... That's not helping tail latencies in
> >> data centers, where memory is always full and fragmented.
> > 
> > In terms of sizing, defaulting to 256kb might not be a bad idea.
> > That'd be 2056 bytes for a normal config. Alternatively we could
> > go 1280/4 == 320k, which would be 2568 bytes.
> > 
> > In any case, I think it's something that's worth doing.
> 
> Talking more to myself - higher order allocations needed to make
> progress isn't a viable approach. There may _never_ be any available,
> which means we are stuck since this isn't backed by a mempool.
> 
> How about something like the below. Default to a size that will
> never require > PAGE_SIZE allocation, and back that with a single
> entry mempool. Even if we don't wait on it, if we keep retrying
> we'll always be able to allocate memory and make forward progress.
> And allocating a single page is a hell of a lot more likely than
> a 2nd order allocation.

If we go this route, there is a great deal of cleanup that can follow
on. For example, we'll never need to chain PRPs, so nvme_iod setup/free
can become a lot simpler, and the struct iod npages and hidden "iod_list"
member can go away.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH] nvme: use __GFP_NOWARN for iod allocation
  2018-06-20 20:43     ` Keith Busch
@ 2018-06-20 20:54       ` Jens Axboe
  2018-06-20 22:30         ` Keith Busch
  0 siblings, 1 reply; 33+ messages in thread
From: Jens Axboe @ 2018-06-20 20:54 UTC (permalink / raw)


On 6/20/18 2:43 PM, Keith Busch wrote:
> On Wed, Jun 20, 2018@02:09:03PM -0600, Jens Axboe wrote:
>> On 6/20/18 1:16 PM, Jens Axboe wrote:
>>> On 6/20/18 1:10 PM, Jens Axboe wrote:
>>>> This begs the question is we should limit the size in general. The
>>>> command overhead is low enough that I think we should default to
>>>> something sane that doesn't require _any_ > 0 order allocations.
>>>>
>>>> Our default 1280kb will require 10248 bytes of iod, which is an
>>>> order 2 allocation.... That's not helping tail latencies in
>>>> data centers, where memory is always full and fragmented.
>>>
>>> In terms of sizing, defaulting to 256kb might not be a bad idea.
>>> That'd be 2056 bytes for a normal config. Alternatively we could
>>> go 1280/4 == 320k, which would be 2568 bytes.
>>>
>>> In any case, I think it's something that's worth doing.
>>
>> Talking more to myself - higher order allocations needed to make
>> progress isn't a viable approach. There may _never_ be any available,
>> which means we are stuck since this isn't backed by a mempool.
>>
>> How about something like the below. Default to a size that will
>> never require > PAGE_SIZE allocation, and back that with a single
>> entry mempool. Even if we don't wait on it, if we keep retrying
>> we'll always be able to allocate memory and make forward progress.
>> And allocating a single page is a hell of a lot more likely than
>> a 2nd order allocation.
> 
> If we go this route, there is a great deal of cleanup that can follow
> on. For example, we'll never need to chain PRPs, so nvme_iod setup/free
> can become a lot simpler, and the struct iod npages and hidden "iod_list"
> member can go away.

Missed this one before sending the other one out - yeah, I agree.
And since we already have the inline sg for smaller commands, I
think it makes a lot of sense to move in this direction.

I should also mention that this came out of a real life hang
on a production box. I suppose we haven't seen a lot of these
since a lot of nvme controllers have a hw limit of 128k, but
the ones that don't can pretty easily hit the hang of needing
2^2 pages and failing miserably forever.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH] nvme: use __GFP_NOWARN for iod allocation
  2018-06-20 20:54       ` Jens Axboe
@ 2018-06-20 22:30         ` Keith Busch
  2018-06-20 22:30           ` Jens Axboe
  0 siblings, 1 reply; 33+ messages in thread
From: Keith Busch @ 2018-06-20 22:30 UTC (permalink / raw)

On Wed, Jun 20, 2018@02:54:26PM -0600, Jens Axboe wrote:
> I should also mention that this came out of a real life hang
> on a production box. I suppose we haven't seen a lot of these
> since a lot of nvme controllers have a hw limit of 128k, but
> the ones that don't can pretty easily hit the hang of needing
> 2^2 pages and failing miserably forever.

I can certainly believe that. I'm totally fine with setting a smaller
upper bound on the IO size. There really isn't a meaningful bandwidth
advantage to larger transfers on NVMe PCIe once you get over 256k anyway.

Since you're making the change to core, let's wait to hear if there are
any concerns on the fabrics front.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH] nvme: use __GFP_NOWARN for iod allocation
  2018-06-20 22:30         ` Keith Busch
@ 2018-06-20 22:30           ` Jens Axboe
  2018-06-21  7:10             ` Christoph Hellwig
  0 siblings, 1 reply; 33+ messages in thread
From: Jens Axboe @ 2018-06-20 22:30 UTC (permalink / raw)


On 6/20/18 4:30 PM, Keith Busch wrote:
> On Wed, Jun 20, 2018@02:54:26PM -0600, Jens Axboe wrote:
>> I should also mention that this came out of a real life hang
>> on a production box. I suppose we haven't seen a lot of these
>> since a lot of nvme controllers have a hw limit of 128k, but
>> the ones that don't can pretty easily hit the hang of needing
>> 2^2 pages and failing miserably forever.
> 
> I can certainly believe that. I'm totally fine with setting a smaller
> upper bound on the IO size. There really isn't a meaningful bandwidth
> advantage to larger transfers on NVMe PCIe once you get over 256k anyway.

Right. We can retain most of the maximum size there, but the segments
must go down. But functionally it's the same, and we're well within
the region that should be plenty fast.

> Since you're making the change to core, let's wait to hear if there are
> any concerns on the fabrics front.

If the fabrics folks have concerns, we could make it dependent on
it being a pci device. That's the case I care about anyway, and the
one that's currently breaking for me.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH] nvme: use __GFP_NOWARN for iod allocation
  2018-06-20 22:30           ` Jens Axboe
@ 2018-06-21  7:10             ` Christoph Hellwig
  2018-06-21 14:47               ` Jens Axboe
  0 siblings, 1 reply; 33+ messages in thread
From: Christoph Hellwig @ 2018-06-21  7:10 UTC (permalink / raw)


On Wed, Jun 20, 2018@04:30:18PM -0600, Jens Axboe wrote:
> If the fabrics folks have concerns, we could make it dependent on
> it being a pci device. That's the case I care about anyway, and the
> one that's currently breaking for me.

Fabrics uses the SG chain allocator that SCSI also uses, so it
should already be fine, including for larger I/O sizes.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH] nvme: use __GFP_NOWARN for iod allocation
  2018-06-21  7:10             ` Christoph Hellwig
@ 2018-06-21 14:47               ` Jens Axboe
  2018-06-21 15:01                 ` Christoph Hellwig
  0 siblings, 1 reply; 33+ messages in thread
From: Jens Axboe @ 2018-06-21 14:47 UTC (permalink / raw)


On 6/21/18 1:10 AM, Christoph Hellwig wrote:
> On Wed, Jun 20, 2018@04:30:18PM -0600, Jens Axboe wrote:
>> If the fabrics folks have concerns, we could make it dependent on
>> it being a pci device. That's the case I care about anyway, and the
>> one that's currently breaking for me.
> 
> Fabrics uses the SG chain allocator that SCSI also uses, so it
> should already be fine, including for larger I/O sizes.

I'll change it to ignore it for fabrics. What's the best way to
check? ctrl->sqsize?

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH] nvme: use __GFP_NOWARN for iod allocation
  2018-06-21 14:47               ` Jens Axboe
@ 2018-06-21 15:01                 ` Christoph Hellwig
  2018-06-21 15:04                   ` Jens Axboe
  0 siblings, 1 reply; 33+ messages in thread
From: Christoph Hellwig @ 2018-06-21 15:01 UTC (permalink / raw)


On Thu, Jun 21, 2018@08:47:34AM -0600, Jens Axboe wrote:
> On 6/21/18 1:10 AM, Christoph Hellwig wrote:
> > On Wed, Jun 20, 2018@04:30:18PM -0600, Jens Axboe wrote:
> >> If the fabrics folks have concerns, we could make it dependent on
> >> it being a pci device. That's the case I care about anyway, and the
> >> one that's currently breaking for me.
> > 
> > Fabrics uses the SG chain allocator that SCSI also uses, so it
> > should already be fine, including for larger I/O sizes.
> 
> I'll change it to ignore it for fabrics. What's the best way to
> check? ctrl->sqsize?

ctrl->ops->flags & NVME_F_FABRICS

but in the end we probably want a separate flag, as that is an
implementation detail, and not because these are fabrics drivers.

I'll be spending the rest of the day on phone calls, so it'll take
some time to do more on this than writing some email, but I'll try
to find some time to look into these issues.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH] nvme: use __GFP_NOWARN for iod allocation
  2018-06-21 15:01                 ` Christoph Hellwig
@ 2018-06-21 15:04                   ` Jens Axboe
  2018-06-21 15:06                     ` Jens Axboe
  2018-06-21 15:24                     ` Christoph Hellwig
  0 siblings, 2 replies; 33+ messages in thread
From: Jens Axboe @ 2018-06-21 15:04 UTC (permalink / raw)


On 6/21/18 9:01 AM, Christoph Hellwig wrote:
> On Thu, Jun 21, 2018@08:47:34AM -0600, Jens Axboe wrote:
>> On 6/21/18 1:10 AM, Christoph Hellwig wrote:
>>> On Wed, Jun 20, 2018@04:30:18PM -0600, Jens Axboe wrote:
>>>> If the fabrics folks have concerns, we could make it dependent on
>>>> it being a pci device. That's the case I care about anyway, and the
>>>> one that's currently breaking for me.
>>>
>>> Fabrics uses the SG chain allocator that SCSI also uses, so it
>>> should already be fine, including for larger I/O sizes.
>>
>> I'll change it to ignore it for fabrics. What's the best way to
>> check? ctrl->sqsize?
> 
> ctrl->ops->flags & NVME_F_FABRICS
> 
> but in the end we probably want a separate flag, as that is an
> implementation detail, and not because these are fabrics drivers.

Probably fine for now with a comment, since it is pretty tightly
coupled to the fact that we prefer larger commands on fabrics, and
don't care so much about that for a local controller.

For now the check is moot though, since we are doing the same
allocations...

> I'll be spending the rest of the day on phone calls, so it'll take
> some time to do more on this than writing some email, but I'll try
> to find some time to look into these issues.

OK sounds good, would be nice to get this hashed out as I think some
variant of this should go into 4.18.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH] nvme: use __GFP_NOWARN for iod allocation
  2018-06-21 15:04                   ` Jens Axboe
@ 2018-06-21 15:06                     ` Jens Axboe
  2018-06-21 15:24                       ` Christoph Hellwig
  2018-06-21 15:24                     ` Christoph Hellwig
  1 sibling, 1 reply; 33+ messages in thread
From: Jens Axboe @ 2018-06-21 15:06 UTC (permalink / raw)


On 6/21/18 9:04 AM, Jens Axboe wrote:
> On 6/21/18 9:01 AM, Christoph Hellwig wrote:
>> On Thu, Jun 21, 2018@08:47:34AM -0600, Jens Axboe wrote:
>>> On 6/21/18 1:10 AM, Christoph Hellwig wrote:
>>>> On Wed, Jun 20, 2018@04:30:18PM -0600, Jens Axboe wrote:
>>>>> If the fabrics folks have concerns, we could make it dependent on
>>>>> it being a pci device. That's the case I care about anyway, and the
>>>>> one that's currently breaking for me.
>>>>
>>>> Fabrics uses the SG chain allocator that SCSI also uses, so it
>>>> should already be fine, including for larger I/O sizes.
>>>
>>> I'll change it to ignore it for fabrics. What's the best way to
>>> check? ctrl->sqsize?
>>
>> ctrl->ops->flags & NVME_F_FABRICS
>>
>> but in the end we probably want a separate flag, as that is an
>> implementation detail, and not because these are fabrics drivers.
> 
> Probably fine for now with a comment, since it is pretty tightly
> coupled to the fact that we prefer larger commands on fabrics, and
> don't care so much about that for a local controller.
> 
> For now the check is moot though, since we are doing the same
> allocations...

For size I guess it's fine, as long as we limit segments. For
reference, current version below.


diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 21710a7460c8..f40d46668532 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1808,8 +1808,9 @@ static void nvme_set_queue_limits(struct nvme_ctrl *ctrl,
 		u32 max_segments =
 			(ctrl->max_hw_sectors / (ctrl->page_size >> 9)) + 1;
 
+		max_segments = min_not_zero(max_segments, (u32)NVME_MAX_SEGS);
 		blk_queue_max_hw_sectors(q, ctrl->max_hw_sectors);
-		blk_queue_max_segments(q, min_t(u32, max_segments, USHRT_MAX));
+		blk_queue_max_segments(q, max_segments);
 	}
 	if ((ctrl->quirks & NVME_QUIRK_STRIPE_SIZE) &&
 	    is_power_of_2(ctrl->max_hw_sectors))
@@ -2377,8 +2378,10 @@ int nvme_init_identify(struct nvme_ctrl *ctrl)
 	ctrl->cntlid = le16_to_cpup(&id->cntlid);
 	if (id->mdts)
 		max_hw_sectors = 1 << (id->mdts + page_shift - 9);
-	else
+	else if (ctrl->ops->flags & NVME_F_FABRICS)
 		max_hw_sectors = UINT_MAX;
+	else
+		max_hw_sectors = NVME_MAX_KB_SZ << 1;
 	ctrl->max_hw_sectors =
 		min_not_zero(ctrl->max_hw_sectors, max_hw_sectors);
 
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 231807cbc849..efce853a6638 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -33,6 +33,13 @@ extern unsigned int admin_timeout;
 #define NVME_DEFAULT_KATO	5
 #define NVME_KATO_GRACE		10
 
+/*
+ * These can be higher, but we need to ensure that any command doesn't
+ * require an sg allocation that needs more than a page of data.
+ */
+#define NVME_MAX_KB_SZ	4096
+#define NVME_MAX_SEGS	127
+
 extern struct workqueue_struct *nvme_wq;
 extern struct workqueue_struct *nvme_reset_wq;
 extern struct workqueue_struct *nvme_delete_wq;
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index fc33804662e7..d5f603b05394 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -100,6 +100,8 @@ struct nvme_dev {
 	struct nvme_ctrl ctrl;
 	struct completion ioq_wait;
 
+	mempool_t *iod_mempool;
+
 	/* shadow doorbell buffer support: */
 	u32 *dbbuf_dbs;
 	dma_addr_t dbbuf_dbs_dma_addr;
@@ -477,10 +479,7 @@ static blk_status_t nvme_init_iod(struct request *rq, struct nvme_dev *dev)
 	iod->use_sgl = nvme_pci_use_sgls(dev, rq);
 
 	if (nseg > NVME_INT_PAGES || size > NVME_INT_BYTES(dev)) {
-		size_t alloc_size = nvme_pci_iod_alloc_size(dev, size, nseg,
-				iod->use_sgl);
-
-		iod->sg = kmalloc(alloc_size, GFP_ATOMIC);
+		iod->sg = mempool_alloc(dev->iod_mempool, GFP_ATOMIC);
 		if (!iod->sg)
 			return BLK_STS_RESOURCE;
 	} else {
@@ -526,7 +525,7 @@ static void nvme_free_iod(struct nvme_dev *dev, struct request *req)
 	}
 
 	if (iod->sg != iod->inline_sg)
-		kfree(iod->sg);
+		mempool_free(iod->sg, dev->iod_mempool);
 }
 
 #ifdef CONFIG_BLK_DEV_INTEGRITY
@@ -2280,6 +2279,7 @@ static void nvme_pci_free_ctrl(struct nvme_ctrl *ctrl)
 		blk_put_queue(dev->ctrl.admin_q);
 	kfree(dev->queues);
 	free_opal_dev(dev->ctrl.opal_dev);
+	mempool_destroy(dev->iod_mempool);
 	kfree(dev);
 }
 
@@ -2509,6 +2509,7 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	int node, result = -ENOMEM;
 	struct nvme_dev *dev;
 	unsigned long quirks = id->driver_data;
+	size_t alloc_size;
 
 	node = dev_to_node(&pdev->dev);
 	if (node == NUMA_NO_NODE)
@@ -2546,6 +2547,18 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	if (result)
 		goto release_pools;
 
+	alloc_size = nvme_pci_iod_alloc_size(dev, NVME_MAX_KB_SZ,
+						NVME_MAX_SEGS, true);
+	WARN_ON_ONCE(alloc_size > PAGE_SIZE);
+	dev->iod_mempool = mempool_create_node(1, mempool_kmalloc,
+						mempool_kfree,
+						(void *) alloc_size,
+						GFP_KERNEL, node);
+	if (!dev->iod_mempool) {
+		result = -ENOMEM;
+		goto release_pools;
+	}
+
 	dev_info(dev->ctrl.device, "pci function %s\n", dev_name(&pdev->dev));
 
 	nvme_get_ctrl(&dev->ctrl);

-- 
Jens Axboe

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH] nvme: use __GFP_NOWARN for iod allocation
  2018-06-21 15:06                     ` Jens Axboe
@ 2018-06-21 15:24                       ` Christoph Hellwig
  2018-06-21 15:26                         ` Jens Axboe
  0 siblings, 1 reply; 33+ messages in thread
From: Christoph Hellwig @ 2018-06-21 15:24 UTC (permalink / raw)


I guess we can live with it for now with a big comment

>  	ctrl->cntlid = le16_to_cpup(&id->cntlid);
>  	if (id->mdts)
>  		max_hw_sectors = 1 << (id->mdts + page_shift - 9);
> -	else
> +	else if (ctrl->ops->flags & NVME_F_FABRICS)
>  		max_hw_sectors = UINT_MAX;
> +	else
> +		max_hw_sectors = NVME_MAX_KB_SZ << 1;

Can this I think is better done by setting ctrl->max_hw_sectors
in the PCIe driver (similar to what the other drivers already do)
and then rely on the min_not_zero below to factor in MDTS.

> +/*
> + * These can be higher, but we need to ensure that any command doesn't
> + * require an sg allocation that needs more than a page of data.
> + */
> +#define NVME_MAX_KB_SZ	4096

I think this should move the pci.c

> +#define NVME_MAX_SEGS	127

And this could use a good explanation.  Especially as I think
that we shouldn't apply the limit for fabrics.  So instead we
might just add a new ctrl->max_segments field, which PCI sets,
and which then is min() with the otherwise calculated value.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH] nvme: use __GFP_NOWARN for iod allocation
  2018-06-21 15:24                       ` Christoph Hellwig
@ 2018-06-21 15:26                         ` Jens Axboe
  0 siblings, 0 replies; 33+ messages in thread
From: Jens Axboe @ 2018-06-21 15:26 UTC (permalink / raw)


On 6/21/18 9:24 AM, Christoph Hellwig wrote:
> I guess we can live with it for now with a big comment
> 
>>  	ctrl->cntlid = le16_to_cpup(&id->cntlid);
>>  	if (id->mdts)
>>  		max_hw_sectors = 1 << (id->mdts + page_shift - 9);
>> -	else
>> +	else if (ctrl->ops->flags & NVME_F_FABRICS)
>>  		max_hw_sectors = UINT_MAX;
>> +	else
>> +		max_hw_sectors = NVME_MAX_KB_SZ << 1;
> 
> Can this I think is better done by setting ctrl->max_hw_sectors
> in the PCIe driver (similar to what the other drivers already do)
> and then rely on the min_not_zero below to factor in MDTS.

Yeah, I can put it there, and we can get rid of the check here.
Maybe do the same for segments, adding a ctrl->max_segments.
If non-zero, use that in init_ctrl().

>> +/*
>> + * These can be higher, but we need to ensure that any command doesn't
>> + * require an sg allocation that needs more than a page of data.
>> + */
>> +#define NVME_MAX_KB_SZ	4096
> 
> I think this should move the pci.c

With the above change it can.

>> +#define NVME_MAX_SEGS	127
> 
> And this could use a good explanation.  Especially as I think
> that we shouldn't apply the limit for fabrics.  So instead we
> might just add a new ctrl->max_segments field, which PCI sets,
> and which then is min() with the otherwise calculated value.

I'll add that.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH] nvme: use __GFP_NOWARN for iod allocation
  2018-06-21 15:04                   ` Jens Axboe
  2018-06-21 15:06                     ` Jens Axboe
@ 2018-06-21 15:24                     ` Christoph Hellwig
  2018-06-21 15:34                       ` Jens Axboe
  1 sibling, 1 reply; 33+ messages in thread
From: Christoph Hellwig @ 2018-06-21 15:24 UTC (permalink / raw)


On Thu, Jun 21, 2018@09:04:14AM -0600, Jens Axboe wrote:
> > I'll be spending the rest of the day on phone calls, so it'll take
> > some time to do more on this than writing some email, but I'll try
> > to find some time to look into these issues.
> 
> OK sounds good, would be nice to get this hashed out as I think some
> variant of this should go into 4.18.

Fine with me, but I'd rather not rush it out this week to give us
a little more time for details and aim for merging it to 4.18 next
week.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH] nvme: use __GFP_NOWARN for iod allocation
  2018-06-21 15:24                     ` Christoph Hellwig
@ 2018-06-21 15:34                       ` Jens Axboe
  2018-06-21 15:45                         ` Christoph Hellwig
  0 siblings, 1 reply; 33+ messages in thread
From: Jens Axboe @ 2018-06-21 15:34 UTC (permalink / raw)


On 6/21/18 9:24 AM, Christoph Hellwig wrote:
> On Thu, Jun 21, 2018@09:04:14AM -0600, Jens Axboe wrote:
>>> I'll be spending the rest of the day on phone calls, so it'll take
>>> some time to do more on this than writing some email, but I'll try
>>> to find some time to look into these issues.
>>
>> OK sounds good, would be nice to get this hashed out as I think some
>> variant of this should go into 4.18.
> 
> Fine with me, but I'd rather not rush it out this week to give us
> a little more time for details and aim for merging it to 4.18 next
> week.

Oh, I'm totally fine with that, no rush outside of just wanting to
ensure it ends up in 4.18. But there's plenty of time for that.

Below is cleaner I think, keeping it all in pci.c and fabrics should
just work. I'll test...

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 21710a7460c8..46df030b2c3f 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1808,6 +1808,7 @@ static void nvme_set_queue_limits(struct nvme_ctrl *ctrl,
 		u32 max_segments =
 			(ctrl->max_hw_sectors / (ctrl->page_size >> 9)) + 1;
 
+		max_segments = min_not_zero(max_segments, ctrl->max_segments);
 		blk_queue_max_hw_sectors(q, ctrl->max_hw_sectors);
 		blk_queue_max_segments(q, min_t(u32, max_segments, USHRT_MAX));
 	}
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 231807cbc849..0c4a33df3b2f 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -170,6 +170,7 @@ struct nvme_ctrl {
 	u64 cap;
 	u32 page_size;
 	u32 max_hw_sectors;
+	u32 max_segments;
 	u16 oncs;
 	u16 oacs;
 	u16 nssa;
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index fc33804662e7..276acc045983 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -38,6 +38,13 @@
 
 #define SGES_PER_PAGE	(PAGE_SIZE / sizeof(struct nvme_sgl_desc))
 
+/*
+ * These can be higher, but we need to ensure that any command doesn't
+ * require an sg allocation that needs more than a page of data.
+ */
+#define NVME_MAX_KB_SZ	4096
+#define NVME_MAX_SEGS	127
+
 static int use_threaded_interrupts;
 module_param(use_threaded_interrupts, int, 0);
 
@@ -100,6 +107,8 @@ struct nvme_dev {
 	struct nvme_ctrl ctrl;
 	struct completion ioq_wait;
 
+	mempool_t *iod_mempool;
+
 	/* shadow doorbell buffer support: */
 	u32 *dbbuf_dbs;
 	dma_addr_t dbbuf_dbs_dma_addr;
@@ -477,10 +486,7 @@ static blk_status_t nvme_init_iod(struct request *rq, struct nvme_dev *dev)
 	iod->use_sgl = nvme_pci_use_sgls(dev, rq);
 
 	if (nseg > NVME_INT_PAGES || size > NVME_INT_BYTES(dev)) {
-		size_t alloc_size = nvme_pci_iod_alloc_size(dev, size, nseg,
-				iod->use_sgl);
-
-		iod->sg = kmalloc(alloc_size, GFP_ATOMIC);
+		iod->sg = mempool_alloc(dev->iod_mempool, GFP_ATOMIC);
 		if (!iod->sg)
 			return BLK_STS_RESOURCE;
 	} else {
@@ -526,7 +532,7 @@ static void nvme_free_iod(struct nvme_dev *dev, struct request *req)
 	}
 
 	if (iod->sg != iod->inline_sg)
-		kfree(iod->sg);
+		mempool_free(iod->sg, dev->iod_mempool);
 }
 
 #ifdef CONFIG_BLK_DEV_INTEGRITY
@@ -2280,6 +2286,7 @@ static void nvme_pci_free_ctrl(struct nvme_ctrl *ctrl)
 		blk_put_queue(dev->ctrl.admin_q);
 	kfree(dev->queues);
 	free_opal_dev(dev->ctrl.opal_dev);
+	mempool_destroy(dev->iod_mempool);
 	kfree(dev);
 }
 
@@ -2333,6 +2340,13 @@ static void nvme_reset_work(struct work_struct *work)
 	if (result)
 		goto out;
 
+	/*
+	 * Limit the max command size to prevent iod->sg allocations going
+	 * over a single page.
+	 */
+	dev->ctrl.max_hw_sectors = NVME_MAX_KB_SZ << 1;
+	dev->ctrl.max_segments = NVME_MAX_SEGS;
+
 	result = nvme_init_identify(&dev->ctrl);
 	if (result)
 		goto out;
@@ -2509,6 +2523,7 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	int node, result = -ENOMEM;
 	struct nvme_dev *dev;
 	unsigned long quirks = id->driver_data;
+	size_t alloc_size;
 
 	node = dev_to_node(&pdev->dev);
 	if (node == NUMA_NO_NODE)
@@ -2546,6 +2561,23 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	if (result)
 		goto release_pools;
 
+	/*
+	 * Double check that our mempool alloc size will cover the biggest
+	 * command we support.
+	 */
+	alloc_size = nvme_pci_iod_alloc_size(dev, NVME_MAX_KB_SZ,
+						NVME_MAX_SEGS, true);
+	WARN_ON_ONCE(alloc_size > PAGE_SIZE);
+
+	dev->iod_mempool = mempool_create_node(1, mempool_kmalloc,
+						mempool_kfree,
+						(void *) alloc_size,
+						GFP_KERNEL, node);
+	if (!dev->iod_mempool) {
+		result = -ENOMEM;
+		goto release_pools;
+	}
+
 	dev_info(dev->ctrl.device, "pci function %s\n", dev_name(&pdev->dev));
 
 	nvme_get_ctrl(&dev->ctrl);

-- 
Jens Axboe

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH] nvme: use __GFP_NOWARN for iod allocation
  2018-06-21 15:34                       ` Jens Axboe
@ 2018-06-21 15:45                         ` Christoph Hellwig
  2018-06-21 15:49                           ` Jens Axboe
  0 siblings, 1 reply; 33+ messages in thread
From: Christoph Hellwig @ 2018-06-21 15:45 UTC (permalink / raw)


On Thu, Jun 21, 2018@09:34:02AM -0600, Jens Axboe wrote:
> Oh, I'm totally fine with that, no rush outside of just wanting to
> ensure it ends up in 4.18. But there's plenty of time for that.
> 
> Below is cleaner I think, keeping it all in pci.c and fabrics should
> just work. I'll test...

Ok, this actually looks harmless enough that with successfull testing
and an ack from Keith we could queue it up this week..

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH] nvme: use __GFP_NOWARN for iod allocation
  2018-06-21 15:45                         ` Christoph Hellwig
@ 2018-06-21 15:49                           ` Jens Axboe
  2018-06-21 16:27                             ` Keith Busch
  0 siblings, 1 reply; 33+ messages in thread
From: Jens Axboe @ 2018-06-21 15:49 UTC (permalink / raw)


On 6/21/18 9:45 AM, Christoph Hellwig wrote:
> On Thu, Jun 21, 2018@09:34:02AM -0600, Jens Axboe wrote:
>> Oh, I'm totally fine with that, no rush outside of just wanting to
>> ensure it ends up in 4.18. But there's plenty of time for that.
>>
>> Below is cleaner I think, keeping it all in pci.c and fabrics should
>> just work. I'll test...
> 
> Ok, this actually looks harmless enough that with successfull testing
> and an ack from Keith we could queue it up this week..

Tested it here now (laptop and box with 6 drives), works for me. Proper
patch below.

From: Jens Axboe <axboe@kernel.dk>
Subject: [PATCH] nvme: limit max IO size and segments for pci-e

nvme requires an sg table allocation for each request. If the request
is large, then the allocation can become quite large. For instance,
with our default software settings of 1280KB IO size, we'll need
10248 bytes of sg table. That turns into a 2nd order allocation,
which we can't always guarantee. If we fail the allocation, blk-mq
will retry it later. But there's no guarantee that we'll EVER be
able to allocate that much contigious memory.

Limit the IO size such that we never need more than a single page
of memory. That's a lot faster and more reliable. Then back that
allocation with a mempool, so that we know we'll always be able
to succeed the allocation at some point.

Signed-off-by: Jens Axboe <axboe at kernel.dk>

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 21710a7460c8..46df030b2c3f 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1808,6 +1808,7 @@ static void nvme_set_queue_limits(struct nvme_ctrl *ctrl,
 		u32 max_segments =
 			(ctrl->max_hw_sectors / (ctrl->page_size >> 9)) + 1;
 
+		max_segments = min_not_zero(max_segments, ctrl->max_segments);
 		blk_queue_max_hw_sectors(q, ctrl->max_hw_sectors);
 		blk_queue_max_segments(q, min_t(u32, max_segments, USHRT_MAX));
 	}
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 231807cbc849..0c4a33df3b2f 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -170,6 +170,7 @@ struct nvme_ctrl {
 	u64 cap;
 	u32 page_size;
 	u32 max_hw_sectors;
+	u32 max_segments;
 	u16 oncs;
 	u16 oacs;
 	u16 nssa;
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index fc33804662e7..276acc045983 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -38,6 +38,13 @@
 
 #define SGES_PER_PAGE	(PAGE_SIZE / sizeof(struct nvme_sgl_desc))
 
+/*
+ * These can be higher, but we need to ensure that any command doesn't
+ * require an sg allocation that needs more than a page of data.
+ */
+#define NVME_MAX_KB_SZ	4096
+#define NVME_MAX_SEGS	127
+
 static int use_threaded_interrupts;
 module_param(use_threaded_interrupts, int, 0);
 
@@ -100,6 +107,8 @@ struct nvme_dev {
 	struct nvme_ctrl ctrl;
 	struct completion ioq_wait;
 
+	mempool_t *iod_mempool;
+
 	/* shadow doorbell buffer support: */
 	u32 *dbbuf_dbs;
 	dma_addr_t dbbuf_dbs_dma_addr;
@@ -477,10 +486,7 @@ static blk_status_t nvme_init_iod(struct request *rq, struct nvme_dev *dev)
 	iod->use_sgl = nvme_pci_use_sgls(dev, rq);
 
 	if (nseg > NVME_INT_PAGES || size > NVME_INT_BYTES(dev)) {
-		size_t alloc_size = nvme_pci_iod_alloc_size(dev, size, nseg,
-				iod->use_sgl);
-
-		iod->sg = kmalloc(alloc_size, GFP_ATOMIC);
+		iod->sg = mempool_alloc(dev->iod_mempool, GFP_ATOMIC);
 		if (!iod->sg)
 			return BLK_STS_RESOURCE;
 	} else {
@@ -526,7 +532,7 @@ static void nvme_free_iod(struct nvme_dev *dev, struct request *req)
 	}
 
 	if (iod->sg != iod->inline_sg)
-		kfree(iod->sg);
+		mempool_free(iod->sg, dev->iod_mempool);
 }
 
 #ifdef CONFIG_BLK_DEV_INTEGRITY
@@ -2280,6 +2286,7 @@ static void nvme_pci_free_ctrl(struct nvme_ctrl *ctrl)
 		blk_put_queue(dev->ctrl.admin_q);
 	kfree(dev->queues);
 	free_opal_dev(dev->ctrl.opal_dev);
+	mempool_destroy(dev->iod_mempool);
 	kfree(dev);
 }
 
@@ -2333,6 +2340,13 @@ static void nvme_reset_work(struct work_struct *work)
 	if (result)
 		goto out;
 
+	/*
+	 * Limit the max command size to prevent iod->sg allocations going
+	 * over a single page.
+	 */
+	dev->ctrl.max_hw_sectors = NVME_MAX_KB_SZ << 1;
+	dev->ctrl.max_segments = NVME_MAX_SEGS;
+
 	result = nvme_init_identify(&dev->ctrl);
 	if (result)
 		goto out;
@@ -2509,6 +2523,7 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	int node, result = -ENOMEM;
 	struct nvme_dev *dev;
 	unsigned long quirks = id->driver_data;
+	size_t alloc_size;
 
 	node = dev_to_node(&pdev->dev);
 	if (node == NUMA_NO_NODE)
@@ -2546,6 +2561,23 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	if (result)
 		goto release_pools;
 
+	/*
+	 * Double check that our mempool alloc size will cover the biggest
+	 * command we support.
+	 */
+	alloc_size = nvme_pci_iod_alloc_size(dev, NVME_MAX_KB_SZ,
+						NVME_MAX_SEGS, true);
+	WARN_ON_ONCE(alloc_size > PAGE_SIZE);
+
+	dev->iod_mempool = mempool_create_node(1, mempool_kmalloc,
+						mempool_kfree,
+						(void *) alloc_size,
+						GFP_KERNEL, node);
+	if (!dev->iod_mempool) {
+		result = -ENOMEM;
+		goto release_pools;
+	}
+
 	dev_info(dev->ctrl.device, "pci function %s\n", dev_name(&pdev->dev));
 
 	nvme_get_ctrl(&dev->ctrl);

-- 
Jens Axboe

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH] nvme: use __GFP_NOWARN for iod allocation
  2018-06-21 15:49                           ` Jens Axboe
@ 2018-06-21 16:27                             ` Keith Busch
  2018-06-21 17:20                               ` Jens Axboe
  0 siblings, 1 reply; 33+ messages in thread
From: Keith Busch @ 2018-06-21 16:27 UTC (permalink / raw)


On Thu, Jun 21, 2018@09:49:37AM -0600, Jens Axboe wrote:
> From: Jens Axboe <axboe at kernel.dk>
> Subject: [PATCH] nvme: limit max IO size and segments for pci-e
> 
> nvme requires an sg table allocation for each request. If the request
> is large, then the allocation can become quite large. For instance,
> with our default software settings of 1280KB IO size, we'll need
> 10248 bytes of sg table. That turns into a 2nd order allocation,
> which we can't always guarantee. If we fail the allocation, blk-mq
> will retry it later. But there's no guarantee that we'll EVER be
> able to allocate that much contigious memory.
> 
> Limit the IO size such that we never need more than a single page
> of memory. That's a lot faster and more reliable. Then back that
> allocation with a mempool, so that we know we'll always be able
> to succeed the allocation at some point.
> 
> Signed-off-by: Jens Axboe <axboe at kernel.dk>

This looks good to me. Thanks!

Acked-by: Keith Busch <keith.busch at intel.com>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH] nvme: use __GFP_NOWARN for iod allocation
  2018-06-21 16:27                             ` Keith Busch
@ 2018-06-21 17:20                               ` Jens Axboe
  2018-06-21 17:54                                 ` Christoph Hellwig
  0 siblings, 1 reply; 33+ messages in thread
From: Jens Axboe @ 2018-06-21 17:20 UTC (permalink / raw)


On 6/21/18 10:27 AM, Keith Busch wrote:
> On Thu, Jun 21, 2018@09:49:37AM -0600, Jens Axboe wrote:
>> From: Jens Axboe <axboe at kernel.dk>
>> Subject: [PATCH] nvme: limit max IO size and segments for pci-e
>>
>> nvme requires an sg table allocation for each request. If the request
>> is large, then the allocation can become quite large. For instance,
>> with our default software settings of 1280KB IO size, we'll need
>> 10248 bytes of sg table. That turns into a 2nd order allocation,
>> which we can't always guarantee. If we fail the allocation, blk-mq
>> will retry it later. But there's no guarantee that we'll EVER be
>> able to allocate that much contigious memory.
>>
>> Limit the IO size such that we never need more than a single page
>> of memory. That's a lot faster and more reliable. Then back that
>> allocation with a mempool, so that we know we'll always be able
>> to succeed the allocation at some point.
>>
>> Signed-off-by: Jens Axboe <axboe at kernel.dk>
> 
> This looks good to me. Thanks!
> 
> Acked-by: Keith Busch <keith.busch at intel.com>

Thanks Keith - Christoph/Keith, where do you want to queue this up?
I can do it directly, or we can do it through the nvme tree.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH] nvme: use __GFP_NOWARN for iod allocation
  2018-06-21 17:20                               ` Jens Axboe
@ 2018-06-21 17:54                                 ` Christoph Hellwig
  2018-06-21 19:18                                   ` Jens Axboe
  0 siblings, 1 reply; 33+ messages in thread
From: Christoph Hellwig @ 2018-06-21 17:54 UTC (permalink / raw)


On Thu, Jun 21, 2018@11:20:05AM -0600, Jens Axboe wrote:
> Thanks Keith - Christoph/Keith, where do you want to queue this up?
> I can do it directly, or we can do it through the nvme tree.

I've queued it up in the nvme tree already.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH] nvme: use __GFP_NOWARN for iod allocation
  2018-06-21 17:54                                 ` Christoph Hellwig
@ 2018-06-21 19:18                                   ` Jens Axboe
  0 siblings, 0 replies; 33+ messages in thread
From: Jens Axboe @ 2018-06-21 19:18 UTC (permalink / raw)


On 6/21/18 11:54 AM, Christoph Hellwig wrote:
> On Thu, Jun 21, 2018@11:20:05AM -0600, Jens Axboe wrote:
>> Thanks Keith - Christoph/Keith, where do you want to queue this up?
>> I can do it directly, or we can do it through the nvme tree.
> 
> I've queued it up in the nvme tree already.

Perfect, thanks.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH] nvme: use __GFP_NOWARN for iod allocation
  2018-06-20 20:09   ` Jens Axboe
  2018-06-20 20:43     ` Keith Busch
@ 2018-06-20 20:51     ` Jens Axboe
  2018-06-21  7:15       ` Christoph Hellwig
  2018-06-21  7:09     ` Christoph Hellwig
  2 siblings, 1 reply; 33+ messages in thread
From: Jens Axboe @ 2018-06-20 20:51 UTC (permalink / raw)


On 6/20/18 2:09 PM, Jens Axboe wrote:
> On 6/20/18 1:16 PM, Jens Axboe wrote:
>> On 6/20/18 1:10 PM, Jens Axboe wrote:
>>> This begs the question is we should limit the size in general. The
>>> command overhead is low enough that I think we should default to
>>> something sane that doesn't require _any_ > 0 order allocations.
>>>
>>> Our default 1280kb will require 10248 bytes of iod, which is an
>>> order 2 allocation.... That's not helping tail latencies in
>>> data centers, where memory is always full and fragmented.
>>
>> In terms of sizing, defaulting to 256kb might not be a bad idea.
>> That'd be 2056 bytes for a normal config. Alternatively we could
>> go 1280/4 == 320k, which would be 2568 bytes.
>>
>> In any case, I think it's something that's worth doing.
> 
> Talking more to myself - higher order allocations needed to make
> progress isn't a viable approach. There may _never_ be any available,
> which means we are stuck since this isn't backed by a mempool.
> 
> How about something like the below. Default to a size that will
> never require > PAGE_SIZE allocation, and back that with a single
> entry mempool. Even if we don't wait on it, if we keep retrying
> we'll always be able to allocate memory and make forward progress.
> And allocating a single page is a hell of a lot more likely than
> a 2nd order allocation.

This one is actually tested. As a nice benefit, the pool is node
local and we take the divisions out of the fast path since we
no longer need to calculate the iod->sg size.


diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 21710a7460c8..f968c7015202 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1808,8 +1808,9 @@ static void nvme_set_queue_limits(struct nvme_ctrl *ctrl,
 		u32 max_segments =
 			(ctrl->max_hw_sectors / (ctrl->page_size >> 9)) + 1;
 
+		max_segments = min_not_zero(max_segments, (u32)NVME_MAX_SEGS);
 		blk_queue_max_hw_sectors(q, ctrl->max_hw_sectors);
-		blk_queue_max_segments(q, min_t(u32, max_segments, USHRT_MAX));
+		blk_queue_max_segments(q, max_segments);
 	}
 	if ((ctrl->quirks & NVME_QUIRK_STRIPE_SIZE) &&
 	    is_power_of_2(ctrl->max_hw_sectors))
@@ -2378,7 +2379,7 @@ int nvme_init_identify(struct nvme_ctrl *ctrl)
 	if (id->mdts)
 		max_hw_sectors = 1 << (id->mdts + page_shift - 9);
 	else
-		max_hw_sectors = UINT_MAX;
+		max_hw_sectors = NVME_MAX_KB_SZ << 1;
 	ctrl->max_hw_sectors =
 		min_not_zero(ctrl->max_hw_sectors, max_hw_sectors);
 
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 231807cbc849..efce853a6638 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -33,6 +33,13 @@ extern unsigned int admin_timeout;
 #define NVME_DEFAULT_KATO	5
 #define NVME_KATO_GRACE		10
 
+/*
+ * These can be higher, but we need to ensure that any command doesn't
+ * require an sg allocation that needs more than a page of data.
+ */
+#define NVME_MAX_KB_SZ	4096
+#define NVME_MAX_SEGS	127
+
 extern struct workqueue_struct *nvme_wq;
 extern struct workqueue_struct *nvme_reset_wq;
 extern struct workqueue_struct *nvme_delete_wq;
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index fc33804662e7..d5f603b05394 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -100,6 +100,8 @@ struct nvme_dev {
 	struct nvme_ctrl ctrl;
 	struct completion ioq_wait;
 
+	mempool_t *iod_mempool;
+
 	/* shadow doorbell buffer support: */
 	u32 *dbbuf_dbs;
 	dma_addr_t dbbuf_dbs_dma_addr;
@@ -477,10 +479,7 @@ static blk_status_t nvme_init_iod(struct request *rq, struct nvme_dev *dev)
 	iod->use_sgl = nvme_pci_use_sgls(dev, rq);
 
 	if (nseg > NVME_INT_PAGES || size > NVME_INT_BYTES(dev)) {
-		size_t alloc_size = nvme_pci_iod_alloc_size(dev, size, nseg,
-				iod->use_sgl);
-
-		iod->sg = kmalloc(alloc_size, GFP_ATOMIC);
+		iod->sg = mempool_alloc(dev->iod_mempool, GFP_ATOMIC);
 		if (!iod->sg)
 			return BLK_STS_RESOURCE;
 	} else {
@@ -526,7 +525,7 @@ static void nvme_free_iod(struct nvme_dev *dev, struct request *req)
 	}
 
 	if (iod->sg != iod->inline_sg)
-		kfree(iod->sg);
+		mempool_free(iod->sg, dev->iod_mempool);
 }
 
 #ifdef CONFIG_BLK_DEV_INTEGRITY
@@ -2280,6 +2279,7 @@ static void nvme_pci_free_ctrl(struct nvme_ctrl *ctrl)
 		blk_put_queue(dev->ctrl.admin_q);
 	kfree(dev->queues);
 	free_opal_dev(dev->ctrl.opal_dev);
+	mempool_destroy(dev->iod_mempool);
 	kfree(dev);
 }
 
@@ -2509,6 +2509,7 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	int node, result = -ENOMEM;
 	struct nvme_dev *dev;
 	unsigned long quirks = id->driver_data;
+	size_t alloc_size;
 
 	node = dev_to_node(&pdev->dev);
 	if (node == NUMA_NO_NODE)
@@ -2546,6 +2547,18 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	if (result)
 		goto release_pools;
 
+	alloc_size = nvme_pci_iod_alloc_size(dev, NVME_MAX_KB_SZ,
+						NVME_MAX_SEGS, true);
+	WARN_ON_ONCE(alloc_size > PAGE_SIZE);
+	dev->iod_mempool = mempool_create_node(1, mempool_kmalloc,
+						mempool_kfree,
+						(void *) alloc_size,
+						GFP_KERNEL, node);
+	if (!dev->iod_mempool) {
+		result = -ENOMEM;
+		goto release_pools;
+	}
+
 	dev_info(dev->ctrl.device, "pci function %s\n", dev_name(&pdev->dev));
 
 	nvme_get_ctrl(&dev->ctrl);

-- 
Jens Axboe

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH] nvme: use __GFP_NOWARN for iod allocation
  2018-06-20 20:51     ` Jens Axboe
@ 2018-06-21  7:15       ` Christoph Hellwig
  2018-06-21 14:37         ` Jens Axboe
  0 siblings, 1 reply; 33+ messages in thread
From: Christoph Hellwig @ 2018-06-21  7:15 UTC (permalink / raw)


>  	else
> -		max_hw_sectors = UINT_MAX;
> +		max_hw_sectors = NVME_MAX_KB_SZ << 1;

I think this only makes sense when using PRPs.  If we use SGLs
in Fabrics, or in PCIe controllers that support them we are
only bound by the number of SGL segments and MDTS.

This also seems to miss handling the case where MDTS is set, but
above our software limit.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH] nvme: use __GFP_NOWARN for iod allocation
  2018-06-21  7:15       ` Christoph Hellwig
@ 2018-06-21 14:37         ` Jens Axboe
  2018-06-21 14:56           ` Christoph Hellwig
  0 siblings, 1 reply; 33+ messages in thread
From: Jens Axboe @ 2018-06-21 14:37 UTC (permalink / raw)


On 6/21/18 1:15 AM, Christoph Hellwig wrote:
>>  	else
>> -		max_hw_sectors = UINT_MAX;
>> +		max_hw_sectors = NVME_MAX_KB_SZ << 1;
> 
> I think this only makes sense when using PRPs.  If we use SGLs
> in Fabrics, or in PCIe controllers that support them we are
> only bound by the number of SGL segments and MDTS.

The problem is that we have to limit the max size, since the
sg allocation depends on that, marginally. In reality, the segments
will dictate the max size anyway.

> This also seems to miss handling the case where MDTS is set, but
> above our software limit.

Shouldn't, we're using the min of MDTS and the software limit.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH] nvme: use __GFP_NOWARN for iod allocation
  2018-06-21 14:37         ` Jens Axboe
@ 2018-06-21 14:56           ` Christoph Hellwig
  2018-06-21 15:01             ` Jens Axboe
  0 siblings, 1 reply; 33+ messages in thread
From: Christoph Hellwig @ 2018-06-21 14:56 UTC (permalink / raw)


On Thu, Jun 21, 2018@08:37:28AM -0600, Jens Axboe wrote:
> On 6/21/18 1:15 AM, Christoph Hellwig wrote:
> >>  	else
> >> -		max_hw_sectors = UINT_MAX;
> >> +		max_hw_sectors = NVME_MAX_KB_SZ << 1;
> > 
> > I think this only makes sense when using PRPs.  If we use SGLs
> > in Fabrics, or in PCIe controllers that support them we are
> > only bound by the number of SGL segments and MDTS.
> 
> The problem is that we have to limit the max size, since the
> sg allocation depends on that, marginally. In reality, the segments
> will dictate the max size anyway.

The sg allocation depends on the number of segments.  The PRP
allocation (if using PRPs) depends purely on the size + a little
slack for unaligned first or last arguments.

> 
> > This also seems to miss handling the case where MDTS is set, but
> > above our software limit.
> 
> Shouldn't, we're using the min of MDTS and the software limit.

Ok, maybe I misread the patch, but it surely looked like that.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH] nvme: use __GFP_NOWARN for iod allocation
  2018-06-21 14:56           ` Christoph Hellwig
@ 2018-06-21 15:01             ` Jens Axboe
  0 siblings, 0 replies; 33+ messages in thread
From: Jens Axboe @ 2018-06-21 15:01 UTC (permalink / raw)


On 6/21/18 8:56 AM, Christoph Hellwig wrote:
> On Thu, Jun 21, 2018@08:37:28AM -0600, Jens Axboe wrote:
>> On 6/21/18 1:15 AM, Christoph Hellwig wrote:
>>>>  	else
>>>> -		max_hw_sectors = UINT_MAX;
>>>> +		max_hw_sectors = NVME_MAX_KB_SZ << 1;
>>>
>>> I think this only makes sense when using PRPs.  If we use SGLs
>>> in Fabrics, or in PCIe controllers that support them we are
>>> only bound by the number of SGL segments and MDTS.
>>
>> The problem is that we have to limit the max size, since the
>> sg allocation depends on that, marginally. In reality, the segments
>> will dictate the max size anyway.
> 
> The sg allocation depends on the number of segments.  The PRP
> allocation (if using PRPs) depends purely on the size + a little
> slack for unaligned first or last arguments.

Right. So for sg, we can ignore the size, and just limit the
segments.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH] nvme: use __GFP_NOWARN for iod allocation
  2018-06-20 20:09   ` Jens Axboe
  2018-06-20 20:43     ` Keith Busch
  2018-06-20 20:51     ` Jens Axboe
@ 2018-06-21  7:09     ` Christoph Hellwig
  2018-06-21 14:39       ` Jens Axboe
  2 siblings, 1 reply; 33+ messages in thread
From: Christoph Hellwig @ 2018-06-21  7:09 UTC (permalink / raw)


On Wed, Jun 20, 2018@02:09:03PM -0600, Jens Axboe wrote:
> Talking more to myself - higher order allocations needed to make
> progress isn't a viable approach. There may _never_ be any available,
> which means we are stuck since this isn't backed by a mempool.

Yes, I've complained about that a few times in the past, but so far
no one has hit it in practice.

> How about something like the below. Default to a size that will
> never require > PAGE_SIZE allocation, and back that with a single
> entry mempool. Even if we don't wait on it, if we keep retrying
> we'll always be able to allocate memory and make forward progress.
> And allocating a single page is a hell of a lot more likely than
> a 2nd order allocation.

I suspect some people might be unhappy about the limit on I/O
sizes.  That is why we want through all that effort with
chained S/G lists in SCSI, which are now also used by NVMe over
Fabrics.  But except for that we absolutely have to move to
mempools.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH] nvme: use __GFP_NOWARN for iod allocation
  2018-06-21  7:09     ` Christoph Hellwig
@ 2018-06-21 14:39       ` Jens Axboe
  2018-06-21 14:58         ` Christoph Hellwig
  0 siblings, 1 reply; 33+ messages in thread
From: Jens Axboe @ 2018-06-21 14:39 UTC (permalink / raw)


On 6/21/18 1:09 AM, Christoph Hellwig wrote:
> On Wed, Jun 20, 2018@02:09:03PM -0600, Jens Axboe wrote:
>> Talking more to myself - higher order allocations needed to make
>> progress isn't a viable approach. There may _never_ be any available,
>> which means we are stuck since this isn't backed by a mempool.
> 
> Yes, I've complained about that a few times in the past, but so far
> no one has hit it in practice.

Well, I have :-). The problem is that it'll only really happen in
production, since no micro testing will end up with memory fragmented
enough to pose an issue.

>> How about something like the below. Default to a size that will
>> never require > PAGE_SIZE allocation, and back that with a single
>> entry mempool. Even if we don't wait on it, if we keep retrying
>> we'll always be able to allocate memory and make forward progress.
>> And allocating a single page is a hell of a lot more likely than
>> a 2nd order allocation.
> 
> I suspect some people might be unhappy about the limit on I/O
> sizes.  That is why we want through all that effort with
> chained S/G lists in SCSI, which are now also used by NVMe over
> Fabrics.  But except for that we absolutely have to move to
> mempools.

That may be the case, but it's a lot less of an issue than SCSI is,
since the command overhead is suitably small. So I don't see a big issue
with limiting it. Besides, having to do higher order allocations for an
IO is bound to pose much bigger issues than having two smaller IOs
instead. That's especially true in production, where memory allocation
stalls are pretty frequent.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH] nvme: use __GFP_NOWARN for iod allocation
  2018-06-21 14:39       ` Jens Axboe
@ 2018-06-21 14:58         ` Christoph Hellwig
  2018-06-21 15:02           ` Jens Axboe
  0 siblings, 1 reply; 33+ messages in thread
From: Christoph Hellwig @ 2018-06-21 14:58 UTC (permalink / raw)


On Thu, Jun 21, 2018@08:39:47AM -0600, Jens Axboe wrote:
> That may be the case, but it's a lot less of an issue than SCSI is,
> since the command overhead is suitably small. So I don't see a big issue
> with limiting it.


Not if you are using NVMe over FC, which uses exactly the same
protocol to transfer a 64 byte NVMe SQ vs a typically 16 byte SCSI CDB..

> Besides, having to do higher order allocations for an
> IO is bound to pose much bigger issues than having two smaller IOs
> instead. That's especially true in production, where memory allocation
> stalls are pretty frequent.


But the whole point of our chained sg allocator is that we don't do
high order allocations, but multiple smaller ones and then chain
them.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH] nvme: use __GFP_NOWARN for iod allocation
  2018-06-21 14:58         ` Christoph Hellwig
@ 2018-06-21 15:02           ` Jens Axboe
  2018-06-21 15:06             ` Christoph Hellwig
  0 siblings, 1 reply; 33+ messages in thread
From: Jens Axboe @ 2018-06-21 15:02 UTC (permalink / raw)


On 6/21/18 8:58 AM, Christoph Hellwig wrote:
> On Thu, Jun 21, 2018@08:39:47AM -0600, Jens Axboe wrote:
>> That may be the case, but it's a lot less of an issue than SCSI is,
>> since the command overhead is suitably small. So I don't see a big issue
>> with limiting it.
> 
> 
> Not if you are using NVMe over FC, which uses exactly the same
> protocol to transfer a 64 byte NVMe SQ vs a typically 16 byte SCSI CDB..

I should have prefaced that with saying "for pci-e". For fabrics, I
agree.

>> Besides, having to do higher order allocations for an
>> IO is bound to pose much bigger issues than having two smaller IOs
>> instead. That's especially true in production, where memory allocation
>> stalls are pretty frequent.
> 
> 
> But the whole point of our chained sg allocator is that we don't do
> high order allocations, but multiple smaller ones and then chain
> them.

>From a quick look, the fabric code goes through the same nvme_queue_rq()
and ends up doing the same iod->sg allocation. I don't see any use of
chained sg lists, we're allocating one big sg table for the command.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH] nvme: use __GFP_NOWARN for iod allocation
  2018-06-21 15:02           ` Jens Axboe
@ 2018-06-21 15:06             ` Christoph Hellwig
  2018-06-21 15:11               ` Jens Axboe
  0 siblings, 1 reply; 33+ messages in thread
From: Christoph Hellwig @ 2018-06-21 15:06 UTC (permalink / raw)


On Thu, Jun 21, 2018@09:02:18AM -0600, Jens Axboe wrote:
> >From a quick look, the fabric code goes through the same nvme_queue_rq()
> and ends up doing the same iod->sg allocation. I don't see any use of
> chained sg lists, we're allocating one big sg table for the command.

nvme_queue_rq is only used for PCIe.

E.g. for RDMA the call is:

nvme_rdma_queue_rq
 -> nvme_rdma_map_data
  -> sg_alloc_table_chained

similar for FC and loop.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH] nvme: use __GFP_NOWARN for iod allocation
  2018-06-21 15:06             ` Christoph Hellwig
@ 2018-06-21 15:11               ` Jens Axboe
  0 siblings, 0 replies; 33+ messages in thread
From: Jens Axboe @ 2018-06-21 15:11 UTC (permalink / raw)


On 6/21/18 9:06 AM, Christoph Hellwig wrote:
> On Thu, Jun 21, 2018@09:02:18AM -0600, Jens Axboe wrote:
>> >From a quick look, the fabric code goes through the same nvme_queue_rq()
>> and ends up doing the same iod->sg allocation. I don't see any use of
>> chained sg lists, we're allocating one big sg table for the command.
> 
> nvme_queue_rq is only used for PCIe.
> 
> E.g. for RDMA the call is:
> 
> nvme_rdma_queue_rq
>  -> nvme_rdma_map_data
>   -> sg_alloc_table_chained
> 
> similar for FC and loop.

Gotcha. So we can safely ignore limiting segments and max size for
fabrics for now.

How about the below?


diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 21710a7460c8..f1af08171d38 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1799,6 +1799,16 @@ int nvme_shutdown_ctrl(struct nvme_ctrl *ctrl)
 }
 EXPORT_SYMBOL_GPL(nvme_shutdown_ctrl);
 
+/*
+ * For non-fabrics, limit the max IO size and segments so that we know our
+ * iod->sg allocation can fall within a single page. Fabrics is more sensitive
+ * to command size, and it also does sg chaining to avoid large allocations.
+ */
+static bool nvme_ctrl_limit_io_size(struct nvme_ctrl *ctrl)
+{
+	return (ctrl->ops->flags & NVME_F_FABRICS) == 0;
+}
+
 static void nvme_set_queue_limits(struct nvme_ctrl *ctrl,
 		struct request_queue *q)
 {
@@ -1808,8 +1818,11 @@ static void nvme_set_queue_limits(struct nvme_ctrl *ctrl,
 		u32 max_segments =
 			(ctrl->max_hw_sectors / (ctrl->page_size >> 9)) + 1;
 
+		if (nvme_ctrl_limit_io_size(ctrl))
+			max_segments = min_not_zero(max_segments,
+							(u32)NVME_MAX_SEGS);
 		blk_queue_max_hw_sectors(q, ctrl->max_hw_sectors);
-		blk_queue_max_segments(q, min_t(u32, max_segments, USHRT_MAX));
+		blk_queue_max_segments(q, max_segments);
 	}
 	if ((ctrl->quirks & NVME_QUIRK_STRIPE_SIZE) &&
 	    is_power_of_2(ctrl->max_hw_sectors))
@@ -2377,8 +2390,10 @@ int nvme_init_identify(struct nvme_ctrl *ctrl)
 	ctrl->cntlid = le16_to_cpup(&id->cntlid);
 	if (id->mdts)
 		max_hw_sectors = 1 << (id->mdts + page_shift - 9);
-	else
+	else if (!nvme_ctrl_limit_io_size(ctrl))
 		max_hw_sectors = UINT_MAX;
+	else
+		max_hw_sectors = NVME_MAX_KB_SZ << 1;
 	ctrl->max_hw_sectors =
 		min_not_zero(ctrl->max_hw_sectors, max_hw_sectors);
 
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 231807cbc849..efce853a6638 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -33,6 +33,13 @@ extern unsigned int admin_timeout;
 #define NVME_DEFAULT_KATO	5
 #define NVME_KATO_GRACE		10
 
+/*
+ * These can be higher, but we need to ensure that any command doesn't
+ * require an sg allocation that needs more than a page of data.
+ */
+#define NVME_MAX_KB_SZ	4096
+#define NVME_MAX_SEGS	127
+
 extern struct workqueue_struct *nvme_wq;
 extern struct workqueue_struct *nvme_reset_wq;
 extern struct workqueue_struct *nvme_delete_wq;
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index fc33804662e7..663446d55a49 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -100,6 +100,8 @@ struct nvme_dev {
 	struct nvme_ctrl ctrl;
 	struct completion ioq_wait;
 
+	mempool_t *iod_mempool;
+
 	/* shadow doorbell buffer support: */
 	u32 *dbbuf_dbs;
 	dma_addr_t dbbuf_dbs_dma_addr;
@@ -477,10 +479,7 @@ static blk_status_t nvme_init_iod(struct request *rq, struct nvme_dev *dev)
 	iod->use_sgl = nvme_pci_use_sgls(dev, rq);
 
 	if (nseg > NVME_INT_PAGES || size > NVME_INT_BYTES(dev)) {
-		size_t alloc_size = nvme_pci_iod_alloc_size(dev, size, nseg,
-				iod->use_sgl);
-
-		iod->sg = kmalloc(alloc_size, GFP_ATOMIC);
+		iod->sg = mempool_alloc(dev->iod_mempool, GFP_ATOMIC);
 		if (!iod->sg)
 			return BLK_STS_RESOURCE;
 	} else {
@@ -526,7 +525,7 @@ static void nvme_free_iod(struct nvme_dev *dev, struct request *req)
 	}
 
 	if (iod->sg != iod->inline_sg)
-		kfree(iod->sg);
+		mempool_free(iod->sg, dev->iod_mempool);
 }
 
 #ifdef CONFIG_BLK_DEV_INTEGRITY
@@ -2280,6 +2279,7 @@ static void nvme_pci_free_ctrl(struct nvme_ctrl *ctrl)
 		blk_put_queue(dev->ctrl.admin_q);
 	kfree(dev->queues);
 	free_opal_dev(dev->ctrl.opal_dev);
+	mempool_destroy(dev->iod_mempool);
 	kfree(dev);
 }
 
@@ -2509,6 +2509,7 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	int node, result = -ENOMEM;
 	struct nvme_dev *dev;
 	unsigned long quirks = id->driver_data;
+	size_t alloc_size;
 
 	node = dev_to_node(&pdev->dev);
 	if (node == NUMA_NO_NODE)
@@ -2546,6 +2547,23 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	if (result)
 		goto release_pools;
 
+	/*
+	 * Double check that our mempool alloc size will cover the biggest
+	 * command we support.
+	 */
+	alloc_size = nvme_pci_iod_alloc_size(dev, NVME_MAX_KB_SZ,
+						NVME_MAX_SEGS, true);
+	WARN_ON_ONCE(alloc_size > PAGE_SIZE);
+
+	dev->iod_mempool = mempool_create_node(1, mempool_kmalloc,
+						mempool_kfree,
+						(void *) alloc_size,
+						GFP_KERNEL, node);
+	if (!dev->iod_mempool) {
+		result = -ENOMEM;
+		goto release_pools;
+	}
+
 	dev_info(dev->ctrl.device, "pci function %s\n", dev_name(&pdev->dev));
 
 	nvme_get_ctrl(&dev->ctrl);

-- 
Jens Axboe

^ permalink raw reply related	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2018-06-21 19:18 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2018-06-20 19:10 [PATCH] nvme: use __GFP_NOWARN for iod allocation Jens Axboe
2018-06-20 19:16 ` Jens Axboe
2018-06-20 20:09   ` Jens Axboe
2018-06-20 20:43     ` Keith Busch
2018-06-20 20:54       ` Jens Axboe
2018-06-20 22:30         ` Keith Busch
2018-06-20 22:30           ` Jens Axboe
2018-06-21  7:10             ` Christoph Hellwig
2018-06-21 14:47               ` Jens Axboe
2018-06-21 15:01                 ` Christoph Hellwig
2018-06-21 15:04                   ` Jens Axboe
2018-06-21 15:06                     ` Jens Axboe
2018-06-21 15:24                       ` Christoph Hellwig
2018-06-21 15:26                         ` Jens Axboe
2018-06-21 15:24                     ` Christoph Hellwig
2018-06-21 15:34                       ` Jens Axboe
2018-06-21 15:45                         ` Christoph Hellwig
2018-06-21 15:49                           ` Jens Axboe
2018-06-21 16:27                             ` Keith Busch
2018-06-21 17:20                               ` Jens Axboe
2018-06-21 17:54                                 ` Christoph Hellwig
2018-06-21 19:18                                   ` Jens Axboe
2018-06-20 20:51     ` Jens Axboe
2018-06-21  7:15       ` Christoph Hellwig
2018-06-21 14:37         ` Jens Axboe
2018-06-21 14:56           ` Christoph Hellwig
2018-06-21 15:01             ` Jens Axboe
2018-06-21  7:09     ` Christoph Hellwig
2018-06-21 14:39       ` Jens Axboe
2018-06-21 14:58         ` Christoph Hellwig
2018-06-21 15:02           ` Jens Axboe
2018-06-21 15:06             ` Christoph Hellwig
2018-06-21 15:11               ` Jens Axboe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox