* [PATCH] nvme: use __GFP_NOWARN for iod allocation @ 2018-06-20 19:10 Jens Axboe 2018-06-20 19:16 ` Jens Axboe 0 siblings, 1 reply; 33+ messages in thread From: Jens Axboe @ 2018-06-20 19:10 UTC (permalink / raw) The iod can get quite large, requiring higher order allocations that we know can fail. Don't warn about them, as that just tends to flod the log with information we don't care about. Signed-off-by: Jens Axboe <axboe at kernel.dk> --- This begs the question is we should limit the size in general. The command overhead is low enough that I think we should default to something sane that doesn't require _any_ > 0 order allocations. Our default 1280kb will require 10248 bytes of iod, which is an order 2 allocation.... That's not helping tail latencies in data centers, where memory is always full and fragmented. diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index fc33804662e7..1c97d6356484 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -480,7 +480,7 @@ static blk_status_t nvme_init_iod(struct request *rq, struct nvme_dev *dev) size_t alloc_size = nvme_pci_iod_alloc_size(dev, size, nseg, iod->use_sgl); - iod->sg = kmalloc(alloc_size, GFP_ATOMIC); + iod->sg = kmalloc(alloc_size, GFP_ATOMIC | __GFP_NOWARN); if (!iod->sg) return BLK_STS_RESOURCE; } else { -- Jens Axboe ^ permalink raw reply related [flat|nested] 33+ messages in thread
* [PATCH] nvme: use __GFP_NOWARN for iod allocation 2018-06-20 19:10 [PATCH] nvme: use __GFP_NOWARN for iod allocation Jens Axboe @ 2018-06-20 19:16 ` Jens Axboe 2018-06-20 20:09 ` Jens Axboe 0 siblings, 1 reply; 33+ messages in thread From: Jens Axboe @ 2018-06-20 19:16 UTC (permalink / raw) On 6/20/18 1:10 PM, Jens Axboe wrote: > This begs the question is we should limit the size in general. The > command overhead is low enough that I think we should default to > something sane that doesn't require _any_ > 0 order allocations. > > Our default 1280kb will require 10248 bytes of iod, which is an > order 2 allocation.... That's not helping tail latencies in > data centers, where memory is always full and fragmented. In terms of sizing, defaulting to 256kb might not be a bad idea. That'd be 2056 bytes for a normal config. Alternatively we could go 1280/4 == 320k, which would be 2568 bytes. In any case, I think it's something that's worth doing. -- Jens Axboe ^ permalink raw reply [flat|nested] 33+ messages in thread
* [PATCH] nvme: use __GFP_NOWARN for iod allocation 2018-06-20 19:16 ` Jens Axboe @ 2018-06-20 20:09 ` Jens Axboe 2018-06-20 20:43 ` Keith Busch ` (2 more replies) 0 siblings, 3 replies; 33+ messages in thread From: Jens Axboe @ 2018-06-20 20:09 UTC (permalink / raw) On 6/20/18 1:16 PM, Jens Axboe wrote: > On 6/20/18 1:10 PM, Jens Axboe wrote: >> This begs the question is we should limit the size in general. The >> command overhead is low enough that I think we should default to >> something sane that doesn't require _any_ > 0 order allocations. >> >> Our default 1280kb will require 10248 bytes of iod, which is an >> order 2 allocation.... That's not helping tail latencies in >> data centers, where memory is always full and fragmented. > > In terms of sizing, defaulting to 256kb might not be a bad idea. > That'd be 2056 bytes for a normal config. Alternatively we could > go 1280/4 == 320k, which would be 2568 bytes. > > In any case, I think it's something that's worth doing. Talking more to myself - higher order allocations needed to make progress isn't a viable approach. There may _never_ be any available, which means we are stuck since this isn't backed by a mempool. How about something like the below. Default to a size that will never require > PAGE_SIZE allocation, and back that with a single entry mempool. Even if we don't wait on it, if we keep retrying we'll always be able to allocate memory and make forward progress. And allocating a single page is a hell of a lot more likely than a 2nd order allocation. diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index 21710a7460c8..fe301423d06c 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -37,6 +37,12 @@ #define NVME_MINORS (1U << MINORBITS) +/* + * This can be higher, but we need to ensure that any command doesn't + * require an iod->sg allocation that needs more than a page of aux data. + */ +#define NVME_MAX_SECTORS 512 + unsigned int admin_timeout = 60; module_param(admin_timeout, uint, 0644); MODULE_PARM_DESC(admin_timeout, "timeout in seconds for admin commands"); @@ -2378,7 +2384,7 @@ int nvme_init_identify(struct nvme_ctrl *ctrl) if (id->mdts) max_hw_sectors = 1 << (id->mdts + page_shift - 9); else - max_hw_sectors = UINT_MAX; + max_hw_sectors = NVME_MAX_SECTORS; ctrl->max_hw_sectors = min_not_zero(ctrl->max_hw_sectors, max_hw_sectors); diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index fc33804662e7..603998e7d7a6 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -100,6 +100,8 @@ struct nvme_dev { struct nvme_ctrl ctrl; struct completion ioq_wait; + mempool_t *iod_mempool; + /* shadow doorbell buffer support: */ u32 *dbbuf_dbs; dma_addr_t dbbuf_dbs_dma_addr; @@ -477,10 +479,10 @@ static blk_status_t nvme_init_iod(struct request *rq, struct nvme_dev *dev) iod->use_sgl = nvme_pci_use_sgls(dev, rq); if (nseg > NVME_INT_PAGES || size > NVME_INT_BYTES(dev)) { - size_t alloc_size = nvme_pci_iod_alloc_size(dev, size, nseg, - iod->use_sgl); - - iod->sg = kmalloc(alloc_size, GFP_ATOMIC); + BUG_ON(nvme_pci_iod_alloc_size(dev, size, + blk_rq_nr_phys_segments(rq), + iod->use_sgl) > PAGE_SIZE); + iod->sg = mempool_alloc(dev->iod_mempool, GFP_ATOMIC); if (!iod->sg) return BLK_STS_RESOURCE; } else { @@ -526,7 +528,7 @@ static void nvme_free_iod(struct nvme_dev *dev, struct request *req) } if (iod->sg != iod->inline_sg) - kfree(iod->sg); + mempool_free(iod->sg, dev->iod_mempool); } #ifdef CONFIG_BLK_DEV_INTEGRITY @@ -2280,6 +2282,7 @@ static void nvme_pci_free_ctrl(struct nvme_ctrl *ctrl) blk_put_queue(dev->ctrl.admin_q); kfree(dev->queues); free_opal_dev(dev->ctrl.opal_dev); + mempool_destroy(dev->iod_mempool); kfree(dev); } @@ -2546,6 +2549,14 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id) if (result) goto release_pools; + dev->iod_mempool = mempool_create_node(1, mempool_alloc_pages, + mempool_free_pages, (void *) 0, + GFP_ATOMIC, node); + if (!dev->iod_mempool) { + result = -ENOMEM; + goto release_pools; + } + dev_info(dev->ctrl.device, "pci function %s\n", dev_name(&pdev->dev)); nvme_get_ctrl(&dev->ctrl); -- Jens Axboe ^ permalink raw reply related [flat|nested] 33+ messages in thread
* [PATCH] nvme: use __GFP_NOWARN for iod allocation 2018-06-20 20:09 ` Jens Axboe @ 2018-06-20 20:43 ` Keith Busch 2018-06-20 20:54 ` Jens Axboe 2018-06-20 20:51 ` Jens Axboe 2018-06-21 7:09 ` Christoph Hellwig 2 siblings, 1 reply; 33+ messages in thread From: Keith Busch @ 2018-06-20 20:43 UTC (permalink / raw) On Wed, Jun 20, 2018@02:09:03PM -0600, Jens Axboe wrote: > On 6/20/18 1:16 PM, Jens Axboe wrote: > > On 6/20/18 1:10 PM, Jens Axboe wrote: > >> This begs the question is we should limit the size in general. The > >> command overhead is low enough that I think we should default to > >> something sane that doesn't require _any_ > 0 order allocations. > >> > >> Our default 1280kb will require 10248 bytes of iod, which is an > >> order 2 allocation.... That's not helping tail latencies in > >> data centers, where memory is always full and fragmented. > > > > In terms of sizing, defaulting to 256kb might not be a bad idea. > > That'd be 2056 bytes for a normal config. Alternatively we could > > go 1280/4 == 320k, which would be 2568 bytes. > > > > In any case, I think it's something that's worth doing. > > Talking more to myself - higher order allocations needed to make > progress isn't a viable approach. There may _never_ be any available, > which means we are stuck since this isn't backed by a mempool. > > How about something like the below. Default to a size that will > never require > PAGE_SIZE allocation, and back that with a single > entry mempool. Even if we don't wait on it, if we keep retrying > we'll always be able to allocate memory and make forward progress. > And allocating a single page is a hell of a lot more likely than > a 2nd order allocation. If we go this route, there is a great deal of cleanup that can follow on. For example, we'll never need to chain PRPs, so nvme_iod setup/free can become a lot simpler, and the struct iod npages and hidden "iod_list" member can go away. ^ permalink raw reply [flat|nested] 33+ messages in thread
* [PATCH] nvme: use __GFP_NOWARN for iod allocation 2018-06-20 20:43 ` Keith Busch @ 2018-06-20 20:54 ` Jens Axboe 2018-06-20 22:30 ` Keith Busch 0 siblings, 1 reply; 33+ messages in thread From: Jens Axboe @ 2018-06-20 20:54 UTC (permalink / raw) On 6/20/18 2:43 PM, Keith Busch wrote: > On Wed, Jun 20, 2018@02:09:03PM -0600, Jens Axboe wrote: >> On 6/20/18 1:16 PM, Jens Axboe wrote: >>> On 6/20/18 1:10 PM, Jens Axboe wrote: >>>> This begs the question is we should limit the size in general. The >>>> command overhead is low enough that I think we should default to >>>> something sane that doesn't require _any_ > 0 order allocations. >>>> >>>> Our default 1280kb will require 10248 bytes of iod, which is an >>>> order 2 allocation.... That's not helping tail latencies in >>>> data centers, where memory is always full and fragmented. >>> >>> In terms of sizing, defaulting to 256kb might not be a bad idea. >>> That'd be 2056 bytes for a normal config. Alternatively we could >>> go 1280/4 == 320k, which would be 2568 bytes. >>> >>> In any case, I think it's something that's worth doing. >> >> Talking more to myself - higher order allocations needed to make >> progress isn't a viable approach. There may _never_ be any available, >> which means we are stuck since this isn't backed by a mempool. >> >> How about something like the below. Default to a size that will >> never require > PAGE_SIZE allocation, and back that with a single >> entry mempool. Even if we don't wait on it, if we keep retrying >> we'll always be able to allocate memory and make forward progress. >> And allocating a single page is a hell of a lot more likely than >> a 2nd order allocation. > > If we go this route, there is a great deal of cleanup that can follow > on. For example, we'll never need to chain PRPs, so nvme_iod setup/free > can become a lot simpler, and the struct iod npages and hidden "iod_list" > member can go away. Missed this one before sending the other one out - yeah, I agree. And since we already have the inline sg for smaller commands, I think it makes a lot of sense to move in this direction. I should also mention that this came out of a real life hang on a production box. I suppose we haven't seen a lot of these since a lot of nvme controllers have a hw limit of 128k, but the ones that don't can pretty easily hit the hang of needing 2^2 pages and failing miserably forever. -- Jens Axboe ^ permalink raw reply [flat|nested] 33+ messages in thread
* [PATCH] nvme: use __GFP_NOWARN for iod allocation 2018-06-20 20:54 ` Jens Axboe @ 2018-06-20 22:30 ` Keith Busch 2018-06-20 22:30 ` Jens Axboe 0 siblings, 1 reply; 33+ messages in thread From: Keith Busch @ 2018-06-20 22:30 UTC (permalink / raw) On Wed, Jun 20, 2018@02:54:26PM -0600, Jens Axboe wrote: > I should also mention that this came out of a real life hang > on a production box. I suppose we haven't seen a lot of these > since a lot of nvme controllers have a hw limit of 128k, but > the ones that don't can pretty easily hit the hang of needing > 2^2 pages and failing miserably forever. I can certainly believe that. I'm totally fine with setting a smaller upper bound on the IO size. There really isn't a meaningful bandwidth advantage to larger transfers on NVMe PCIe once you get over 256k anyway. Since you're making the change to core, let's wait to hear if there are any concerns on the fabrics front. ^ permalink raw reply [flat|nested] 33+ messages in thread
* [PATCH] nvme: use __GFP_NOWARN for iod allocation 2018-06-20 22:30 ` Keith Busch @ 2018-06-20 22:30 ` Jens Axboe 2018-06-21 7:10 ` Christoph Hellwig 0 siblings, 1 reply; 33+ messages in thread From: Jens Axboe @ 2018-06-20 22:30 UTC (permalink / raw) On 6/20/18 4:30 PM, Keith Busch wrote: > On Wed, Jun 20, 2018@02:54:26PM -0600, Jens Axboe wrote: >> I should also mention that this came out of a real life hang >> on a production box. I suppose we haven't seen a lot of these >> since a lot of nvme controllers have a hw limit of 128k, but >> the ones that don't can pretty easily hit the hang of needing >> 2^2 pages and failing miserably forever. > > I can certainly believe that. I'm totally fine with setting a smaller > upper bound on the IO size. There really isn't a meaningful bandwidth > advantage to larger transfers on NVMe PCIe once you get over 256k anyway. Right. We can retain most of the maximum size there, but the segments must go down. But functionally it's the same, and we're well within the region that should be plenty fast. > Since you're making the change to core, let's wait to hear if there are > any concerns on the fabrics front. If the fabrics folks have concerns, we could make it dependent on it being a pci device. That's the case I care about anyway, and the one that's currently breaking for me. -- Jens Axboe ^ permalink raw reply [flat|nested] 33+ messages in thread
* [PATCH] nvme: use __GFP_NOWARN for iod allocation 2018-06-20 22:30 ` Jens Axboe @ 2018-06-21 7:10 ` Christoph Hellwig 2018-06-21 14:47 ` Jens Axboe 0 siblings, 1 reply; 33+ messages in thread From: Christoph Hellwig @ 2018-06-21 7:10 UTC (permalink / raw) On Wed, Jun 20, 2018@04:30:18PM -0600, Jens Axboe wrote: > If the fabrics folks have concerns, we could make it dependent on > it being a pci device. That's the case I care about anyway, and the > one that's currently breaking for me. Fabrics uses the SG chain allocator that SCSI also uses, so it should already be fine, including for larger I/O sizes. ^ permalink raw reply [flat|nested] 33+ messages in thread
* [PATCH] nvme: use __GFP_NOWARN for iod allocation 2018-06-21 7:10 ` Christoph Hellwig @ 2018-06-21 14:47 ` Jens Axboe 2018-06-21 15:01 ` Christoph Hellwig 0 siblings, 1 reply; 33+ messages in thread From: Jens Axboe @ 2018-06-21 14:47 UTC (permalink / raw) On 6/21/18 1:10 AM, Christoph Hellwig wrote: > On Wed, Jun 20, 2018@04:30:18PM -0600, Jens Axboe wrote: >> If the fabrics folks have concerns, we could make it dependent on >> it being a pci device. That's the case I care about anyway, and the >> one that's currently breaking for me. > > Fabrics uses the SG chain allocator that SCSI also uses, so it > should already be fine, including for larger I/O sizes. I'll change it to ignore it for fabrics. What's the best way to check? ctrl->sqsize? -- Jens Axboe ^ permalink raw reply [flat|nested] 33+ messages in thread
* [PATCH] nvme: use __GFP_NOWARN for iod allocation 2018-06-21 14:47 ` Jens Axboe @ 2018-06-21 15:01 ` Christoph Hellwig 2018-06-21 15:04 ` Jens Axboe 0 siblings, 1 reply; 33+ messages in thread From: Christoph Hellwig @ 2018-06-21 15:01 UTC (permalink / raw) On Thu, Jun 21, 2018@08:47:34AM -0600, Jens Axboe wrote: > On 6/21/18 1:10 AM, Christoph Hellwig wrote: > > On Wed, Jun 20, 2018@04:30:18PM -0600, Jens Axboe wrote: > >> If the fabrics folks have concerns, we could make it dependent on > >> it being a pci device. That's the case I care about anyway, and the > >> one that's currently breaking for me. > > > > Fabrics uses the SG chain allocator that SCSI also uses, so it > > should already be fine, including for larger I/O sizes. > > I'll change it to ignore it for fabrics. What's the best way to > check? ctrl->sqsize? ctrl->ops->flags & NVME_F_FABRICS but in the end we probably want a separate flag, as that is an implementation detail, and not because these are fabrics drivers. I'll be spending the rest of the day on phone calls, so it'll take some time to do more on this than writing some email, but I'll try to find some time to look into these issues. ^ permalink raw reply [flat|nested] 33+ messages in thread
* [PATCH] nvme: use __GFP_NOWARN for iod allocation 2018-06-21 15:01 ` Christoph Hellwig @ 2018-06-21 15:04 ` Jens Axboe 2018-06-21 15:06 ` Jens Axboe 2018-06-21 15:24 ` Christoph Hellwig 0 siblings, 2 replies; 33+ messages in thread From: Jens Axboe @ 2018-06-21 15:04 UTC (permalink / raw) On 6/21/18 9:01 AM, Christoph Hellwig wrote: > On Thu, Jun 21, 2018@08:47:34AM -0600, Jens Axboe wrote: >> On 6/21/18 1:10 AM, Christoph Hellwig wrote: >>> On Wed, Jun 20, 2018@04:30:18PM -0600, Jens Axboe wrote: >>>> If the fabrics folks have concerns, we could make it dependent on >>>> it being a pci device. That's the case I care about anyway, and the >>>> one that's currently breaking for me. >>> >>> Fabrics uses the SG chain allocator that SCSI also uses, so it >>> should already be fine, including for larger I/O sizes. >> >> I'll change it to ignore it for fabrics. What's the best way to >> check? ctrl->sqsize? > > ctrl->ops->flags & NVME_F_FABRICS > > but in the end we probably want a separate flag, as that is an > implementation detail, and not because these are fabrics drivers. Probably fine for now with a comment, since it is pretty tightly coupled to the fact that we prefer larger commands on fabrics, and don't care so much about that for a local controller. For now the check is moot though, since we are doing the same allocations... > I'll be spending the rest of the day on phone calls, so it'll take > some time to do more on this than writing some email, but I'll try > to find some time to look into these issues. OK sounds good, would be nice to get this hashed out as I think some variant of this should go into 4.18. -- Jens Axboe ^ permalink raw reply [flat|nested] 33+ messages in thread
* [PATCH] nvme: use __GFP_NOWARN for iod allocation 2018-06-21 15:04 ` Jens Axboe @ 2018-06-21 15:06 ` Jens Axboe 2018-06-21 15:24 ` Christoph Hellwig 2018-06-21 15:24 ` Christoph Hellwig 1 sibling, 1 reply; 33+ messages in thread From: Jens Axboe @ 2018-06-21 15:06 UTC (permalink / raw) On 6/21/18 9:04 AM, Jens Axboe wrote: > On 6/21/18 9:01 AM, Christoph Hellwig wrote: >> On Thu, Jun 21, 2018@08:47:34AM -0600, Jens Axboe wrote: >>> On 6/21/18 1:10 AM, Christoph Hellwig wrote: >>>> On Wed, Jun 20, 2018@04:30:18PM -0600, Jens Axboe wrote: >>>>> If the fabrics folks have concerns, we could make it dependent on >>>>> it being a pci device. That's the case I care about anyway, and the >>>>> one that's currently breaking for me. >>>> >>>> Fabrics uses the SG chain allocator that SCSI also uses, so it >>>> should already be fine, including for larger I/O sizes. >>> >>> I'll change it to ignore it for fabrics. What's the best way to >>> check? ctrl->sqsize? >> >> ctrl->ops->flags & NVME_F_FABRICS >> >> but in the end we probably want a separate flag, as that is an >> implementation detail, and not because these are fabrics drivers. > > Probably fine for now with a comment, since it is pretty tightly > coupled to the fact that we prefer larger commands on fabrics, and > don't care so much about that for a local controller. > > For now the check is moot though, since we are doing the same > allocations... For size I guess it's fine, as long as we limit segments. For reference, current version below. diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index 21710a7460c8..f40d46668532 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -1808,8 +1808,9 @@ static void nvme_set_queue_limits(struct nvme_ctrl *ctrl, u32 max_segments = (ctrl->max_hw_sectors / (ctrl->page_size >> 9)) + 1; + max_segments = min_not_zero(max_segments, (u32)NVME_MAX_SEGS); blk_queue_max_hw_sectors(q, ctrl->max_hw_sectors); - blk_queue_max_segments(q, min_t(u32, max_segments, USHRT_MAX)); + blk_queue_max_segments(q, max_segments); } if ((ctrl->quirks & NVME_QUIRK_STRIPE_SIZE) && is_power_of_2(ctrl->max_hw_sectors)) @@ -2377,8 +2378,10 @@ int nvme_init_identify(struct nvme_ctrl *ctrl) ctrl->cntlid = le16_to_cpup(&id->cntlid); if (id->mdts) max_hw_sectors = 1 << (id->mdts + page_shift - 9); - else + else if (ctrl->ops->flags & NVME_F_FABRICS) max_hw_sectors = UINT_MAX; + else + max_hw_sectors = NVME_MAX_KB_SZ << 1; ctrl->max_hw_sectors = min_not_zero(ctrl->max_hw_sectors, max_hw_sectors); diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h index 231807cbc849..efce853a6638 100644 --- a/drivers/nvme/host/nvme.h +++ b/drivers/nvme/host/nvme.h @@ -33,6 +33,13 @@ extern unsigned int admin_timeout; #define NVME_DEFAULT_KATO 5 #define NVME_KATO_GRACE 10 +/* + * These can be higher, but we need to ensure that any command doesn't + * require an sg allocation that needs more than a page of data. + */ +#define NVME_MAX_KB_SZ 4096 +#define NVME_MAX_SEGS 127 + extern struct workqueue_struct *nvme_wq; extern struct workqueue_struct *nvme_reset_wq; extern struct workqueue_struct *nvme_delete_wq; diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index fc33804662e7..d5f603b05394 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -100,6 +100,8 @@ struct nvme_dev { struct nvme_ctrl ctrl; struct completion ioq_wait; + mempool_t *iod_mempool; + /* shadow doorbell buffer support: */ u32 *dbbuf_dbs; dma_addr_t dbbuf_dbs_dma_addr; @@ -477,10 +479,7 @@ static blk_status_t nvme_init_iod(struct request *rq, struct nvme_dev *dev) iod->use_sgl = nvme_pci_use_sgls(dev, rq); if (nseg > NVME_INT_PAGES || size > NVME_INT_BYTES(dev)) { - size_t alloc_size = nvme_pci_iod_alloc_size(dev, size, nseg, - iod->use_sgl); - - iod->sg = kmalloc(alloc_size, GFP_ATOMIC); + iod->sg = mempool_alloc(dev->iod_mempool, GFP_ATOMIC); if (!iod->sg) return BLK_STS_RESOURCE; } else { @@ -526,7 +525,7 @@ static void nvme_free_iod(struct nvme_dev *dev, struct request *req) } if (iod->sg != iod->inline_sg) - kfree(iod->sg); + mempool_free(iod->sg, dev->iod_mempool); } #ifdef CONFIG_BLK_DEV_INTEGRITY @@ -2280,6 +2279,7 @@ static void nvme_pci_free_ctrl(struct nvme_ctrl *ctrl) blk_put_queue(dev->ctrl.admin_q); kfree(dev->queues); free_opal_dev(dev->ctrl.opal_dev); + mempool_destroy(dev->iod_mempool); kfree(dev); } @@ -2509,6 +2509,7 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id) int node, result = -ENOMEM; struct nvme_dev *dev; unsigned long quirks = id->driver_data; + size_t alloc_size; node = dev_to_node(&pdev->dev); if (node == NUMA_NO_NODE) @@ -2546,6 +2547,18 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id) if (result) goto release_pools; + alloc_size = nvme_pci_iod_alloc_size(dev, NVME_MAX_KB_SZ, + NVME_MAX_SEGS, true); + WARN_ON_ONCE(alloc_size > PAGE_SIZE); + dev->iod_mempool = mempool_create_node(1, mempool_kmalloc, + mempool_kfree, + (void *) alloc_size, + GFP_KERNEL, node); + if (!dev->iod_mempool) { + result = -ENOMEM; + goto release_pools; + } + dev_info(dev->ctrl.device, "pci function %s\n", dev_name(&pdev->dev)); nvme_get_ctrl(&dev->ctrl); -- Jens Axboe ^ permalink raw reply related [flat|nested] 33+ messages in thread
* [PATCH] nvme: use __GFP_NOWARN for iod allocation 2018-06-21 15:06 ` Jens Axboe @ 2018-06-21 15:24 ` Christoph Hellwig 2018-06-21 15:26 ` Jens Axboe 0 siblings, 1 reply; 33+ messages in thread From: Christoph Hellwig @ 2018-06-21 15:24 UTC (permalink / raw) I guess we can live with it for now with a big comment > ctrl->cntlid = le16_to_cpup(&id->cntlid); > if (id->mdts) > max_hw_sectors = 1 << (id->mdts + page_shift - 9); > - else > + else if (ctrl->ops->flags & NVME_F_FABRICS) > max_hw_sectors = UINT_MAX; > + else > + max_hw_sectors = NVME_MAX_KB_SZ << 1; Can this I think is better done by setting ctrl->max_hw_sectors in the PCIe driver (similar to what the other drivers already do) and then rely on the min_not_zero below to factor in MDTS. > +/* > + * These can be higher, but we need to ensure that any command doesn't > + * require an sg allocation that needs more than a page of data. > + */ > +#define NVME_MAX_KB_SZ 4096 I think this should move the pci.c > +#define NVME_MAX_SEGS 127 And this could use a good explanation. Especially as I think that we shouldn't apply the limit for fabrics. So instead we might just add a new ctrl->max_segments field, which PCI sets, and which then is min() with the otherwise calculated value. ^ permalink raw reply [flat|nested] 33+ messages in thread
* [PATCH] nvme: use __GFP_NOWARN for iod allocation 2018-06-21 15:24 ` Christoph Hellwig @ 2018-06-21 15:26 ` Jens Axboe 0 siblings, 0 replies; 33+ messages in thread From: Jens Axboe @ 2018-06-21 15:26 UTC (permalink / raw) On 6/21/18 9:24 AM, Christoph Hellwig wrote: > I guess we can live with it for now with a big comment > >> ctrl->cntlid = le16_to_cpup(&id->cntlid); >> if (id->mdts) >> max_hw_sectors = 1 << (id->mdts + page_shift - 9); >> - else >> + else if (ctrl->ops->flags & NVME_F_FABRICS) >> max_hw_sectors = UINT_MAX; >> + else >> + max_hw_sectors = NVME_MAX_KB_SZ << 1; > > Can this I think is better done by setting ctrl->max_hw_sectors > in the PCIe driver (similar to what the other drivers already do) > and then rely on the min_not_zero below to factor in MDTS. Yeah, I can put it there, and we can get rid of the check here. Maybe do the same for segments, adding a ctrl->max_segments. If non-zero, use that in init_ctrl(). >> +/* >> + * These can be higher, but we need to ensure that any command doesn't >> + * require an sg allocation that needs more than a page of data. >> + */ >> +#define NVME_MAX_KB_SZ 4096 > > I think this should move the pci.c With the above change it can. >> +#define NVME_MAX_SEGS 127 > > And this could use a good explanation. Especially as I think > that we shouldn't apply the limit for fabrics. So instead we > might just add a new ctrl->max_segments field, which PCI sets, > and which then is min() with the otherwise calculated value. I'll add that. -- Jens Axboe ^ permalink raw reply [flat|nested] 33+ messages in thread
* [PATCH] nvme: use __GFP_NOWARN for iod allocation 2018-06-21 15:04 ` Jens Axboe 2018-06-21 15:06 ` Jens Axboe @ 2018-06-21 15:24 ` Christoph Hellwig 2018-06-21 15:34 ` Jens Axboe 1 sibling, 1 reply; 33+ messages in thread From: Christoph Hellwig @ 2018-06-21 15:24 UTC (permalink / raw) On Thu, Jun 21, 2018@09:04:14AM -0600, Jens Axboe wrote: > > I'll be spending the rest of the day on phone calls, so it'll take > > some time to do more on this than writing some email, but I'll try > > to find some time to look into these issues. > > OK sounds good, would be nice to get this hashed out as I think some > variant of this should go into 4.18. Fine with me, but I'd rather not rush it out this week to give us a little more time for details and aim for merging it to 4.18 next week. ^ permalink raw reply [flat|nested] 33+ messages in thread
* [PATCH] nvme: use __GFP_NOWARN for iod allocation 2018-06-21 15:24 ` Christoph Hellwig @ 2018-06-21 15:34 ` Jens Axboe 2018-06-21 15:45 ` Christoph Hellwig 0 siblings, 1 reply; 33+ messages in thread From: Jens Axboe @ 2018-06-21 15:34 UTC (permalink / raw) On 6/21/18 9:24 AM, Christoph Hellwig wrote: > On Thu, Jun 21, 2018@09:04:14AM -0600, Jens Axboe wrote: >>> I'll be spending the rest of the day on phone calls, so it'll take >>> some time to do more on this than writing some email, but I'll try >>> to find some time to look into these issues. >> >> OK sounds good, would be nice to get this hashed out as I think some >> variant of this should go into 4.18. > > Fine with me, but I'd rather not rush it out this week to give us > a little more time for details and aim for merging it to 4.18 next > week. Oh, I'm totally fine with that, no rush outside of just wanting to ensure it ends up in 4.18. But there's plenty of time for that. Below is cleaner I think, keeping it all in pci.c and fabrics should just work. I'll test... diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index 21710a7460c8..46df030b2c3f 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -1808,6 +1808,7 @@ static void nvme_set_queue_limits(struct nvme_ctrl *ctrl, u32 max_segments = (ctrl->max_hw_sectors / (ctrl->page_size >> 9)) + 1; + max_segments = min_not_zero(max_segments, ctrl->max_segments); blk_queue_max_hw_sectors(q, ctrl->max_hw_sectors); blk_queue_max_segments(q, min_t(u32, max_segments, USHRT_MAX)); } diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h index 231807cbc849..0c4a33df3b2f 100644 --- a/drivers/nvme/host/nvme.h +++ b/drivers/nvme/host/nvme.h @@ -170,6 +170,7 @@ struct nvme_ctrl { u64 cap; u32 page_size; u32 max_hw_sectors; + u32 max_segments; u16 oncs; u16 oacs; u16 nssa; diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index fc33804662e7..276acc045983 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -38,6 +38,13 @@ #define SGES_PER_PAGE (PAGE_SIZE / sizeof(struct nvme_sgl_desc)) +/* + * These can be higher, but we need to ensure that any command doesn't + * require an sg allocation that needs more than a page of data. + */ +#define NVME_MAX_KB_SZ 4096 +#define NVME_MAX_SEGS 127 + static int use_threaded_interrupts; module_param(use_threaded_interrupts, int, 0); @@ -100,6 +107,8 @@ struct nvme_dev { struct nvme_ctrl ctrl; struct completion ioq_wait; + mempool_t *iod_mempool; + /* shadow doorbell buffer support: */ u32 *dbbuf_dbs; dma_addr_t dbbuf_dbs_dma_addr; @@ -477,10 +486,7 @@ static blk_status_t nvme_init_iod(struct request *rq, struct nvme_dev *dev) iod->use_sgl = nvme_pci_use_sgls(dev, rq); if (nseg > NVME_INT_PAGES || size > NVME_INT_BYTES(dev)) { - size_t alloc_size = nvme_pci_iod_alloc_size(dev, size, nseg, - iod->use_sgl); - - iod->sg = kmalloc(alloc_size, GFP_ATOMIC); + iod->sg = mempool_alloc(dev->iod_mempool, GFP_ATOMIC); if (!iod->sg) return BLK_STS_RESOURCE; } else { @@ -526,7 +532,7 @@ static void nvme_free_iod(struct nvme_dev *dev, struct request *req) } if (iod->sg != iod->inline_sg) - kfree(iod->sg); + mempool_free(iod->sg, dev->iod_mempool); } #ifdef CONFIG_BLK_DEV_INTEGRITY @@ -2280,6 +2286,7 @@ static void nvme_pci_free_ctrl(struct nvme_ctrl *ctrl) blk_put_queue(dev->ctrl.admin_q); kfree(dev->queues); free_opal_dev(dev->ctrl.opal_dev); + mempool_destroy(dev->iod_mempool); kfree(dev); } @@ -2333,6 +2340,13 @@ static void nvme_reset_work(struct work_struct *work) if (result) goto out; + /* + * Limit the max command size to prevent iod->sg allocations going + * over a single page. + */ + dev->ctrl.max_hw_sectors = NVME_MAX_KB_SZ << 1; + dev->ctrl.max_segments = NVME_MAX_SEGS; + result = nvme_init_identify(&dev->ctrl); if (result) goto out; @@ -2509,6 +2523,7 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id) int node, result = -ENOMEM; struct nvme_dev *dev; unsigned long quirks = id->driver_data; + size_t alloc_size; node = dev_to_node(&pdev->dev); if (node == NUMA_NO_NODE) @@ -2546,6 +2561,23 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id) if (result) goto release_pools; + /* + * Double check that our mempool alloc size will cover the biggest + * command we support. + */ + alloc_size = nvme_pci_iod_alloc_size(dev, NVME_MAX_KB_SZ, + NVME_MAX_SEGS, true); + WARN_ON_ONCE(alloc_size > PAGE_SIZE); + + dev->iod_mempool = mempool_create_node(1, mempool_kmalloc, + mempool_kfree, + (void *) alloc_size, + GFP_KERNEL, node); + if (!dev->iod_mempool) { + result = -ENOMEM; + goto release_pools; + } + dev_info(dev->ctrl.device, "pci function %s\n", dev_name(&pdev->dev)); nvme_get_ctrl(&dev->ctrl); -- Jens Axboe ^ permalink raw reply related [flat|nested] 33+ messages in thread
* [PATCH] nvme: use __GFP_NOWARN for iod allocation 2018-06-21 15:34 ` Jens Axboe @ 2018-06-21 15:45 ` Christoph Hellwig 2018-06-21 15:49 ` Jens Axboe 0 siblings, 1 reply; 33+ messages in thread From: Christoph Hellwig @ 2018-06-21 15:45 UTC (permalink / raw) On Thu, Jun 21, 2018@09:34:02AM -0600, Jens Axboe wrote: > Oh, I'm totally fine with that, no rush outside of just wanting to > ensure it ends up in 4.18. But there's plenty of time for that. > > Below is cleaner I think, keeping it all in pci.c and fabrics should > just work. I'll test... Ok, this actually looks harmless enough that with successfull testing and an ack from Keith we could queue it up this week.. ^ permalink raw reply [flat|nested] 33+ messages in thread
* [PATCH] nvme: use __GFP_NOWARN for iod allocation 2018-06-21 15:45 ` Christoph Hellwig @ 2018-06-21 15:49 ` Jens Axboe 2018-06-21 16:27 ` Keith Busch 0 siblings, 1 reply; 33+ messages in thread From: Jens Axboe @ 2018-06-21 15:49 UTC (permalink / raw) On 6/21/18 9:45 AM, Christoph Hellwig wrote: > On Thu, Jun 21, 2018@09:34:02AM -0600, Jens Axboe wrote: >> Oh, I'm totally fine with that, no rush outside of just wanting to >> ensure it ends up in 4.18. But there's plenty of time for that. >> >> Below is cleaner I think, keeping it all in pci.c and fabrics should >> just work. I'll test... > > Ok, this actually looks harmless enough that with successfull testing > and an ack from Keith we could queue it up this week.. Tested it here now (laptop and box with 6 drives), works for me. Proper patch below. From: Jens Axboe <axboe@kernel.dk> Subject: [PATCH] nvme: limit max IO size and segments for pci-e nvme requires an sg table allocation for each request. If the request is large, then the allocation can become quite large. For instance, with our default software settings of 1280KB IO size, we'll need 10248 bytes of sg table. That turns into a 2nd order allocation, which we can't always guarantee. If we fail the allocation, blk-mq will retry it later. But there's no guarantee that we'll EVER be able to allocate that much contigious memory. Limit the IO size such that we never need more than a single page of memory. That's a lot faster and more reliable. Then back that allocation with a mempool, so that we know we'll always be able to succeed the allocation at some point. Signed-off-by: Jens Axboe <axboe at kernel.dk> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index 21710a7460c8..46df030b2c3f 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -1808,6 +1808,7 @@ static void nvme_set_queue_limits(struct nvme_ctrl *ctrl, u32 max_segments = (ctrl->max_hw_sectors / (ctrl->page_size >> 9)) + 1; + max_segments = min_not_zero(max_segments, ctrl->max_segments); blk_queue_max_hw_sectors(q, ctrl->max_hw_sectors); blk_queue_max_segments(q, min_t(u32, max_segments, USHRT_MAX)); } diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h index 231807cbc849..0c4a33df3b2f 100644 --- a/drivers/nvme/host/nvme.h +++ b/drivers/nvme/host/nvme.h @@ -170,6 +170,7 @@ struct nvme_ctrl { u64 cap; u32 page_size; u32 max_hw_sectors; + u32 max_segments; u16 oncs; u16 oacs; u16 nssa; diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index fc33804662e7..276acc045983 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -38,6 +38,13 @@ #define SGES_PER_PAGE (PAGE_SIZE / sizeof(struct nvme_sgl_desc)) +/* + * These can be higher, but we need to ensure that any command doesn't + * require an sg allocation that needs more than a page of data. + */ +#define NVME_MAX_KB_SZ 4096 +#define NVME_MAX_SEGS 127 + static int use_threaded_interrupts; module_param(use_threaded_interrupts, int, 0); @@ -100,6 +107,8 @@ struct nvme_dev { struct nvme_ctrl ctrl; struct completion ioq_wait; + mempool_t *iod_mempool; + /* shadow doorbell buffer support: */ u32 *dbbuf_dbs; dma_addr_t dbbuf_dbs_dma_addr; @@ -477,10 +486,7 @@ static blk_status_t nvme_init_iod(struct request *rq, struct nvme_dev *dev) iod->use_sgl = nvme_pci_use_sgls(dev, rq); if (nseg > NVME_INT_PAGES || size > NVME_INT_BYTES(dev)) { - size_t alloc_size = nvme_pci_iod_alloc_size(dev, size, nseg, - iod->use_sgl); - - iod->sg = kmalloc(alloc_size, GFP_ATOMIC); + iod->sg = mempool_alloc(dev->iod_mempool, GFP_ATOMIC); if (!iod->sg) return BLK_STS_RESOURCE; } else { @@ -526,7 +532,7 @@ static void nvme_free_iod(struct nvme_dev *dev, struct request *req) } if (iod->sg != iod->inline_sg) - kfree(iod->sg); + mempool_free(iod->sg, dev->iod_mempool); } #ifdef CONFIG_BLK_DEV_INTEGRITY @@ -2280,6 +2286,7 @@ static void nvme_pci_free_ctrl(struct nvme_ctrl *ctrl) blk_put_queue(dev->ctrl.admin_q); kfree(dev->queues); free_opal_dev(dev->ctrl.opal_dev); + mempool_destroy(dev->iod_mempool); kfree(dev); } @@ -2333,6 +2340,13 @@ static void nvme_reset_work(struct work_struct *work) if (result) goto out; + /* + * Limit the max command size to prevent iod->sg allocations going + * over a single page. + */ + dev->ctrl.max_hw_sectors = NVME_MAX_KB_SZ << 1; + dev->ctrl.max_segments = NVME_MAX_SEGS; + result = nvme_init_identify(&dev->ctrl); if (result) goto out; @@ -2509,6 +2523,7 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id) int node, result = -ENOMEM; struct nvme_dev *dev; unsigned long quirks = id->driver_data; + size_t alloc_size; node = dev_to_node(&pdev->dev); if (node == NUMA_NO_NODE) @@ -2546,6 +2561,23 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id) if (result) goto release_pools; + /* + * Double check that our mempool alloc size will cover the biggest + * command we support. + */ + alloc_size = nvme_pci_iod_alloc_size(dev, NVME_MAX_KB_SZ, + NVME_MAX_SEGS, true); + WARN_ON_ONCE(alloc_size > PAGE_SIZE); + + dev->iod_mempool = mempool_create_node(1, mempool_kmalloc, + mempool_kfree, + (void *) alloc_size, + GFP_KERNEL, node); + if (!dev->iod_mempool) { + result = -ENOMEM; + goto release_pools; + } + dev_info(dev->ctrl.device, "pci function %s\n", dev_name(&pdev->dev)); nvme_get_ctrl(&dev->ctrl); -- Jens Axboe ^ permalink raw reply related [flat|nested] 33+ messages in thread
* [PATCH] nvme: use __GFP_NOWARN for iod allocation 2018-06-21 15:49 ` Jens Axboe @ 2018-06-21 16:27 ` Keith Busch 2018-06-21 17:20 ` Jens Axboe 0 siblings, 1 reply; 33+ messages in thread From: Keith Busch @ 2018-06-21 16:27 UTC (permalink / raw) On Thu, Jun 21, 2018@09:49:37AM -0600, Jens Axboe wrote: > From: Jens Axboe <axboe at kernel.dk> > Subject: [PATCH] nvme: limit max IO size and segments for pci-e > > nvme requires an sg table allocation for each request. If the request > is large, then the allocation can become quite large. For instance, > with our default software settings of 1280KB IO size, we'll need > 10248 bytes of sg table. That turns into a 2nd order allocation, > which we can't always guarantee. If we fail the allocation, blk-mq > will retry it later. But there's no guarantee that we'll EVER be > able to allocate that much contigious memory. > > Limit the IO size such that we never need more than a single page > of memory. That's a lot faster and more reliable. Then back that > allocation with a mempool, so that we know we'll always be able > to succeed the allocation at some point. > > Signed-off-by: Jens Axboe <axboe at kernel.dk> This looks good to me. Thanks! Acked-by: Keith Busch <keith.busch at intel.com> ^ permalink raw reply [flat|nested] 33+ messages in thread
* [PATCH] nvme: use __GFP_NOWARN for iod allocation 2018-06-21 16:27 ` Keith Busch @ 2018-06-21 17:20 ` Jens Axboe 2018-06-21 17:54 ` Christoph Hellwig 0 siblings, 1 reply; 33+ messages in thread From: Jens Axboe @ 2018-06-21 17:20 UTC (permalink / raw) On 6/21/18 10:27 AM, Keith Busch wrote: > On Thu, Jun 21, 2018@09:49:37AM -0600, Jens Axboe wrote: >> From: Jens Axboe <axboe at kernel.dk> >> Subject: [PATCH] nvme: limit max IO size and segments for pci-e >> >> nvme requires an sg table allocation for each request. If the request >> is large, then the allocation can become quite large. For instance, >> with our default software settings of 1280KB IO size, we'll need >> 10248 bytes of sg table. That turns into a 2nd order allocation, >> which we can't always guarantee. If we fail the allocation, blk-mq >> will retry it later. But there's no guarantee that we'll EVER be >> able to allocate that much contigious memory. >> >> Limit the IO size such that we never need more than a single page >> of memory. That's a lot faster and more reliable. Then back that >> allocation with a mempool, so that we know we'll always be able >> to succeed the allocation at some point. >> >> Signed-off-by: Jens Axboe <axboe at kernel.dk> > > This looks good to me. Thanks! > > Acked-by: Keith Busch <keith.busch at intel.com> Thanks Keith - Christoph/Keith, where do you want to queue this up? I can do it directly, or we can do it through the nvme tree. -- Jens Axboe ^ permalink raw reply [flat|nested] 33+ messages in thread
* [PATCH] nvme: use __GFP_NOWARN for iod allocation 2018-06-21 17:20 ` Jens Axboe @ 2018-06-21 17:54 ` Christoph Hellwig 2018-06-21 19:18 ` Jens Axboe 0 siblings, 1 reply; 33+ messages in thread From: Christoph Hellwig @ 2018-06-21 17:54 UTC (permalink / raw) On Thu, Jun 21, 2018@11:20:05AM -0600, Jens Axboe wrote: > Thanks Keith - Christoph/Keith, where do you want to queue this up? > I can do it directly, or we can do it through the nvme tree. I've queued it up in the nvme tree already. ^ permalink raw reply [flat|nested] 33+ messages in thread
* [PATCH] nvme: use __GFP_NOWARN for iod allocation 2018-06-21 17:54 ` Christoph Hellwig @ 2018-06-21 19:18 ` Jens Axboe 0 siblings, 0 replies; 33+ messages in thread From: Jens Axboe @ 2018-06-21 19:18 UTC (permalink / raw) On 6/21/18 11:54 AM, Christoph Hellwig wrote: > On Thu, Jun 21, 2018@11:20:05AM -0600, Jens Axboe wrote: >> Thanks Keith - Christoph/Keith, where do you want to queue this up? >> I can do it directly, or we can do it through the nvme tree. > > I've queued it up in the nvme tree already. Perfect, thanks. -- Jens Axboe ^ permalink raw reply [flat|nested] 33+ messages in thread
* [PATCH] nvme: use __GFP_NOWARN for iod allocation 2018-06-20 20:09 ` Jens Axboe 2018-06-20 20:43 ` Keith Busch @ 2018-06-20 20:51 ` Jens Axboe 2018-06-21 7:15 ` Christoph Hellwig 2018-06-21 7:09 ` Christoph Hellwig 2 siblings, 1 reply; 33+ messages in thread From: Jens Axboe @ 2018-06-20 20:51 UTC (permalink / raw) On 6/20/18 2:09 PM, Jens Axboe wrote: > On 6/20/18 1:16 PM, Jens Axboe wrote: >> On 6/20/18 1:10 PM, Jens Axboe wrote: >>> This begs the question is we should limit the size in general. The >>> command overhead is low enough that I think we should default to >>> something sane that doesn't require _any_ > 0 order allocations. >>> >>> Our default 1280kb will require 10248 bytes of iod, which is an >>> order 2 allocation.... That's not helping tail latencies in >>> data centers, where memory is always full and fragmented. >> >> In terms of sizing, defaulting to 256kb might not be a bad idea. >> That'd be 2056 bytes for a normal config. Alternatively we could >> go 1280/4 == 320k, which would be 2568 bytes. >> >> In any case, I think it's something that's worth doing. > > Talking more to myself - higher order allocations needed to make > progress isn't a viable approach. There may _never_ be any available, > which means we are stuck since this isn't backed by a mempool. > > How about something like the below. Default to a size that will > never require > PAGE_SIZE allocation, and back that with a single > entry mempool. Even if we don't wait on it, if we keep retrying > we'll always be able to allocate memory and make forward progress. > And allocating a single page is a hell of a lot more likely than > a 2nd order allocation. This one is actually tested. As a nice benefit, the pool is node local and we take the divisions out of the fast path since we no longer need to calculate the iod->sg size. diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index 21710a7460c8..f968c7015202 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -1808,8 +1808,9 @@ static void nvme_set_queue_limits(struct nvme_ctrl *ctrl, u32 max_segments = (ctrl->max_hw_sectors / (ctrl->page_size >> 9)) + 1; + max_segments = min_not_zero(max_segments, (u32)NVME_MAX_SEGS); blk_queue_max_hw_sectors(q, ctrl->max_hw_sectors); - blk_queue_max_segments(q, min_t(u32, max_segments, USHRT_MAX)); + blk_queue_max_segments(q, max_segments); } if ((ctrl->quirks & NVME_QUIRK_STRIPE_SIZE) && is_power_of_2(ctrl->max_hw_sectors)) @@ -2378,7 +2379,7 @@ int nvme_init_identify(struct nvme_ctrl *ctrl) if (id->mdts) max_hw_sectors = 1 << (id->mdts + page_shift - 9); else - max_hw_sectors = UINT_MAX; + max_hw_sectors = NVME_MAX_KB_SZ << 1; ctrl->max_hw_sectors = min_not_zero(ctrl->max_hw_sectors, max_hw_sectors); diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h index 231807cbc849..efce853a6638 100644 --- a/drivers/nvme/host/nvme.h +++ b/drivers/nvme/host/nvme.h @@ -33,6 +33,13 @@ extern unsigned int admin_timeout; #define NVME_DEFAULT_KATO 5 #define NVME_KATO_GRACE 10 +/* + * These can be higher, but we need to ensure that any command doesn't + * require an sg allocation that needs more than a page of data. + */ +#define NVME_MAX_KB_SZ 4096 +#define NVME_MAX_SEGS 127 + extern struct workqueue_struct *nvme_wq; extern struct workqueue_struct *nvme_reset_wq; extern struct workqueue_struct *nvme_delete_wq; diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index fc33804662e7..d5f603b05394 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -100,6 +100,8 @@ struct nvme_dev { struct nvme_ctrl ctrl; struct completion ioq_wait; + mempool_t *iod_mempool; + /* shadow doorbell buffer support: */ u32 *dbbuf_dbs; dma_addr_t dbbuf_dbs_dma_addr; @@ -477,10 +479,7 @@ static blk_status_t nvme_init_iod(struct request *rq, struct nvme_dev *dev) iod->use_sgl = nvme_pci_use_sgls(dev, rq); if (nseg > NVME_INT_PAGES || size > NVME_INT_BYTES(dev)) { - size_t alloc_size = nvme_pci_iod_alloc_size(dev, size, nseg, - iod->use_sgl); - - iod->sg = kmalloc(alloc_size, GFP_ATOMIC); + iod->sg = mempool_alloc(dev->iod_mempool, GFP_ATOMIC); if (!iod->sg) return BLK_STS_RESOURCE; } else { @@ -526,7 +525,7 @@ static void nvme_free_iod(struct nvme_dev *dev, struct request *req) } if (iod->sg != iod->inline_sg) - kfree(iod->sg); + mempool_free(iod->sg, dev->iod_mempool); } #ifdef CONFIG_BLK_DEV_INTEGRITY @@ -2280,6 +2279,7 @@ static void nvme_pci_free_ctrl(struct nvme_ctrl *ctrl) blk_put_queue(dev->ctrl.admin_q); kfree(dev->queues); free_opal_dev(dev->ctrl.opal_dev); + mempool_destroy(dev->iod_mempool); kfree(dev); } @@ -2509,6 +2509,7 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id) int node, result = -ENOMEM; struct nvme_dev *dev; unsigned long quirks = id->driver_data; + size_t alloc_size; node = dev_to_node(&pdev->dev); if (node == NUMA_NO_NODE) @@ -2546,6 +2547,18 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id) if (result) goto release_pools; + alloc_size = nvme_pci_iod_alloc_size(dev, NVME_MAX_KB_SZ, + NVME_MAX_SEGS, true); + WARN_ON_ONCE(alloc_size > PAGE_SIZE); + dev->iod_mempool = mempool_create_node(1, mempool_kmalloc, + mempool_kfree, + (void *) alloc_size, + GFP_KERNEL, node); + if (!dev->iod_mempool) { + result = -ENOMEM; + goto release_pools; + } + dev_info(dev->ctrl.device, "pci function %s\n", dev_name(&pdev->dev)); nvme_get_ctrl(&dev->ctrl); -- Jens Axboe ^ permalink raw reply related [flat|nested] 33+ messages in thread
* [PATCH] nvme: use __GFP_NOWARN for iod allocation 2018-06-20 20:51 ` Jens Axboe @ 2018-06-21 7:15 ` Christoph Hellwig 2018-06-21 14:37 ` Jens Axboe 0 siblings, 1 reply; 33+ messages in thread From: Christoph Hellwig @ 2018-06-21 7:15 UTC (permalink / raw) > else > - max_hw_sectors = UINT_MAX; > + max_hw_sectors = NVME_MAX_KB_SZ << 1; I think this only makes sense when using PRPs. If we use SGLs in Fabrics, or in PCIe controllers that support them we are only bound by the number of SGL segments and MDTS. This also seems to miss handling the case where MDTS is set, but above our software limit. ^ permalink raw reply [flat|nested] 33+ messages in thread
* [PATCH] nvme: use __GFP_NOWARN for iod allocation 2018-06-21 7:15 ` Christoph Hellwig @ 2018-06-21 14:37 ` Jens Axboe 2018-06-21 14:56 ` Christoph Hellwig 0 siblings, 1 reply; 33+ messages in thread From: Jens Axboe @ 2018-06-21 14:37 UTC (permalink / raw) On 6/21/18 1:15 AM, Christoph Hellwig wrote: >> else >> - max_hw_sectors = UINT_MAX; >> + max_hw_sectors = NVME_MAX_KB_SZ << 1; > > I think this only makes sense when using PRPs. If we use SGLs > in Fabrics, or in PCIe controllers that support them we are > only bound by the number of SGL segments and MDTS. The problem is that we have to limit the max size, since the sg allocation depends on that, marginally. In reality, the segments will dictate the max size anyway. > This also seems to miss handling the case where MDTS is set, but > above our software limit. Shouldn't, we're using the min of MDTS and the software limit. -- Jens Axboe ^ permalink raw reply [flat|nested] 33+ messages in thread
* [PATCH] nvme: use __GFP_NOWARN for iod allocation 2018-06-21 14:37 ` Jens Axboe @ 2018-06-21 14:56 ` Christoph Hellwig 2018-06-21 15:01 ` Jens Axboe 0 siblings, 1 reply; 33+ messages in thread From: Christoph Hellwig @ 2018-06-21 14:56 UTC (permalink / raw) On Thu, Jun 21, 2018@08:37:28AM -0600, Jens Axboe wrote: > On 6/21/18 1:15 AM, Christoph Hellwig wrote: > >> else > >> - max_hw_sectors = UINT_MAX; > >> + max_hw_sectors = NVME_MAX_KB_SZ << 1; > > > > I think this only makes sense when using PRPs. If we use SGLs > > in Fabrics, or in PCIe controllers that support them we are > > only bound by the number of SGL segments and MDTS. > > The problem is that we have to limit the max size, since the > sg allocation depends on that, marginally. In reality, the segments > will dictate the max size anyway. The sg allocation depends on the number of segments. The PRP allocation (if using PRPs) depends purely on the size + a little slack for unaligned first or last arguments. > > > This also seems to miss handling the case where MDTS is set, but > > above our software limit. > > Shouldn't, we're using the min of MDTS and the software limit. Ok, maybe I misread the patch, but it surely looked like that. ^ permalink raw reply [flat|nested] 33+ messages in thread
* [PATCH] nvme: use __GFP_NOWARN for iod allocation 2018-06-21 14:56 ` Christoph Hellwig @ 2018-06-21 15:01 ` Jens Axboe 0 siblings, 0 replies; 33+ messages in thread From: Jens Axboe @ 2018-06-21 15:01 UTC (permalink / raw) On 6/21/18 8:56 AM, Christoph Hellwig wrote: > On Thu, Jun 21, 2018@08:37:28AM -0600, Jens Axboe wrote: >> On 6/21/18 1:15 AM, Christoph Hellwig wrote: >>>> else >>>> - max_hw_sectors = UINT_MAX; >>>> + max_hw_sectors = NVME_MAX_KB_SZ << 1; >>> >>> I think this only makes sense when using PRPs. If we use SGLs >>> in Fabrics, or in PCIe controllers that support them we are >>> only bound by the number of SGL segments and MDTS. >> >> The problem is that we have to limit the max size, since the >> sg allocation depends on that, marginally. In reality, the segments >> will dictate the max size anyway. > > The sg allocation depends on the number of segments. The PRP > allocation (if using PRPs) depends purely on the size + a little > slack for unaligned first or last arguments. Right. So for sg, we can ignore the size, and just limit the segments. -- Jens Axboe ^ permalink raw reply [flat|nested] 33+ messages in thread
* [PATCH] nvme: use __GFP_NOWARN for iod allocation 2018-06-20 20:09 ` Jens Axboe 2018-06-20 20:43 ` Keith Busch 2018-06-20 20:51 ` Jens Axboe @ 2018-06-21 7:09 ` Christoph Hellwig 2018-06-21 14:39 ` Jens Axboe 2 siblings, 1 reply; 33+ messages in thread From: Christoph Hellwig @ 2018-06-21 7:09 UTC (permalink / raw) On Wed, Jun 20, 2018@02:09:03PM -0600, Jens Axboe wrote: > Talking more to myself - higher order allocations needed to make > progress isn't a viable approach. There may _never_ be any available, > which means we are stuck since this isn't backed by a mempool. Yes, I've complained about that a few times in the past, but so far no one has hit it in practice. > How about something like the below. Default to a size that will > never require > PAGE_SIZE allocation, and back that with a single > entry mempool. Even if we don't wait on it, if we keep retrying > we'll always be able to allocate memory and make forward progress. > And allocating a single page is a hell of a lot more likely than > a 2nd order allocation. I suspect some people might be unhappy about the limit on I/O sizes. That is why we want through all that effort with chained S/G lists in SCSI, which are now also used by NVMe over Fabrics. But except for that we absolutely have to move to mempools. ^ permalink raw reply [flat|nested] 33+ messages in thread
* [PATCH] nvme: use __GFP_NOWARN for iod allocation 2018-06-21 7:09 ` Christoph Hellwig @ 2018-06-21 14:39 ` Jens Axboe 2018-06-21 14:58 ` Christoph Hellwig 0 siblings, 1 reply; 33+ messages in thread From: Jens Axboe @ 2018-06-21 14:39 UTC (permalink / raw) On 6/21/18 1:09 AM, Christoph Hellwig wrote: > On Wed, Jun 20, 2018@02:09:03PM -0600, Jens Axboe wrote: >> Talking more to myself - higher order allocations needed to make >> progress isn't a viable approach. There may _never_ be any available, >> which means we are stuck since this isn't backed by a mempool. > > Yes, I've complained about that a few times in the past, but so far > no one has hit it in practice. Well, I have :-). The problem is that it'll only really happen in production, since no micro testing will end up with memory fragmented enough to pose an issue. >> How about something like the below. Default to a size that will >> never require > PAGE_SIZE allocation, and back that with a single >> entry mempool. Even if we don't wait on it, if we keep retrying >> we'll always be able to allocate memory and make forward progress. >> And allocating a single page is a hell of a lot more likely than >> a 2nd order allocation. > > I suspect some people might be unhappy about the limit on I/O > sizes. That is why we want through all that effort with > chained S/G lists in SCSI, which are now also used by NVMe over > Fabrics. But except for that we absolutely have to move to > mempools. That may be the case, but it's a lot less of an issue than SCSI is, since the command overhead is suitably small. So I don't see a big issue with limiting it. Besides, having to do higher order allocations for an IO is bound to pose much bigger issues than having two smaller IOs instead. That's especially true in production, where memory allocation stalls are pretty frequent. -- Jens Axboe ^ permalink raw reply [flat|nested] 33+ messages in thread
* [PATCH] nvme: use __GFP_NOWARN for iod allocation 2018-06-21 14:39 ` Jens Axboe @ 2018-06-21 14:58 ` Christoph Hellwig 2018-06-21 15:02 ` Jens Axboe 0 siblings, 1 reply; 33+ messages in thread From: Christoph Hellwig @ 2018-06-21 14:58 UTC (permalink / raw) On Thu, Jun 21, 2018@08:39:47AM -0600, Jens Axboe wrote: > That may be the case, but it's a lot less of an issue than SCSI is, > since the command overhead is suitably small. So I don't see a big issue > with limiting it. Not if you are using NVMe over FC, which uses exactly the same protocol to transfer a 64 byte NVMe SQ vs a typically 16 byte SCSI CDB.. > Besides, having to do higher order allocations for an > IO is bound to pose much bigger issues than having two smaller IOs > instead. That's especially true in production, where memory allocation > stalls are pretty frequent. But the whole point of our chained sg allocator is that we don't do high order allocations, but multiple smaller ones and then chain them. ^ permalink raw reply [flat|nested] 33+ messages in thread
* [PATCH] nvme: use __GFP_NOWARN for iod allocation 2018-06-21 14:58 ` Christoph Hellwig @ 2018-06-21 15:02 ` Jens Axboe 2018-06-21 15:06 ` Christoph Hellwig 0 siblings, 1 reply; 33+ messages in thread From: Jens Axboe @ 2018-06-21 15:02 UTC (permalink / raw) On 6/21/18 8:58 AM, Christoph Hellwig wrote: > On Thu, Jun 21, 2018@08:39:47AM -0600, Jens Axboe wrote: >> That may be the case, but it's a lot less of an issue than SCSI is, >> since the command overhead is suitably small. So I don't see a big issue >> with limiting it. > > > Not if you are using NVMe over FC, which uses exactly the same > protocol to transfer a 64 byte NVMe SQ vs a typically 16 byte SCSI CDB.. I should have prefaced that with saying "for pci-e". For fabrics, I agree. >> Besides, having to do higher order allocations for an >> IO is bound to pose much bigger issues than having two smaller IOs >> instead. That's especially true in production, where memory allocation >> stalls are pretty frequent. > > > But the whole point of our chained sg allocator is that we don't do > high order allocations, but multiple smaller ones and then chain > them. >From a quick look, the fabric code goes through the same nvme_queue_rq() and ends up doing the same iod->sg allocation. I don't see any use of chained sg lists, we're allocating one big sg table for the command. -- Jens Axboe ^ permalink raw reply [flat|nested] 33+ messages in thread
* [PATCH] nvme: use __GFP_NOWARN for iod allocation 2018-06-21 15:02 ` Jens Axboe @ 2018-06-21 15:06 ` Christoph Hellwig 2018-06-21 15:11 ` Jens Axboe 0 siblings, 1 reply; 33+ messages in thread From: Christoph Hellwig @ 2018-06-21 15:06 UTC (permalink / raw) On Thu, Jun 21, 2018@09:02:18AM -0600, Jens Axboe wrote: > >From a quick look, the fabric code goes through the same nvme_queue_rq() > and ends up doing the same iod->sg allocation. I don't see any use of > chained sg lists, we're allocating one big sg table for the command. nvme_queue_rq is only used for PCIe. E.g. for RDMA the call is: nvme_rdma_queue_rq -> nvme_rdma_map_data -> sg_alloc_table_chained similar for FC and loop. ^ permalink raw reply [flat|nested] 33+ messages in thread
* [PATCH] nvme: use __GFP_NOWARN for iod allocation 2018-06-21 15:06 ` Christoph Hellwig @ 2018-06-21 15:11 ` Jens Axboe 0 siblings, 0 replies; 33+ messages in thread From: Jens Axboe @ 2018-06-21 15:11 UTC (permalink / raw) On 6/21/18 9:06 AM, Christoph Hellwig wrote: > On Thu, Jun 21, 2018@09:02:18AM -0600, Jens Axboe wrote: >> >From a quick look, the fabric code goes through the same nvme_queue_rq() >> and ends up doing the same iod->sg allocation. I don't see any use of >> chained sg lists, we're allocating one big sg table for the command. > > nvme_queue_rq is only used for PCIe. > > E.g. for RDMA the call is: > > nvme_rdma_queue_rq > -> nvme_rdma_map_data > -> sg_alloc_table_chained > > similar for FC and loop. Gotcha. So we can safely ignore limiting segments and max size for fabrics for now. How about the below? diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index 21710a7460c8..f1af08171d38 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -1799,6 +1799,16 @@ int nvme_shutdown_ctrl(struct nvme_ctrl *ctrl) } EXPORT_SYMBOL_GPL(nvme_shutdown_ctrl); +/* + * For non-fabrics, limit the max IO size and segments so that we know our + * iod->sg allocation can fall within a single page. Fabrics is more sensitive + * to command size, and it also does sg chaining to avoid large allocations. + */ +static bool nvme_ctrl_limit_io_size(struct nvme_ctrl *ctrl) +{ + return (ctrl->ops->flags & NVME_F_FABRICS) == 0; +} + static void nvme_set_queue_limits(struct nvme_ctrl *ctrl, struct request_queue *q) { @@ -1808,8 +1818,11 @@ static void nvme_set_queue_limits(struct nvme_ctrl *ctrl, u32 max_segments = (ctrl->max_hw_sectors / (ctrl->page_size >> 9)) + 1; + if (nvme_ctrl_limit_io_size(ctrl)) + max_segments = min_not_zero(max_segments, + (u32)NVME_MAX_SEGS); blk_queue_max_hw_sectors(q, ctrl->max_hw_sectors); - blk_queue_max_segments(q, min_t(u32, max_segments, USHRT_MAX)); + blk_queue_max_segments(q, max_segments); } if ((ctrl->quirks & NVME_QUIRK_STRIPE_SIZE) && is_power_of_2(ctrl->max_hw_sectors)) @@ -2377,8 +2390,10 @@ int nvme_init_identify(struct nvme_ctrl *ctrl) ctrl->cntlid = le16_to_cpup(&id->cntlid); if (id->mdts) max_hw_sectors = 1 << (id->mdts + page_shift - 9); - else + else if (!nvme_ctrl_limit_io_size(ctrl)) max_hw_sectors = UINT_MAX; + else + max_hw_sectors = NVME_MAX_KB_SZ << 1; ctrl->max_hw_sectors = min_not_zero(ctrl->max_hw_sectors, max_hw_sectors); diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h index 231807cbc849..efce853a6638 100644 --- a/drivers/nvme/host/nvme.h +++ b/drivers/nvme/host/nvme.h @@ -33,6 +33,13 @@ extern unsigned int admin_timeout; #define NVME_DEFAULT_KATO 5 #define NVME_KATO_GRACE 10 +/* + * These can be higher, but we need to ensure that any command doesn't + * require an sg allocation that needs more than a page of data. + */ +#define NVME_MAX_KB_SZ 4096 +#define NVME_MAX_SEGS 127 + extern struct workqueue_struct *nvme_wq; extern struct workqueue_struct *nvme_reset_wq; extern struct workqueue_struct *nvme_delete_wq; diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index fc33804662e7..663446d55a49 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -100,6 +100,8 @@ struct nvme_dev { struct nvme_ctrl ctrl; struct completion ioq_wait; + mempool_t *iod_mempool; + /* shadow doorbell buffer support: */ u32 *dbbuf_dbs; dma_addr_t dbbuf_dbs_dma_addr; @@ -477,10 +479,7 @@ static blk_status_t nvme_init_iod(struct request *rq, struct nvme_dev *dev) iod->use_sgl = nvme_pci_use_sgls(dev, rq); if (nseg > NVME_INT_PAGES || size > NVME_INT_BYTES(dev)) { - size_t alloc_size = nvme_pci_iod_alloc_size(dev, size, nseg, - iod->use_sgl); - - iod->sg = kmalloc(alloc_size, GFP_ATOMIC); + iod->sg = mempool_alloc(dev->iod_mempool, GFP_ATOMIC); if (!iod->sg) return BLK_STS_RESOURCE; } else { @@ -526,7 +525,7 @@ static void nvme_free_iod(struct nvme_dev *dev, struct request *req) } if (iod->sg != iod->inline_sg) - kfree(iod->sg); + mempool_free(iod->sg, dev->iod_mempool); } #ifdef CONFIG_BLK_DEV_INTEGRITY @@ -2280,6 +2279,7 @@ static void nvme_pci_free_ctrl(struct nvme_ctrl *ctrl) blk_put_queue(dev->ctrl.admin_q); kfree(dev->queues); free_opal_dev(dev->ctrl.opal_dev); + mempool_destroy(dev->iod_mempool); kfree(dev); } @@ -2509,6 +2509,7 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id) int node, result = -ENOMEM; struct nvme_dev *dev; unsigned long quirks = id->driver_data; + size_t alloc_size; node = dev_to_node(&pdev->dev); if (node == NUMA_NO_NODE) @@ -2546,6 +2547,23 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id) if (result) goto release_pools; + /* + * Double check that our mempool alloc size will cover the biggest + * command we support. + */ + alloc_size = nvme_pci_iod_alloc_size(dev, NVME_MAX_KB_SZ, + NVME_MAX_SEGS, true); + WARN_ON_ONCE(alloc_size > PAGE_SIZE); + + dev->iod_mempool = mempool_create_node(1, mempool_kmalloc, + mempool_kfree, + (void *) alloc_size, + GFP_KERNEL, node); + if (!dev->iod_mempool) { + result = -ENOMEM; + goto release_pools; + } + dev_info(dev->ctrl.device, "pci function %s\n", dev_name(&pdev->dev)); nvme_get_ctrl(&dev->ctrl); -- Jens Axboe ^ permalink raw reply related [flat|nested] 33+ messages in thread
end of thread, other threads:[~2018-06-21 19:18 UTC | newest] Thread overview: 33+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2018-06-20 19:10 [PATCH] nvme: use __GFP_NOWARN for iod allocation Jens Axboe 2018-06-20 19:16 ` Jens Axboe 2018-06-20 20:09 ` Jens Axboe 2018-06-20 20:43 ` Keith Busch 2018-06-20 20:54 ` Jens Axboe 2018-06-20 22:30 ` Keith Busch 2018-06-20 22:30 ` Jens Axboe 2018-06-21 7:10 ` Christoph Hellwig 2018-06-21 14:47 ` Jens Axboe 2018-06-21 15:01 ` Christoph Hellwig 2018-06-21 15:04 ` Jens Axboe 2018-06-21 15:06 ` Jens Axboe 2018-06-21 15:24 ` Christoph Hellwig 2018-06-21 15:26 ` Jens Axboe 2018-06-21 15:24 ` Christoph Hellwig 2018-06-21 15:34 ` Jens Axboe 2018-06-21 15:45 ` Christoph Hellwig 2018-06-21 15:49 ` Jens Axboe 2018-06-21 16:27 ` Keith Busch 2018-06-21 17:20 ` Jens Axboe 2018-06-21 17:54 ` Christoph Hellwig 2018-06-21 19:18 ` Jens Axboe 2018-06-20 20:51 ` Jens Axboe 2018-06-21 7:15 ` Christoph Hellwig 2018-06-21 14:37 ` Jens Axboe 2018-06-21 14:56 ` Christoph Hellwig 2018-06-21 15:01 ` Jens Axboe 2018-06-21 7:09 ` Christoph Hellwig 2018-06-21 14:39 ` Jens Axboe 2018-06-21 14:58 ` Christoph Hellwig 2018-06-21 15:02 ` Jens Axboe 2018-06-21 15:06 ` Christoph Hellwig 2018-06-21 15:11 ` Jens Axboe
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox