* [PATCH] nvme-pci: calculate IO timeout
@ 2021-10-13 2:27 Keith Busch
2021-10-13 5:03 ` Christoph Hellwig
` (2 more replies)
0 siblings, 3 replies; 6+ messages in thread
From: Keith Busch @ 2021-10-13 2:27 UTC (permalink / raw)
To: linux-nvme, sagi, hch; +Cc: Keith Busch
Existing host and nvme device combinations are more frequently capable
of sustaining outstanding transfer sizes exceeding the driver's default
timeout tolerance, given the available device throughput.
Let's consider a "mid" level server with 128 CPUs and an NVMe
controller with no MDTS limit (the driver will throttle to 4MiB).
If we assume the driver's default 1k depth per-queue, this can allow
128k outstanding IO submission queue entries.
If all SQ Entries are transferring the 4MiB max request, 512GB will be
outstanding at the same time with the default 30 second timer to
complete the entirety.
If we assume a modern PCIe Gen4 x4 NVMe device, that amount of
data will take ~70 seconds to transfer over the PCIe link, not
considering the device side internal latency: timeouts and IO failures
are therefore inevitable.
There are some driver options to mitigate the issue:
a) Throttle the hw queue depth
- harms high-depth single-threaded workloads
b) Throttle the number of IO queues
- harms low-depth multi-threaded workloads
c) Throttle max transfer size
- harms large sequential workloads
d) Delay dispatch based on outstanding data transfer
- requires hot-path atomics
e) Increase IO Timeout
This RFC implements option 'e', increasing the timeout. The timeout is
calculated based on the largest possible outstanding data transfer
against the device's available bandwidth. The link time is arbitrarily
doubled to allow for additional device side latency and potential link
sharing with another device.
The obvious downside to this option is that it may take a long time for
the driver to notice a stuck controller.
Any other ideas?
Signed-off-by: Keith Busch <kbusch@kernel.org>
---
drivers/nvme/host/pci.c | 43 ++++++++++++++++++++++++++++++++++++++++-
1 file changed, 42 insertions(+), 1 deletion(-)
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 7fc992a99624..556aba525095 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2424,6 +2424,40 @@ static bool __nvme_disable_io_queues(struct nvme_dev *dev, u8 opcode)
return true;
}
+static u32 nvme_calculate_timeout(struct nvme_dev *dev)
+{
+ u32 timeout;
+
+ u32 max_bytes = dev->ctrl.max_hw_sectors << SECTOR_SHIFT;
+
+ u32 max_prps = DIV_ROUND_UP(max_bytes, NVME_CTRL_PAGE_SIZE);
+ u32 max_prp_lists = DIV_ROUND_UP(max_prps * sizeof(__le64),
+ NVME_CTRL_PAGE_SIZE);
+ u32 max_prp_list_size = NVME_CTRL_PAGE_SIZE * max_prp_lists;
+
+ u32 total_depth = dev->tagset.nr_hw_queues * dev->tagset.queue_depth;
+
+ /* Max outstanding NVMe data transfer scenario in MiB */
+ u32 max_xfer = (total_depth * (max_bytes +
+ sizeof(struct nvme_command) +
+ sizeof(struct nvme_completion) +
+ max_prp_list_size + 16)) >> 20;
+
+ u32 bw = pcie_bandwidth_available(to_pci_dev(dev->dev), NULL, NULL,
+ NULL);
+
+ /*
+ * PCIe overhead based on worst case MPS achieves roughly 86% link
+ * efficiency.
+ */
+ bw = bw * 86 / 100;
+ timeout = DIV_ROUND_UP(max_xfer, bw);
+
+ /* Double the time to generously allow for device side overhead */
+ return (2 * timeout) * HZ;
+
+}
+
static void nvme_dev_add(struct nvme_dev *dev)
{
int ret;
@@ -2434,7 +2468,6 @@ static void nvme_dev_add(struct nvme_dev *dev)
dev->tagset.nr_maps = 2; /* default + read */
if (dev->io_queues[HCTX_TYPE_POLL])
dev->tagset.nr_maps++;
- dev->tagset.timeout = NVME_IO_TIMEOUT;
dev->tagset.numa_node = dev->ctrl.numa_node;
dev->tagset.queue_depth = min_t(unsigned int, dev->q_depth,
BLK_MQ_MAX_DEPTH) - 1;
@@ -2442,6 +2475,14 @@ static void nvme_dev_add(struct nvme_dev *dev)
dev->tagset.flags = BLK_MQ_F_SHOULD_MERGE;
dev->tagset.driver_data = dev;
+ dev->tagset.timeout = max_t(unsigned int,
+ nvme_calculate_timeout(dev),
+ NVME_IO_TIMEOUT);
+
+ if (dev->tagset.timeout > NVME_IO_TIMEOUT)
+ dev_warn(dev->ctrl.device,
+ "max possible latency exceeds default timeout:%u; set to %u\n",
+ NVME_IO_TIMEOUT, dev->tagset.timeout);
/*
* Some Apple controllers requires tags to be unique
* across admin and IO queue, so reserve the first 32
--
2.25.4
_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme
* Re: [PATCH] nvme-pci: calculate IO timeout
2021-10-13 2:27 [PATCH] nvme-pci: calculate IO timeout Keith Busch
@ 2021-10-13 5:03 ` Christoph Hellwig
2021-10-13 10:53 ` Sagi Grimberg
2021-10-13 15:34 ` Ming Lei
2 siblings, 0 replies; 6+ messages in thread
From: Christoph Hellwig @ 2021-10-13 5:03 UTC (permalink / raw)
To: Keith Busch; +Cc: linux-nvme, sagi, hch, axboe, linux-block
On Tue, Oct 12, 2021 at 07:27:44PM -0700, Keith Busch wrote:
> e) Increase IO Timeout
>
> This RFC implements option 'e', increasing the timeout. The timeout is
> calculated based on the largest possible outstanding data transfer
> against the device's available bandwidth. The link time is arbitrarily
> doubled to allow for additional device side latency and potential link
> sharing with another device.
>
> The obvious downside to this option is that it may take a long time for
> the driver to notice a stuck controller.
Besides the timeout, the amount of data in flight also means horrible
tail latencies. I suspect that in the short run reducing both the
maximum I/O size and the maximum queue depth might be a good idea,
preferably based on the link speed as your patch already does. That
is, based on the max timeout, make sure we're not likely to exceed it.
In the long run we need to be able to do some throttling based on the
amount of data in flight. I suspect blk-qos or an I/O scheduler would
be the right place for that.
>
> Any other ideas?
>
> Signed-off-by: Keith Busch <kbusch@kernel.org>
> ---
> drivers/nvme/host/pci.c | 43 ++++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 42 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 7fc992a99624..556aba525095 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -2424,6 +2424,40 @@ static bool __nvme_disable_io_queues(struct nvme_dev *dev, u8 opcode)
> return true;
> }
>
> +static u32 nvme_calculate_timeout(struct nvme_dev *dev)
> +{
> + u32 timeout;
> +
> + u32 max_bytes = dev->ctrl.max_hw_sectors << SECTOR_SHIFT;
> +
> + u32 max_prps = DIV_ROUND_UP(max_bytes, NVME_CTRL_PAGE_SIZE);
> + u32 max_prp_lists = DIV_ROUND_UP(max_prps * sizeof(__le64),
> + NVME_CTRL_PAGE_SIZE);
> + u32 max_prp_list_size = NVME_CTRL_PAGE_SIZE * max_prp_lists;
> +
> + u32 total_depth = dev->tagset.nr_hw_queues * dev->tagset.queue_depth;
> +
> + /* Max outstanding NVMe data transfer scenario in MiB */
> + u32 max_xfer = (total_depth * (max_bytes +
> + sizeof(struct nvme_command) +
> + sizeof(struct nvme_completion) +
> + max_prp_list_size + 16)) >> 20;
> +
> + u32 bw = pcie_bandwidth_available(to_pci_dev(dev->dev), NULL, NULL,
> + NULL);
> +
> + /*
> + * PCIe overhead based on worst case MPS achieves roughly 86% link
> + * efficiency.
> + */
> + bw = bw * 86 / 100;
> + timeout = DIV_ROUND_UP(max_xfer, bw);
> +
> + /* Double the time to generously allow for device side overhead */
> + return (2 * timeout) * HZ;
> +
> +}
> +
> static void nvme_dev_add(struct nvme_dev *dev)
> {
> int ret;
> @@ -2434,7 +2468,6 @@ static void nvme_dev_add(struct nvme_dev *dev)
> dev->tagset.nr_maps = 2; /* default + read */
> if (dev->io_queues[HCTX_TYPE_POLL])
> dev->tagset.nr_maps++;
> - dev->tagset.timeout = NVME_IO_TIMEOUT;
> dev->tagset.numa_node = dev->ctrl.numa_node;
> dev->tagset.queue_depth = min_t(unsigned int, dev->q_depth,
> BLK_MQ_MAX_DEPTH) - 1;
> @@ -2442,6 +2475,14 @@ static void nvme_dev_add(struct nvme_dev *dev)
> dev->tagset.flags = BLK_MQ_F_SHOULD_MERGE;
> dev->tagset.driver_data = dev;
>
> + dev->tagset.timeout = max_t(unsigned int,
> + nvme_calculate_timeout(dev),
> + NVME_IO_TIMEOUT);
> +
> + if (dev->tagset.timeout > NVME_IO_TIMEOUT)
> + dev_warn(dev->ctrl.device,
> + "max possible latency exceeds default timeout:%u; set to %u\n",
> + NVME_IO_TIMEOUT, dev->tagset.timeout);
> /*
> * Some Apple controllers requires tags to be unique
> * across admin and IO queue, so reserve the first 32
> --
> 2.25.4
---end quoted text---
* Re: [PATCH] nvme-pci: calculate IO timeout
2021-10-13 2:27 [PATCH] nvme-pci: calculate IO timeout Keith Busch
2021-10-13 5:03 ` Christoph Hellwig
@ 2021-10-13 10:53 ` Sagi Grimberg
2021-10-13 15:34 ` Ming Lei
2 siblings, 0 replies; 6+ messages in thread
From: Sagi Grimberg @ 2021-10-13 10:53 UTC (permalink / raw)
To: Keith Busch, linux-nvme, hch
> Existing host and nvme device combinations are more frequently capable
> of sustaining outstanding transfer sizes exceeding the driver's default
> timeout tolerance, given the available device throughput.
>
> Let's consider a "mid" level server and controller with 128 CPUs and an
> NVMe controller with no MDTS limit (the driver will throttle to 4MiB).
>
> If we assume the driver's default 1k depth per-queue, this can allow
> 128k outstanding IO submission queue entries.
>
> If all SQ Entries are transferring the 4MiB max request, 512GB will be
> outstanding at the same time with the default 30 second timer to
> complete the entirety.
>
> If we assume a currently modern PCIe Gen4 x4 NVMe device, that amount of
> data will take ~70 seconds to transfer over the PCIe link, not
> considering the device side internal latency: timeouts and IO failures
> are therefore inevitable.
>
> There are some driver options to mitigate the issue:
>
> a) Throttle the hw queue depth
> - harms high-depth single-threaded workloads
> b) Throttle the number of IO queues
> - harms low-depth multi-threaded workloads
> c) Throttle max transfer size
> - harms large sequential workloads
> d) Delay dispatch based on outstanding data transfer
> - requires hot-path atomics
> e) Increase IO Timeout
>
> This RFC implements option 'e', increasing the timeout. The timeout is
> calculated based on the largest possible outstanding data transfer
> against the device's available bandwidth. The link time is arbitrarily
> doubled to allow for additional device side latency and potential link
> sharing with another device.
>
> The obvious downside to this option is that it may take a long time for
> the driver to notice a stuck controller.
>
> Any other ideas?
I think that in the case where the workload behaves in the worst
possible way, the admin should probably override the default
manually. I don't think it is desirable to have an absolute-worst-case
default timeout.
* Re: [PATCH] nvme-pci: calculate IO timeout
2021-10-13 2:27 [PATCH] nvme-pci: calculate IO timeout Keith Busch
2021-10-13 5:03 ` Christoph Hellwig
2021-10-13 10:53 ` Sagi Grimberg
@ 2021-10-13 15:34 ` Ming Lei
2021-10-13 15:46 ` Martin K. Petersen
2021-10-13 15:53 ` Keith Busch
2 siblings, 2 replies; 6+ messages in thread
From: Ming Lei @ 2021-10-13 15:34 UTC (permalink / raw)
To: Keith Busch; +Cc: linux-nvme, sagi, hch, Martin K. Petersen, ming.lei
On Tue, Oct 12, 2021 at 07:27:44PM -0700, Keith Busch wrote:
> Existing host and nvme device combinations are more frequently capable
> of sustaining outstanding transfer sizes exceeding the driver's default
> timeout tolerance, given the available device throughput.
>
> Let's consider a "mid" level server and controller with 128 CPUs and an
> NVMe controller with no MDTS limit (the driver will throttle to 4MiB).
>
> If we assume the driver's default 1k depth per-queue, this can allow
> 128k outstanding IO submission queue entries.
>
> If all SQ Entries are transferring the 4MiB max request, 512GB will be
> outstanding at the same time with the default 30 second timer to
> complete the entirety.
>
> If we assume a currently modern PCIe Gen4 x4 NVMe device, that amount of
> data will take ~70 seconds to transfer over the PCIe link, not
> considering the device side internal latency: timeouts and IO failures
> are therefore inevitable.
The PCIe link is supposed to be much quicker than IO handling on the
device side, so the nvme device should already be saturated before the
PCIe link is used up. Is there any event or feedback from the nvme
device side (host or device) about the saturation status?
SCSI has such a mechanism, so that the queue depth can be adjusted
according to the feedback, and Martin is familiar with this field.
Thanks,
Ming
* Re: [PATCH] nvme-pci: calculate IO timeout
2021-10-13 15:34 ` Ming Lei
@ 2021-10-13 15:46 ` Martin K. Petersen
2021-10-13 15:53 ` Keith Busch
1 sibling, 0 replies; 6+ messages in thread
From: Martin K. Petersen @ 2021-10-13 15:46 UTC (permalink / raw)
To: Ming Lei; +Cc: Keith Busch, linux-nvme, sagi, hch, Martin K. Petersen
Ming,
> SCSI has such a mechanism, so that the queue depth can be adjusted
> according to the feedback, and Martin is familiar with this field.
I'm afraid that queue busy feedback is a bit of a controversial topic in
NVMe.
--
Martin K. Petersen Oracle Linux Engineering
* Re: [PATCH] nvme-pci: calculate IO timeout
2021-10-13 15:34 ` Ming Lei
2021-10-13 15:46 ` Martin K. Petersen
@ 2021-10-13 15:53 ` Keith Busch
1 sibling, 0 replies; 6+ messages in thread
From: Keith Busch @ 2021-10-13 15:53 UTC (permalink / raw)
To: Ming Lei; +Cc: linux-nvme, sagi, hch, Martin K. Petersen
On Wed, Oct 13, 2021 at 11:34:33PM +0800, Ming Lei wrote:
> On Tue, Oct 12, 2021 at 07:27:44PM -0700, Keith Busch wrote:
> > Existing host and nvme device combinations are more frequently capable
> > of sustaining outstanding transfer sizes exceeding the driver's default
> > timeout tolerance, given the available device throughput.
> >
> > Let's consider a "mid" level server and controller with 128 CPUs and an
> > NVMe controller with no MDTS limit (the driver will throttle to 4MiB).
> >
> > If we assume the driver's default 1k depth per-queue, this can allow
> > 128k outstanding IO submission queue entries.
> >
> > If all SQ Entries are transferring the 4MiB max request, 512GB will be
> > outstanding at the same time with the default 30 second timer to
> > complete the entirety.
> >
> > If we assume a currently modern PCIe Gen4 x4 NVMe device, that amount of
> > data will take ~70 seconds to transfer over the PCIe link, not
> > considering the device side internal latency: timeouts and IO failures
> > are therefore inevitable.
>
> The PCIe link is supposed to be much quicker than IO handling on the
> device side, so the nvme device should already be saturated before the
> PCIe link is used up. Is there any event or feedback from the nvme
> device side (host or device) about the saturation status?
>
> SCSI has such a mechanism, so that the queue depth can be adjusted
> according to the feedback, and Martin is familiar with this field.
Device side saturation should be reached at lower depths than those
considered here, and that usually happens without reaching link
saturation.
We do not really have event feedback for the NVMe driver to react to
though, so I had this patch cautiously assume 50% throughput for timeout
consideration.
I suppose we could react to the IO completion times and try to adjust
queue depths accordingly, though that is probably more aligned with a
longer term project.