From: Samuel Jones <sjones@kalrayinc.com>
To: Keith Busch <kbusch@kernel.org>
Cc: linux-nvme <linux-nvme@lists.infradead.org>
Subject: Re: Command timeouts with NVMe TCP kernel driver
Date: Wed, 1 Sep 2021 00:46:52 +0200 (CEST)
Message-ID: <1306302519.5246175.1630450012705.JavaMail.zimbra@kalray.eu>
In-Reply-To: <20210831023712.GA4046@dhcp-10-100-145-180.wdc.com>
> On Mon, Aug 30, 2021 at 03:30:29PM +0200, Samuel Jones wrote:
>> Hi all,
>>
>> I'm experiencing command timeouts with recent versions of the nvme-tcp driver. I
>> have a setup with a VM running 5.8 which is OK, but the same test load (100% read,
>> 16K blocks) on a VM running 5.11 or later, including 5.14-rc7, exhibits the issue.
>> The initiator complains as follows:
>>
>> Aug 30 14:58:05 centos7 kernel: nvme nvme0: queue 7: timeout request 0x10 type 4
>> Aug 30 14:58:05 centos7 kernel: nvme nvme0: starting error recovery
>>
>> My target is a Kalray board, but I have no trace of any outstanding commands
>> when this situation occurs. Quite the opposite: the Kalray board observes that
>> the initiator stops sending new requests on the queue some time before this
>> command times out.
>>
>> I don't have any similar issues using SPDK as an initiator. I made the following
>> modification to the nvme-tcp code and my problem disappeared:
>>
>> --- a/drivers/nvme/host/tcp.c
>> +++ b/drivers/nvme/host/tcp.c
>> @@ -287,6 +287,7 @@ static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
>> * directly, otherwise queue io_work. Also, only do that if we
>> * are on the same cpu, so we don't introduce contention.
>> */
>> +#if 0
>> if (queue->io_cpu == raw_smp_processor_id() &&
>> sync && empty && mutex_trylock(&queue->send_mutex)) {
>> queue->more_requests = !last;
>> @@ -296,6 +297,9 @@ static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
>> } else if (last) {
>> queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
>> }
>> +#else
>> + queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
>> +#endif
>> }
>>
>> To be honest, this bit of code has always bothered me: I don't understand how we
>> can guarantee that the thread doesn't change CPU between the call to
>> raw_smp_processor_id() and the trylock. I assume that even if this does occur,
>> the fact that we hold queue->send_mutex is supposed to ensure that there
>> are no atomicity issues, but I'm concerned about the logic that tries to
>> determine whether there is any more pending work and reschedules io_work if and
>> only if there is something to do. I can't pinpoint an issue from just reading the
>> code, though.
>>
>> Does anyone have any thoughts on this? Is my patch proof that there is an issue
>> in this code, or am I changing the program flow so significantly that we can't
>> draw any conclusions? Are there any open issues with command timeouts at the
>> moment?
>
> I've seen similar, but I thought we'd fixed all those issues. Your PoC
> patch indicates to me that there is still at least one more condition
> that's not being considered. I don't think the possibility of CPU
> rescheduling is the culprit, though.
>
> One thing that looks problematic is if 'bd.last' is true but
> nvme_tcp_queue_rq() fails early after enqueuing previous commands in the
> sequence. That could cause the io_work to never kick. There should have
> been something else in dmesg indicating that error, though, so I'll look
> for some other possibility.
>
Thanks for your help on this, Keith. I think I have a better idea of what is happening now. First of all, I have a large number of devices exposed over NVMe-TCP: I'm connecting to 4 remote NVMe devices, each of which is multi-namespace, so I have a total of 37 block devices at once. I then run a userspace benchmark with 8 threads per block device, generating IO to each of them. I'm not sure how the multi-queue block devices are layered over the X NVMe queues per device, but however it works, my system is on its knees.

I have been measuring the time spent by NVMe-TCP threads inside kernel_sendpage() within nvme_tcp_try_send_cmd_pdu() when called from nvme_tcp_queue_rq(), and I'm seeing some ridiculous outliers: up to 60 seconds. I am confident this is what is responsible for my timeouts: if I increase nvme_io_timeout to 300 seconds, my test case passes.
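Roughly speaking, the instrumentation looks like this (a minimal sketch rather than the exact patch; the 1 ms warning threshold is an arbitrary value chosen just to surface outliers):

	/* Sketch: time the existing kernel_sendpage() call inside
	 * nvme_tcp_try_send_cmd_pdu(); ret, queue, pdu, req, len and
	 * flags are the function's existing locals. */
	ktime_t start = ktime_get();
	s64 delta_us;

	ret = kernel_sendpage(queue->sock, virt_to_page(pdu),
			offset_in_page(pdu) + req->offset, len, flags);

	delta_us = ktime_us_delta(ktime_get(), start);
	if (delta_us > 1000)
		pr_warn("nvme-tcp: queue %d: kernel_sendpage took %lld us\n",
			nvme_tcp_queue_id(queue), delta_us);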
What's interesting is that my quick patch, which removes the direct call into kernel_sendpage() from queue_rq(), makes the system behave much better: no more outliers and no timeouts. I don't know much about the kernel network stack or the block layer, but I imagine the difference is due to the number of calls into the network stack that can happen in parallel. Is queue_rq() executed in the context of the userspace caller? In that case I would have close to 300 threads active on this code path. Whereas when we go through the workqueue, I guess we have at most N calls into the network stack at once, where N is the number of CPUs on the system?
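For context, here is the branch in question, paraphrased from nvme_tcp_queue_request() in drivers/nvme/host/tcp.c (circa 5.14); the comments contrasting the two paths are mine:

	static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
			bool sync, bool last)
	{
		struct nvme_tcp_queue *queue = req->queue;
		bool empty;

		empty = llist_add(&req->lentry, &queue->req_list) &&
			list_empty(&queue->send_list) && !queue->request;

		if (queue->io_cpu == raw_smp_processor_id() &&
		    sync && empty && mutex_trylock(&queue->send_mutex)) {
			/* Direct path: the submitting context itself calls
			 * into the network stack, so with many submitters,
			 * many threads can be inside the TCP send path at
			 * once. */
			queue->more_requests = !last;
			nvme_tcp_send_all(queue);
			queue->more_requests = false;
			mutex_unlock(&queue->send_mutex);
		} else if (last) {
			/* Workqueue path: sends for this queue are funneled
			 * into a single io_work item pinned to queue->io_cpu,
			 * so at most one sender per queue enters the network
			 * stack at a time. */
			queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
		}
	}

If that reading is right, the direct path scales with the number of submitting threads, while the workqueue path is bounded by one in-flight sender per queue.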
What do you think? Does my analysis make any sense?
Thanks
Sam