From: Hannes Reinecke <hare@suse.de>
To: Sagi Grimberg <sagi@grimberg.me>, Hannes Reinecke <hare@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>, Keith Busch <kbusch@kernel.org>,
linux-nvme@lists.infradead.org
Subject: Re: [PATCH 2/4] nvme-tcp: align I/O cpu with blk-mq mapping
Date: Thu, 4 Jul 2024 16:03:23 +0200 [thread overview]
Message-ID: <33fd9417-28e2-45aa-9b8a-c27c84f54ac7@suse.de> (raw)
In-Reply-To: <37c87a22-ef34-4ed7-ab6c-14c23b11adef@grimberg.me>
On 7/4/24 11:07, Sagi Grimberg wrote:
>
>
> On 7/4/24 09:43, Hannes Reinecke wrote:
>> On 7/3/24 21:38, Sagi Grimberg wrote:
>> [ .. ]
>>>>>
>>>>> We should make the io_cpu come from blk-mq hctx mapping by default,
>>>>> and for every controller it should use a different cpu from the
>>>>> hctx mapping. That is the default behavior. in the wq_unbound case,
>>>>> we skip all of that and make io_cpu = WORK_CPU_UNBOUND, as it was
>>>>> before.
>>>>>
>>>>> I'm not sure I follow your logic.
>>>>>
>>>> Hehe. That's quite simple: there is none :-)
>>>> I have been tinkering with that approach in the last weeks, but got
>>>> consistently _worse_ results than with the original implementation.
>>>> So I gave up on trying to make that the default.
>>>
>>> What is the "original implementation" ?
>>
>> nvme-6.10
>>
>>> What is your target? nvmet?
>>
>> nvmet with brd backend
>>
>>> What is the fio job file you are using?
>>
>> tiobench-example.fio from the fio samples
>>
>>> what is the queue count? controller count?
>>
>> 96 queues, 4 subsystems, 2 controllers each.
>>
>>> What was the queue mapping?
>>>
>> queue 0-5 maps to cpu 6-11
>> queue 6-11 maps to cpu 54-59
>> queue 12-17 maps to cpu 18-23
>> queue 18-23 maps to cpu 66-71
>> queue 24-29 maps to cpu 24-29
>> queue 30-35 maps to cpu 72-77
>> queue 36-41 maps to cpu 30-35
>> queue 42-47 maps to cpu 78-83
>> queue 48-53 maps to cpu 36-41
>> queue 54-59 maps to cpu 84-89
>> queue 60-65 maps to cpu 42-47
>> queue 66-71 maps to cpu 90-95
>> queue 72-77 maps to cpu 12-17
>> queue 78-83 maps to cpu 60-65
>> queue 84-89 maps to cpu 0-5
>> queue 90-95 maps to cpu 48-53
>
> What are the io_cpu for each queue?
>
These are the io_cpu mappings; 'maps to' gives the value of io_cpu.
I.e. queue 0 has an io_cpu value of 6.
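To make the spreading idea concrete, here is a minimal userspace model;
the helper name pick_io_cpu and the round-robin-by-controller-index
scheme are illustrative only, not the actual nvme-tcp code:

```c
#include <stddef.h>

/*
 * Illustrative only: pick an io_cpu for a queue by cycling through the
 * CPUs that blk-mq mapped to this hctx, offset by the controller index
 * so that multiple controllers do not all pile onto the same CPU.
 *
 * E.g. with hctx_cpus = {6..11} (queue 0 above), controller 0 lands on
 * CPU 6, controller 1 on CPU 7, and controller 7 wraps around to CPU 7.
 */
static int pick_io_cpu(const int *hctx_cpus, size_t ncpus, int ctrl_idx)
{
	if (ncpus == 0)
		return -1;	/* no mapping: caller falls back to unbound */
	return hctx_cpus[ctrl_idx % ncpus];
}
```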
>>
>>> Please let's NOT condition any of this on the wq_unbound option at
>>> this point. This modparam was introduced to address
>>> a specific issue. If we see I/O timeouts, we should fix them, not tell
>>> people to flip a modparam as a solution.
>>>
>> Thing is, there is no 'best' solution. The current implementation is
>> actually quite good in the single subsystem case. Issues start to appear
>> when doing performance testing under a really high load.
>> The reason is high contention on the per-cpu workqueues, which
>> are simply overwhelmed by doing I/O _and_ servicing 'normal' OS workload
>> like writing to disk etc.
>
> What other 'normal' workloads are you seeing in your test?
>
Well, it is an OS, after all. And that OS has some housekeeping to do
like writing out the fio output etc.
>> Switching to wq_unbound reduces the contention and makes the system
>> scale better, but that scaling leads to a performance regression for
>> the single subsystem case.
>
> The kthreads are the same kthreads, the only difference is concurrency
> management, which may take advantage of different cpu cores, but pays
> a price in latency and bouncing cpus.
>
Precisely. That's why I'm seeing slightly better performance with the
existing implementation for single subsystems.
>> (See my other mail for performance numbers)
>> So what is 'better'?
>
> Which email? To Tejun? it seems that bound is better than unbound for
> all cases. You are suggesting that you regress with multiple controllers
> though. That is why I suggested that we _spread_ queue io_cpu
> assignments for the bound case (as I suggested in your first attempt),
> I'd want to see what happens in this case.
>
Correct; this is for the single subsystem case.
For which the current implementation is more efficient than the unbound one.
>>
>>>>
>>>>>>
>>>>>> And it makes the 'CPU hogged' messages go away, which is a bonus
>>>>>> in itself...
>>>>>
>>>>> Which messages? aren't these messages saying that the work spent
>>>>> too much time? why are you describing the case where the work does
>>>>> not get
>>>>> cpu quota to run?
>>>>
>>>> I mean these messages:
>>>>
>>>> workqueue: nvme_tcp_io_work [nvme_tcp] hogged CPU for >10000us 32771
>>>> times, consider switching to WQ_UNBOUND
>>>
>>> That means that we are spending too much time in io_work. This is a
>>> separate bug. If you look at nvme_tcp_io_work it has
>>> a stop condition after 1 millisecond. However, when we call
>>> nvme_tcp_try_recv() it just keeps receiving from the socket until
>>> the socket receive buffer has no more payload. So in theory nothing
>>> prevents the io_work from looping there forever.
>>>
>> Oh, no. It's not the loop that is the problem. It's the actual sending
>> which takes too long; in my test runs I've seen about 250 requests
>> timing out, the majority of which were still pending on the send_list.
>> So the io_work function wasn't even running to fetch the requests off
>> the list.
>
> That is unexpected. It means that either the socket buffer is full
> (is it?), or that the rx is taking a long time, taking
> away tx cpu time. I am not sure I understand why an unbound wq would
> solve either, other than maybe hiding it in
> some cases.
>
> In what workloads do you see this issue? reads/writes?
>
Both, unfortunately.
Funny data point, though: for 2 subsystems I see I/O timeouts when
writing the fio log to disk. There are no I/O timeouts when fio just
prints to stdout.
So it _is_ the OS workload that drives the starvation.
> Can you try the following patch:
> --
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 0b04a2bde81d..3360de9ef034 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -350,7 +350,7 @@ static inline void nvme_tcp_advance_req(struct nvme_tcp_request *req,
> }
> }
>
> -static inline void nvme_tcp_send_all(struct nvme_tcp_queue *queue)
> +static inline int nvme_tcp_send_all(struct nvme_tcp_queue *queue)
> {
> int ret;
>
> @@ -358,6 +358,7 @@ static inline void nvme_tcp_send_all(struct nvme_tcp_queue *queue)
> do {
> ret = nvme_tcp_try_send(queue);
> } while (ret > 0);
> + return ret;
> }
>
> static inline bool nvme_tcp_queue_has_pending(struct nvme_tcp_queue *queue)
> @@ -1276,7 +1277,7 @@ static void nvme_tcp_io_work(struct work_struct *w)
> int result;
>
> if (mutex_trylock(&queue->send_mutex)) {
> - result = nvme_tcp_try_send(queue);
> + result = nvme_tcp_send_all(queue);
> mutex_unlock(&queue->send_mutex);
> if (result > 0)
> pending = true;
> --
>
> Just to understand if there is a starvation problem where in io_work()
> the tx is paced, but the rx is not.
Surprise, surprise, that does work.
Together with aligning io_cpu with the blk-mq mapping I see a
substantial improvement for the 4 subsystem case; no I/O timeouts,
and contention as measured by the number of 'nvme_tcp_io_work [nvme_tcp]
hogged CPU' messages has been reduced, too.
Performance impact is mixed; you win some, you lose some, with nothing
conclusive either way.
>>
>>> This is indeed a bug that we need to address. Probably by setting
>>> rd_desc.count to some limit, decrement it for every
>>> skb that we consume, and if we reach that limit and there are more
>>> skbs pending, we break and self-requeue.
>>>
>>> If we indeed spend much time processing a single queue in io_work, it
>>> is possible that we have a starvation problem
>>> that is escalating to the timeouts you are seeing.
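A minimal userspace sketch of that budget idea; the name
recv_with_budget and the loop shape are made up here, and the real fix
would go through rd_desc.count in the skb receive path:

```c
#include <stdbool.h>

/*
 * Illustrative only: consume at most 'budget' skbs per io_work
 * invocation; if payload is still pending when the budget runs out,
 * tell the caller to self-requeue instead of looping indefinitely.
 * '*pending_skbs' stands in for the socket receive queue depth.
 */
static bool recv_with_budget(int *pending_skbs, int budget)
{
	while (*pending_skbs > 0 && budget > 0) {
		(*pending_skbs)--;	/* consume one skb */
		budget--;
	}
	/* true: more payload left, io_work should requeue itself */
	return *pending_skbs > 0;
}
```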
>>>
>> See above; this is the problem. Most of the requests are still stuck
>> on the send_list (with some even still on the req_list) when timeouts
>> occur. This means the io_work function is not being scheduled fast
>> enough (or often enough) to fetch the requests from the list.
>
> Or, maybe some other root-cause is creating a large backlog of send
> requests.
>
What definitely contributes to the backlog is cpu time starvation.
With the original implementation I can easily trigger I/O timeouts by
writing the fio output to disk; writing to stdout pretty much avoids
the I/O timeouts.
I'll see about getting some latency numbers for send and receive.
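For illustration, the req_list/send_list staging referred to above can
be modeled in a few lines of userspace C; this is only a model of the
two-stage flow, not the driver's lockless llist implementation:

```c
/*
 * Illustrative model of the two-stage queueing: request submission
 * appends to req_list; io_work splices req_list into send_list and
 * transmits from there. If io_work does not get scheduled, requests
 * pile up on both lists -- which is what the timed-out requests
 * looked like.
 */
#define MAXQ 16

struct queue_model {
	int req_list[MAXQ], nreq;	/* stage 1: freshly queued */
	int send_list[MAXQ], nsend;	/* stage 2: ready to transmit */
	int sent;			/* put on the wire */
};

static void submit(struct queue_model *q, int id)
{
	q->req_list[q->nreq++] = id;
}

static void io_work_model(struct queue_model *q)
{
	int i;

	/* splice req_list onto send_list */
	for (i = 0; i < q->nreq; i++)
		q->send_list[q->nsend++] = q->req_list[i];
	q->nreq = 0;

	/* transmit everything currently on send_list */
	q->sent += q->nsend;
	q->nsend = 0;
}
```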
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich