From: Ping Gan <jacky_gam_2001@163.com>
To: sagi@grimberg.me, hch@lst.de, kch@nvidia.com,
linux-nvme@lists.infradead.org, linux-kernel@vger.kernel.org
Cc: ping.gan@dell.com
Subject: Re: [PATCH 0/2] nvmet: support polling task for RDMA and TCP
Date: Thu, 4 Jul 2024 18:35:32 +0800
Message-ID: <20240704103533.68118-1-jacky_gam_2001@163.com>
In-Reply-To: <2797ce03-a52a-459a-a1a1-7591233f51cd@grimberg.me>
> On 7/4/24 11:10, Ping Gan wrote:
>>> On 02/07/2024 13:02, Ping Gan wrote:
>>>>> On 01/07/2024 10:42, Ping Gan wrote:
>>>>>>> Hey Ping Gan,
>>>>>>>
>>>>>>>
>>>>>>> On 26/06/2024 11:28, Ping Gan wrote:
>>>>>>>> When running nvmf on an SMP platform, the current nvme target's
>>>>>>>> RDMA and TCP transports use kworkers to handle IO. But if there
>>>>>>>> is other heavy workload in the system (e.g. on Kubernetes), the
>>>>>>>> competition between the kworkers and the other workloads is
>>>>>>>> intense. And since the kworkers are scheduled by the OS
>>>>>>>> randomly, it's difficult to control OS resources and to tune
>>>>>>>> performance. If the target supports using a dedicated polling
>>>>>>>> task to handle IO, it becomes easier to control OS resources
>>>>>>>> and gain good performance. So it makes sense to add a polling
>>>>>>>> task to the nvmet-rdma and nvmet-tcp modules.
>>>>>>> This is NOT the way to go here.
>>>>>>>
>>>>>>> Both rdma and tcp are driven from workqueue context, which are
>>>>>>> bound workqueues.
>>>>>>>
>>>>>>> So there are two ways to go here:
>>>>>>> 1. Add generic port cpuset and use that to direct traffic to the
>>>>>>>    appropriate set of cores (i.e. select an appropriate
>>>>>>>    comp_vector for rdma and add an appropriate steering rule for
>>>>>>>    tcp).
>>>>>>> 2. Add options to rdma/tcp to use UNBOUND workqueues, and allow
>>>>>>>    users to control these UNBOUND workqueues cpumask via sysfs.
>>>>>>>
>>>>>>> (2) will not control interrupts to steer to other workloads cpus,
>>>>>>> but the handlers may run on a set of dedicated cpus.
>>>>>>>
>>>>>>> (1) is a better solution, but harder to implement.
>>>>>>>
>>>>>>> You also should look into nvmet-fc as well (and nvmet-loop for
>>>>>>> that matter).
>>>>>> Hi Sagi Grimberg,
>>>>>> Thanks for your reply. Actually, we had tried the first approach
>>>>>> you suggested, but we found the performance was poor when using
>>>>>> spdk as the initiator.
>>>>> I suggest that you focus on that instead of what you proposed.
>>>>> What is the source of your poor performance?
>>>> Before these patches, we had used Linux's RPS to forward the
>>>> packets to a fixed cpu set for nvmet-tcp. But even with that, we
>>>> still could not eliminate the competition between softirq and the
>>>> workqueue, since the nvme target's kworker is bound to the socket's
>>>> cpu, which comes from the skb. Besides that, we found the
>>>> workqueue's wait latency was very high even when we enabled polling
>>>> on nvmet-tcp via the module parameter idle_poll_period_usecs. So
>>>> when the initiator is in polling mode, the target's workqueue is
>>>> the bottleneck. Below is the trace log of the work's wait latency
>>>> from a test on our cluster (each node has 4 NUMA nodes, 96 cores,
>>>> 192G memory, and one dual-port Mellanox CX4LX (25Gbps x 2) ethernet
>>>> adapter; randrw with 1M IO size) with RPS to 6 cpu cores. The
>>>> system's CPU and memory utilization was about 80%.
>>> I'd try a simple unbound CPU case, steer packets to say cores [0-5]
>>> and assign the cpumask of the unbound workqueue to cores [6-11].
>> Okay, thanks for your guidance.
>>
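As a rough illustration of the split suggested above (packets steered to
cores [0-5], the unbound workqueue restricted to cores [6-11]), the sketch
below computes the hex CPU masks that sysfs mask files expect. The helper
name cpumask_hex is hypothetical. The sysfs paths in the comments are
assumptions about the setup: rps_cpus exists per RX queue, but a workqueue
only exposes a cpumask file if it was created with WQ_SYSFS, which the
patches discussed here may or may not do.

```python
def cpumask_hex(cpus):
    """Return a sysfs-style hex cpumask for an iterable of CPU ids.

    Masks wider than 32 bits are split into comma-separated 32-bit
    groups, most significant group first, as the kernel's bitmap
    parser expects.
    """
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    hexstr = format(mask, "x")
    groups = []
    while hexstr:
        groups.append(hexstr[-8:])   # take the lowest 32 bits (8 hex digits)
        hexstr = hexstr[:-8]
    return ",".join(reversed(groups))


# Cores that receive steered packets (RPS), e.g. written to
# /sys/class/net/<iface>/queues/rx-<n>/rps_cpus  (iface and n are placeholders)
rps_mask = cpumask_hex(range(0, 6))

# Cores for the unbound workqueue, e.g. written to
# /sys/devices/virtual/workqueue/<wq-name>/cpumask
# (assumes the workqueue is exported with WQ_SYSFS)
wq_mask = cpumask_hex(range(6, 12))

print("rps_cpus mask:", rps_mask)
print("workqueue cpumask:", wq_mask)
```

Keeping the two masks disjoint is the point of the experiment: softirq
processing and workqueue handlers then no longer compete for the same cores.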
>>>> ogden-brown:~ # /usr/share/bcc/tools/wqlat -T -w nvmet_tcp_wq 1 2
>>>> 01:06:59
>>>> usecs : count distribution
>>>> 0 -> 1 : 0 | |
>>>> 2 -> 3 : 0 | |
>>>> 4 -> 7 : 0 | |
>>>> 8 -> 15 : 3 | |
>>>> 16 -> 31 : 10 | |
>>>> 32 -> 63 : 3 | |
>>>> 64 -> 127 : 2 | |
>>>> 128 -> 255 : 0 | |
>>>> 256 -> 511 : 5 | |
>>>> 512 -> 1023 : 12 | |
>>>> 1024 -> 2047 : 26 |* |
>>>> 2048 -> 4095 : 34 |* |
>>>> 4096 -> 8191 : 350 |************ |
>>>> 8192 -> 16383 : 625 |******************************|
>>>> 16384 -> 32767 : 244 |********* |
>>>> 32768 -> 65535 : 39 |* |
>>>>
>>>> 01:07:00
>>>> usecs : count distribution
>>>> 0 -> 1 : 1 | |
>>>> 2 -> 3 : 0 | |
>>>> 4 -> 7 : 4 | |
>>>> 8 -> 15 : 3 | |
>>>> 16 -> 31 : 8 | |
>>>> 32 -> 63 : 10 | |
>>>> 64 -> 127 : 3 | |
>>>> 128 -> 255 : 6 | |
>>>> 256 -> 511 : 8 | |
>>>> 512 -> 1023 : 20 |* |
>>>> 1024 -> 2047 : 19 |* |
>>>> 2048 -> 4095 : 57 |** |
>>>> 4096 -> 8191 : 325 |**************** |
>>>> 8192 -> 16383 : 647 |******************************|
>>>> 16384 -> 32767 : 228 |*********** |
>>>> 32768 -> 65535 : 43 |** |
>>>> 65536 -> 131071 : 1 | |
>>>>
>>>> And the bandwidth of a node is only 3100MB. When we applied the
>>>> patches and enabled 6 polling tasks, the bandwidth reached 4000MB.
>>>> It's a good improvement.
>>> I think you will see similar performance with unbound workqueue and
>>> rps.
>> Yes, I modified the nvmet-tcp/nvmet-rdma code to support unbound
>> workqueues, reran the test under the same conditions as above, and
>> compared the results of the unbound workqueue with the polling task.
>> The unbound workqueue performs well. For TCP with the unbound
>> workqueue we got 3850M/node, almost equal to the polling task. For
>> nvmet-rdma we got 5100M/node with the unbound workqueue versus 5600M
>> with the polling task, so the difference seems small. Anyway, your
>> advice is good.
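For reference, the relative gaps behind these figures can be worked out
directly. The sketch below uses only the per-node bandwidth numbers quoted
in this thread; the function name pct_delta and the dictionary layout are
made up for illustration.

```python
def pct_delta(value, baseline):
    """Relative difference of value against baseline, in percent."""
    return (value - baseline) / baseline * 100.0


# Per-node bandwidth figures quoted in the thread (same unit throughout).
tcp_bound_wq = 3100      # nvmet-tcp, bound workqueue + RPS
tcp_polling = 4000       # nvmet-tcp, 6 polling tasks
tcp_unbound_wq = 3850    # nvmet-tcp, unbound workqueue
rdma_polling = 5600      # nvmet-rdma, polling task
rdma_unbound_wq = 5100   # nvmet-rdma, unbound workqueue

comparisons = {
    "tcp polling vs bound wq": pct_delta(tcp_polling, tcp_bound_wq),
    "tcp unbound wq vs polling": pct_delta(tcp_unbound_wq, tcp_polling),
    "rdma unbound wq vs polling": pct_delta(rdma_unbound_wq, rdma_polling),
}

for name, delta in comparisons.items():
    print(f"{name}: {delta:+.2f}%")
```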
>
> I'm a bit surprised that you see ~10% delta here. I would look into
> what is the root-cause of this difference. If indeed the load is high,
> the overhead of the workqueue mgmt should be negligible. I'm assuming
> you used IB_POLL_UNBOUND_WORKQUEUE ?
Yes, we used IB_POLL_UNBOUND_WORKQUEUE to create the ib CQ. I observed
3% CPU usage for the unbound workqueue versus 6% for the polling task.
>> Do you think we should submit the unbound workqueue patches for
>> nvmet-tcp and nvmet-rdma to upstream nvmet?
>
> For nvmet-tcp, I think there is merit to split socket processing from
> napi context. For nvmet-rdma I think the only difference is if you have
> multiple CQs assigned with the same comp_vector.
>
> How many queues do you have in your test?
We used 24 IO queues to the nvmet-rdma target. I think this may also be
related to the workqueue's wait latency. We still see wait latencies of
several ms for the unbound workqueue with RDMA. See the trace log below.
ogden-brown:~ # /usr/share/bcc/tools/wqlat -T -w ib-comp-unb-wq 1 3
Tracing work queue request latency time... Hit Ctrl-C to end.
10:09:10
usecs : count distribution
0 -> 1 : 6 | |
2 -> 3 : 105 |** |
4 -> 7 : 1732 |******************************|
8 -> 15 : 1597 |******************************|
16 -> 31 : 526 |************ |
32 -> 63 : 543 |************ |
64 -> 127 : 950 |********************* |
128 -> 255 : 1335 |***************************** |
256 -> 511 : 1534 |******************************|
512 -> 1023 : 1039 |*********************** |
1024 -> 2047 : 592 |************* |
2048 -> 4095 : 112 |** |
4096 -> 8191 : 6 | |

10:09:11
usecs : count distribution
0 -> 1 : 3 | |
2 -> 3 : 62 |* |
4 -> 7 : 1459 |***************************** |
8 -> 15 : 1869 |******************************|
16 -> 31 : 612 |************* |
32 -> 63 : 478 |********** |
64 -> 127 : 844 |****************** |
128 -> 255 : 1123 |************************ |
256 -> 511 : 1278 |*************************** |
512 -> 1023 : 1113 |*********************** |
1024 -> 2047 : 632 |************* |
2048 -> 4095 : 158 |*** |
4096 -> 8191 : 18 | |
8192 -> 16383 : 1 | |

10:09:12
usecs : count distribution
0 -> 1 : 1 | |
2 -> 3 : 68 |* |
4 -> 7 : 1399 |*************************** |
8 -> 15 : 1822 |******************************|
16 -> 31 : 559 |************ |
32 -> 63 : 513 |*********** |
64 -> 127 : 906 |******************* |
128 -> 255 : 1217 |*********************** |
256 -> 511 : 1391 |*************************** |
512 -> 1023 : 1135 |************************ |
1024 -> 2047 : 569 |************ |
2048 -> 4095 : 110 |** |
4096 -> 8191 : 26 | |
8192 -> 16383 : 11 | |
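
To put a number on the latency tail in these traces, the sketch below
parses wqlat-style bucket lines and reports the share of work items that
waited at least 1024 usecs. The bucket counts are copied from the 10:09:12
sample above; the function names and the 1024-usec threshold are choices
made for this illustration only, not part of wqlat itself.

```python
import re

# Matches bucket lines such as "1024 -> 2047 : 569 |****** |"
BUCKET_RE = re.compile(r"(\d+)\s*->\s*(\d+)\s*:\s*(\d+)")


def parse_buckets(text):
    """Return a list of (low_usecs, high_usecs, count) tuples."""
    buckets = []
    for line in text.splitlines():
        match = BUCKET_RE.search(line)
        if match:
            buckets.append(tuple(int(g) for g in match.groups()))
    return buckets


def tail_fraction(buckets, threshold_usecs):
    """Fraction of samples in buckets whose lower bound is >= threshold."""
    total = sum(count for _, _, count in buckets)
    slow = sum(count for low, _, count in buckets if low >= threshold_usecs)
    return slow / total if total else 0.0


# Counts from the 10:09:12 ib-comp-unb-wq sample (bars omitted).
SAMPLE = """
0 -> 1 : 1
2 -> 3 : 68
4 -> 7 : 1399
8 -> 15 : 1822
16 -> 31 : 559
32 -> 63 : 513
64 -> 127 : 906
128 -> 255 : 1217
256 -> 511 : 1391
512 -> 1023 : 1135
1024 -> 2047 : 569
2048 -> 4095 : 110
4096 -> 8191 : 26
8192 -> 16383 : 11
"""

buckets = parse_buckets(SAMPLE)
share = tail_fraction(buckets, 1024)
print(f"{share:.1%} of work items waited >= 1024 usecs")
```

The same helper can be pointed at the nvmet_tcp_wq output earlier in the
thread to compare the two tails on equal terms.
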
Thanks,
Ping
Thread overview: 17+ messages
2024-06-26 8:28 [PATCH 0/2] nvmet: support polling task for RDMA and TCP Ping Gan
2024-06-26 8:28 ` [PATCH 1/2] nvmet-rdma: add polling cq task for nvmet-rdma Ping Gan
2024-06-26 8:28 ` [PATCH 2/2] nvmet-tcp: add polling task for nvmet-tcp Ping Gan
2024-06-30 8:58 ` [PATCH 0/2] nvmet: support polling task for RDMA and TCP Sagi Grimberg
2024-07-01 7:42 ` Ping Gan
2024-07-01 7:42 ` Ping Gan
2024-07-01 8:22 ` Sagi Grimberg
2024-07-02 10:02 ` Ping Gan
2024-07-02 10:02 ` Ping Gan
2024-07-03 19:58 ` Sagi Grimberg
2024-07-04 8:10 ` Ping Gan
2024-07-04 8:40 ` Sagi Grimberg
2024-07-04 10:35 ` Ping Gan [this message]
2024-07-05 5:59 ` Sagi Grimberg
2024-07-05 6:28 ` Ping Gan
2024-07-16 10:36 ` Hannes Reinecke
2024-07-17 0:53 ` Ping Gan