* [PATCHv2] nvme-tcp: align I/O cpu with blk-mq mapping
@ 2024-06-19 14:55 Hannes Reinecke
2024-06-19 14:59 ` Christoph Hellwig
0 siblings, 1 reply; 11+ messages in thread
From: Hannes Reinecke @ 2024-06-19 14:55 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Keith Busch, Sagi Grimberg, linux-nvme, Hannes Reinecke,
Hannes Reinecke
Select the first CPU from a given blk-mq hctx mapping
to queue the tcp workqueue item.
This avoids thread bouncing during I/O on machines with
an uneven cpu topology.
On an AMD EPYC system the performance increases from
Seq 4k write: 13.8 MB/s
Random 4k write: 11.3 MB/s
Seq 4k read: 13.9 MB/s
Random 4k read: 10.9 MB/s
to
Seq 4k write: 19.1 MB/s
Random 4k write: 18.1 MB/s
Seq 4k read: 15.8 MB/s
Random 4k read: 15.8 MB/s
Signed-off-by: Hannes Reinecke <hare@suse.de>
---
drivers/nvme/host/tcp.c | 43 +++++++++++++++++++++++++++++------------
1 file changed, 31 insertions(+), 12 deletions(-)
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 3be67c98c906..78fbce13a9e6 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1550,20 +1550,38 @@ static bool nvme_tcp_poll_queue(struct nvme_tcp_queue *queue)
static void nvme_tcp_set_queue_io_cpu(struct nvme_tcp_queue *queue)
{
struct nvme_tcp_ctrl *ctrl = queue->ctrl;
- int qid = nvme_tcp_queue_id(queue);
+ struct blk_mq_tag_set *set = &ctrl->tag_set;
+ int qid = nvme_tcp_queue_id(queue) - 1;
+ unsigned int *mq_map;
int n = 0;
- if (nvme_tcp_default_queue(queue))
- n = qid - 1;
- else if (nvme_tcp_read_queue(queue))
- n = qid - ctrl->io_queues[HCTX_TYPE_DEFAULT] - 1;
- else if (nvme_tcp_poll_queue(queue))
+ if (nvme_tcp_default_queue(queue)) {
+ mq_map = set->map[HCTX_TYPE_DEFAULT].mq_map;
+ n = qid;
+ } else if (nvme_tcp_read_queue(queue)) {
+ mq_map = set->map[HCTX_TYPE_READ].mq_map;
+ n = qid - ctrl->io_queues[HCTX_TYPE_DEFAULT];
+ } else if (nvme_tcp_poll_queue(queue)) {
+ mq_map = set->map[HCTX_TYPE_POLL].mq_map;
n = qid - ctrl->io_queues[HCTX_TYPE_DEFAULT] -
- ctrl->io_queues[HCTX_TYPE_READ] - 1;
+ ctrl->io_queues[HCTX_TYPE_READ];
+ }
if (wq_unbound)
queue->io_cpu = WORK_CPU_UNBOUND;
- else
- queue->io_cpu = cpumask_next_wrap(n - 1, cpu_online_mask, -1, false);
+ else {
+ int i;
+
+ if (WARN_ON(!mq_map))
+ return;
+ for_each_cpu(i, cpu_online_mask) {
+ if (mq_map[i] == qid) {
+ queue->io_cpu = i;
+ break;
+ }
+ }
+ dev_dbg(ctrl->ctrl.device, "queue %d: using cpu %d\n",
+ qid, queue->io_cpu);
+ }
}
static void nvme_tcp_tls_done(void *data, int status, key_serial_t pskid)
@@ -1704,7 +1722,7 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl, int qid,
queue->sock->sk->sk_allocation = GFP_ATOMIC;
queue->sock->sk->sk_use_task_frag = false;
- nvme_tcp_set_queue_io_cpu(queue);
+ queue->io_cpu = WORK_CPU_UNBOUND;
queue->request = NULL;
queue->data_remaining = 0;
queue->ddgst_remaining = 0;
@@ -1858,9 +1876,10 @@ static int nvme_tcp_start_queue(struct nvme_ctrl *nctrl, int idx)
nvme_tcp_init_recv_ctx(queue);
nvme_tcp_setup_sock_ops(queue);
- if (idx)
+ if (idx) {
+ nvme_tcp_set_queue_io_cpu(queue);
ret = nvmf_connect_io_queue(nctrl, idx);
- else
+ } else
ret = nvmf_connect_admin_queue(nctrl);
if (!ret) {
--
2.35.3
^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [PATCHv2] nvme-tcp: align I/O cpu with blk-mq mapping
2024-06-19 14:55 [PATCHv2] nvme-tcp: align I/O cpu with blk-mq mapping Hannes Reinecke
@ 2024-06-19 14:59 ` Christoph Hellwig
2024-06-19 15:23 ` Hannes Reinecke
0 siblings, 1 reply; 11+ messages in thread
From: Christoph Hellwig @ 2024-06-19 14:59 UTC (permalink / raw)
To: Hannes Reinecke
Cc: Christoph Hellwig, Keith Busch, Sagi Grimberg, linux-nvme,
Hannes Reinecke
> if (wq_unbound)
> queue->io_cpu = WORK_CPU_UNBOUND;
None of the above computed information is even used for the wq_unbound
code. This would make a lot more sense if the above assignment was in
the (only) caller.
Yes, that probably should have been done when merging the wq_unbound
option (which honestly looks so whacky that I wish it wasn't merged).
* Re: [PATCHv2] nvme-tcp: align I/O cpu with blk-mq mapping
2024-06-19 14:59 ` Christoph Hellwig
@ 2024-06-19 15:23 ` Hannes Reinecke
2024-06-19 15:43 ` Sagi Grimberg
0 siblings, 1 reply; 11+ messages in thread
From: Hannes Reinecke @ 2024-06-19 15:23 UTC (permalink / raw)
To: Christoph Hellwig, Hannes Reinecke; +Cc: Keith Busch, Sagi Grimberg, linux-nvme
On 6/19/24 16:59, Christoph Hellwig wrote:
>> if (wq_unbound)
>> queue->io_cpu = WORK_CPU_UNBOUND;
>
> None of the above computed information is even used for the wq_unbound
> code. This would make a lot more sense if the above assignment was in
> the (only) caller.
>
> Yes, that probably should have been done when merging the wq_unbound
> option (which honestly looks so whacky that I wish it wasn't merged).
>
Ah, you noticed?
This patch actually got sparked off by one of our partners reporting a
severe latency increase proportional to the number of controllers,
with a marked increase when (sw) TLS is active; they even managed to run
into command timeouts.
What happens is that we shove all work from identical queue numbers onto
the same CPU; so if your controller only exports 4 queues, _all_ work
from qid 1 (from all controllers!) is pushed onto CPU 0, causing a
massive oversubscription on that CPU and leaving all other CPUs in the
system idle.
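The collision can be sketched as a minimal userspace model (the function name is invented for illustration; this is not the kernel code): before this patch, the io_cpu was effectively derived from the queue id alone, so qid 1 of every controller lands on the same CPU:

```c
#include <assert.h>

/* Simplified model of the pre-patch io_cpu selection: the target CPU
 * depends only on the queue id, not on the controller, so identical
 * qids from all controllers collide on the same CPU. */
int old_pick_io_cpu(int qid, int nr_online_cpus)
{
	/* roughly what cpumask_next_wrap(n - 1, cpu_online_mask, -1, false)
	 * yields for a dense online cpumask, with n = qid - 1 */
	return (qid - 1) % nr_online_cpus;
}
```

With 4 exported queues and any number of controllers, every controller's qid 1 evaluates to CPU 0, which is the oversubscription described above.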
Not sure how wq_unbound helps in this case; in theory the workqueue
items can be pushed to arbitrary CPUs, but that only leads to even worse
thread bouncing.
However, topic for ALPSS. We really should have some sort of
backpressure here.
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Ivo Totev, Andrew McDonald,
Werner Knoblich
* Re: [PATCHv2] nvme-tcp: align I/O cpu with blk-mq mapping
2024-06-19 15:23 ` Hannes Reinecke
@ 2024-06-19 15:43 ` Sagi Grimberg
2024-06-19 15:49 ` Hannes Reinecke
0 siblings, 1 reply; 11+ messages in thread
From: Sagi Grimberg @ 2024-06-19 15:43 UTC (permalink / raw)
To: Hannes Reinecke, Christoph Hellwig, Hannes Reinecke
Cc: Keith Busch, linux-nvme
On 19/06/2024 18:23, Hannes Reinecke wrote:
> On 6/19/24 16:59, Christoph Hellwig wrote:
>>> if (wq_unbound)
>>> queue->io_cpu = WORK_CPU_UNBOUND;
>>
>> None of the above computed information is even used for the wq_unbound
>> code. This would make a lot more sense if the above assignment was in
>> the (only) caller.
>>
>> Yes, that probably should have been done when merging the wq_unbound
>> option (which honestly looks so whacky that I wish it wasn't merged).
>>
> Ah, you noticed?
>
> This patch actually got sparked off by one of our partners reporting a
> severe latency increase proportional to the number of controllers,
> with a marked increase when (sw) TLS is active; they even managed to run
> into command timeouts.
>
> What happens is that we shove all work from identical queue numbers
> onto the same CPU; so if your controller only exports 4 queues _all_
> work from qid 1 (from all controllers!) is pushed onto CPU 0 causing a
> massive oversubscription on that CPU and leaving all other CPUs in the
> system idle.
I don't see how you address the case of multiple controllers falling
into the same mappings in your patch.
You could have selected a different mq_map entry for each controller
(out of the entries that map to the qid).
>
> Not sure how wq_unbound helps in this case; in theory the workqueue
> items can be pushed on arbitrary CPUs, but that only leads to even worse
> thread bouncing.
>
> However, topic for ALPSS. We really should have some sort of
> backpressure here.
I have a patch that has been sitting for some time now, to make the RX
path run directly from softirq, which should make RX execute from the
CPU core mapped to the RSS hash.
Perhaps you or your customer can give it a go.
Disclaimer: lightly tested:
--
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index b6ea7e337eb8..02076b8cb4d8 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -44,6 +44,13 @@ static bool wq_unbound;
module_param(wq_unbound, bool, 0644);
MODULE_PARM_DESC(wq_unbound, "Use unbound workqueue for nvme-tcp IO
context (default false)");
+/*
+ * RX context runs in softirq
+ */
+static bool softirq_rx;
+module_param(softirq_rx, bool, 0644);
+MODULE_PARM_DESC(softirq_rx, "nvme-tcp RX context in softirq");
+
/*
* TLS handshake timeout
*/
@@ -976,8 +983,13 @@ static void nvme_tcp_data_ready(struct sock *sk)
read_lock_bh(&sk->sk_callback_lock);
queue = sk->sk_user_data;
if (likely(queue && queue->rd_enabled) &&
- !test_bit(NVME_TCP_Q_POLLING, &queue->flags))
- queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
+ !test_bit(NVME_TCP_Q_POLLING, &queue->flags)) {
+ if (softirq_rx)
+ nvme_tcp_try_recv_locked(queue);
+ else
+ queue_work_on(queue->io_cpu, nvme_tcp_wq,
+ &queue->io_work);
+ }
read_unlock_bh(&sk->sk_callback_lock);
}
@@ -1291,11 +1303,13 @@ static void nvme_tcp_io_work(struct work_struct *w)
break;
}
- result = nvme_tcp_try_recv(queue);
- if (result > 0)
- pending = true;
- else if (unlikely(result < 0))
- return;
+ if (!softirq_rx) {
+ result = nvme_tcp_try_recv(queue);
+ if (result > 0)
+ pending = true;
+ else if (unlikely(result < 0))
+ return;
+ }
if (!pending || !queue->rd_enabled)
return;
--
Note, I don't yet know if this behavior should go conditional behind a
parameter, or simply replace the existing behavior.
Note that it depends on a patch I've sent (and will need to send a v2 for):
[PATCH] net: micro-optimize skb_datagram_iter
https://lore.kernel.org/netdev/20240617095852.66c96be9@kernel.org/T/
* Re: [PATCHv2] nvme-tcp: align I/O cpu with blk-mq mapping
2024-06-19 15:43 ` Sagi Grimberg
@ 2024-06-19 15:49 ` Hannes Reinecke
2024-06-19 15:58 ` Sagi Grimberg
0 siblings, 1 reply; 11+ messages in thread
From: Hannes Reinecke @ 2024-06-19 15:49 UTC (permalink / raw)
To: Sagi Grimberg, Christoph Hellwig, Hannes Reinecke; +Cc: Keith Busch, linux-nvme
On 6/19/24 17:43, Sagi Grimberg wrote:
>
>
> On 19/06/2024 18:23, Hannes Reinecke wrote:
>> On 6/19/24 16:59, Christoph Hellwig wrote:
>>>> if (wq_unbound)
>>>> queue->io_cpu = WORK_CPU_UNBOUND;
>>>
>>> None of the above computed information is even used for the wq_unbound
>>> code. This would make a lot more sense if the above assignment was in
>>> the (only) caller.
>>>
>>> Yes, that probably should have been done when merging the wq_unbound
>>> option (which honestly looks so whacky that I wish it wasn't merged).
>>>
>> Ah, you noticed?
>>
>> This patch actually got sparked off by one of our partners reporting a
>> severe latency increase proportional to the number of controllers,
>> with a marked increase when (sw) TLS is active; they even managed to run
>> into command timeouts.
>>
>> What happens is that we shove all work from identical queue numbers
>> onto the same CPU; so if your controller only exports 4 queues _all_
>> work from qid 1 (from all controllers!) is pushed onto CPU 0 causing a
>> massive oversubscription on that CPU and leaving all other CPUs in the
>> system idle.
>
> I see how you address multiple controllers falling into the same
> mappings case in your patch.
> You could have selected a different mq_map entry for each controller
> (out of the entries that map to the qid).
>
Looked at it, but had no idea how to figure out the load.
The load is actually per-cpu, but we only have per-controller structures.
So we would need to introduce a per-cpu counter, tracking the
number of queues scheduled on that CPU.
But that won't help with the CPU oversubscription issue; we might still
have a substantially higher number of overall queues than we have CPUs...
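As a userspace sketch of that idea (all names are hypothetical, invented for illustration): a per-CPU load counter could break ties among the CPUs whose blk-mq map entry points at the qid, though, as noted, it cannot help once queues outnumber CPUs:

```c
#include <assert.h>
#include <limits.h>

#define MODEL_MAX_CPUS 256

/* Hypothetical sketch of the per-cpu counter idea: count how many
 * queues are already pinned to each CPU, and pick the least-loaded
 * CPU among those whose mq_map entry matches this qid. */
static int model_queue_count[MODEL_MAX_CPUS];

int pick_least_loaded_cpu(const unsigned int *mq_map, int nr_cpus, int qid)
{
	int best = -1, best_load = INT_MAX;

	for (int cpu = 0; cpu < nr_cpus && cpu < MODEL_MAX_CPUS; cpu++) {
		if (mq_map[cpu] != (unsigned int)qid)
			continue;
		if (model_queue_count[cpu] < best_load) {
			best_load = model_queue_count[cpu];
			best = cpu;
		}
	}
	if (best >= 0)
		model_queue_count[best]++;	/* account for the new queue */
	return best;
}
```

In the kernel this counter would have to live somewhere global to all controllers (hence the objection above: today the structures are per-controller).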
>>
>> Not sure how wq_unbound helps in this case; in theory the workqueue
>> items can be pushed on arbitrary CPUs, but that only leads to even worse
>> thread bouncing.
>>
>> However, topic for ALPSS. We really should have some sort of
>> backpressure here.
>
> I have a patch that was sitting for some time now, to make the RX path
> run directly
> from softirq, which should make RX execute from the cpu core mapped to
> the RSS hash.
> Perhaps you or your customer can give it a go.
>
No s**t. That is pretty much what I wanted to do.
I'll be sure to give it a go.
Thanks for that!
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Ivo Totev, Andrew McDonald,
Werner Knoblich
* Re: [PATCHv2] nvme-tcp: align I/O cpu with blk-mq mapping
2024-06-19 15:49 ` Hannes Reinecke
@ 2024-06-19 15:58 ` Sagi Grimberg
2024-06-24 10:02 ` Sagi Grimberg
0 siblings, 1 reply; 11+ messages in thread
From: Sagi Grimberg @ 2024-06-19 15:58 UTC (permalink / raw)
To: Hannes Reinecke, Christoph Hellwig, Hannes Reinecke
Cc: Keith Busch, linux-nvme
>> I see how you address multiple controllers falling into the same
>> mappings case in your patch.
>> You could have selected a different mq_map entry for each controller
>> (out of the entries that map to the qid).
>>
> Looked at it, but hadn't any idea how to figure out the load.
> The load is actually per-cpu, but we only have per controller structures.
> So we would need to introduce a per-cpu counter, detailing out the
> number of queues scheduled on that CPU.
> But that won't help with the CPU oversubscription issue; we still
> might have substantially higher number of overall queues than we have
> CPUs...
I think that it would still be better than what you have right now:
IIUC, right now you will have, for all controllers (based on your example):
queue 1: using cpu 6
queue 2: using cpu 9
queue 3: using cpu 18
But selecting a different mq_map entry can give:
ctrl1:
queue 1: using cpu 6
queue 2: using cpu 9
queue 3: using cpu 18
ctrl2:
queue 1: using cpu 7
queue 2: using cpu 10
queue 3: using cpu 19
ctrl3:
queue 1: using cpu 8
queue 2: using cpu 11
queue 3: using cpu 20
ctrl4:
queue 1: using cpu 54
queue 2: using cpu 57
queue 3: using cpu 66
and so on...
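The spreading described above can be sketched as an illustrative userspace model (names invented, not kernel code): collect the online CPUs whose mq_map entry points at the qid, then index into them with a per-controller offset:

```c
#include <assert.h>

#define MODEL_MAX_CPUS 256

/* Illustrative model of the suggestion: the i-th controller takes the
 * i-th CPU (wrapping) among the CPUs whose mq_map entry maps to qid. */
int pick_spread_cpu(const unsigned int *mq_map, int nr_cpus,
		    int qid, int ctrl_index)
{
	int candidates[MODEL_MAX_CPUS];
	int n = 0;

	for (int cpu = 0; cpu < nr_cpus && n < MODEL_MAX_CPUS; cpu++)
		if (mq_map[cpu] == (unsigned int)qid)
			candidates[n++] = cpu;
	if (n == 0)
		return -1;	/* no CPU maps to this qid */
	return candidates[ctrl_index % n];
}
```

With this, controllers 1..N fan the same qid out across all CPUs that legitimately map to it, instead of all piling onto the first one.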
>
>>>
>>> Not sure how wq_unbound helps in this case; in theory the workqueue
>>> items can be pushed on arbitrary CPUs, but that only leads to even
>>> worse
>>> thread bouncing.
>>>
>>> However, topic for ALPSS. We really should have some sort of
>>> backpressure here.
>>
>> I have a patch that was sitting for some time now, to make the RX
>> path run directly
>> from softirq, which should make RX execute from the cpu core mapped
>> to the RSS hash.
>> Perhaps you or your customer can give it a go.
>>
> No s**t. That is pretty much what I wanted to do.
> I'll be sure to give it a go.
> Thanks for that!
You will need another prep patch for it:
--
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 3649987c0a2d..b6ea7e337eb8 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -955,6 +955,18 @@ static int nvme_tcp_recv_skb(read_descriptor_t
*desc, struct sk_buff *skb,
return consumed;
}
+static int nvme_tcp_try_recv_locked(struct nvme_tcp_queue *queue)
+{
+ struct socket *sock = queue->sock;
+ struct sock *sk = sock->sk;
+ read_descriptor_t rd_desc;
+
+ rd_desc.arg.data = queue;
+ rd_desc.count = 1;
+ queue->nr_cqe = 0;
+ return sock->ops->read_sock(sk, &rd_desc, nvme_tcp_recv_skb);
+}
+
static void nvme_tcp_data_ready(struct sock *sk)
{
struct nvme_tcp_queue *queue;
@@ -1251,16 +1263,11 @@ static int nvme_tcp_try_send(struct
nvme_tcp_queue *queue)
static int nvme_tcp_try_recv(struct nvme_tcp_queue *queue)
{
- struct socket *sock = queue->sock;
- struct sock *sk = sock->sk;
- read_descriptor_t rd_desc;
+ struct sock *sk = queue->sock->sk;
int consumed;
- rd_desc.arg.data = queue;
- rd_desc.count = 1;
lock_sock(sk);
- queue->nr_cqe = 0;
- consumed = sock->ops->read_sock(sk, &rd_desc, nvme_tcp_recv_skb);
+ consumed = nvme_tcp_try_recv_locked(queue);
release_sock(sk);
return consumed;
}
--
* Re: [PATCHv2] nvme-tcp: align I/O cpu with blk-mq mapping
2024-06-19 15:58 ` Sagi Grimberg
@ 2024-06-24 10:02 ` Sagi Grimberg
2024-06-24 20:01 ` Kamaljit Singh
2024-06-25 6:05 ` Hannes Reinecke
0 siblings, 2 replies; 11+ messages in thread
From: Sagi Grimberg @ 2024-06-24 10:02 UTC (permalink / raw)
To: Hannes Reinecke, Christoph Hellwig, Hannes Reinecke
Cc: Keith Busch, linux-nvme
On 19/06/2024 18:58, Sagi Grimberg wrote:
>
>
>>> I see how you address multiple controllers falling into the same
>>> mappings case in your patch.
>>> You could have selected a different mq_map entry for each controller
>>> (out of the entries that map to the qid).
>>>
>> Looked at it, but hadn't any idea how to figure out the load.
>> The load is actually per-cpu, but we only have per controller
>> structures.
>> So we would need to introduce a per-cpu counter, detailing out the
>> number of queues scheduled on that CPU.
>> But that won't help with the CPU oversubscription issue; we still
>> might have substantially higher number of overall queues than we have
>> CPUs...
>
> I think that it would still be better than what you have right now:
>
> IIUC Right now you will have for all controllers (based on your example):
> queue 1: using cpu 6
> queue 2: using cpu 9
> queue 3: using cpu 18
>
> But selecting a different mq_map entry can give:
> ctrl1:
> queue 1: using cpu 6
> queue 2: using cpu 9
> queue 3: using cpu 18
>
> ctrl2:
> queue 1: using cpu 7
> queue 2: using cpu 10
> queue 3: using cpu 19
>
> ctrl3:
> queue 1: using cpu 8
> queue 2: using cpu 11
> queue 3: using cpu 20
>
> ctrl4:
> queue 1: using cpu 54
> queue 2: using cpu 57
> queue 3: using cpu 66
>
> and so on...
Hey Hannes,
Did you make progress with this one?
* Re: [PATCHv2] nvme-tcp: align I/O cpu with blk-mq mapping
2024-06-24 10:02 ` Sagi Grimberg
@ 2024-06-24 20:01 ` Kamaljit Singh
2024-06-25 6:49 ` Sagi Grimberg
2024-06-25 6:05 ` Hannes Reinecke
1 sibling, 1 reply; 11+ messages in thread
From: Kamaljit Singh @ 2024-06-24 20:01 UTC (permalink / raw)
To: Sagi Grimberg, Hannes Reinecke, hch, Hannes Reinecke
Cc: Keith Busch, linux-nvme@lists.infradead.org
On 19/06/2024 18:58, Sagi Grimberg wrote:
>>>> I see how you address multiple controllers falling into the same
>>>> mappings case in your patch.
>>>> You could have selected a different mq_map entry for each controller
>>>> (out of the entries that map to the qid).
>>>>
>>> Looked at it, but hadn't any idea how to figure out the load.
>>> The load is actually per-cpu, but we only have per controller
>>> structures.
>>> So we would need to introduce a per-cpu counter, detailing out the
>>> number of queues scheduled on that CPU.
>>> But that won't help with the CPU oversubscription issue; we still
>>> might have substantially higher number of overall queues than we have
>>> CPUs...
>>
>> I think that it would still be better than what you have right now:
>>
>> IIUC Right now you will have for all controllers (based on your example):
>> queue 1: using cpu 6
>> queue 2: using cpu 9
>> queue 3: using cpu 18
>>
>> But selecting a different mq_map entry can give:
>> ctrl1:
>> queue 1: using cpu 6
>> queue 2: using cpu 9
>> queue 3: using cpu 18
>>
>> ctrl2:
>> queue 1: using cpu 7
>> queue 2: using cpu 10
>> queue 3: using cpu 19
>>
>> ctrl3:
>> queue 1: using cpu 8
>> queue 2: using cpu 11
>> queue 3: using cpu 20
>>
>> ctrl4:
>> queue 1: using cpu 54
>> queue 2: using cpu 57
>> queue 3: using cpu 66
>>
>> and so on...
>
>Hey Hannes,
>
>Did you make progress with this one?
Sagi,
Looks like this change should also address the other NVMe/TCP issue that I had
reported for WQ_UNBOUND, agree? If so, would this be merged into the
6.8.xx stable branch as well?
* Re: [PATCHv2] nvme-tcp: align I/O cpu with blk-mq mapping
2024-06-24 10:02 ` Sagi Grimberg
2024-06-24 20:01 ` Kamaljit Singh
@ 2024-06-25 6:05 ` Hannes Reinecke
2024-06-25 6:51 ` Sagi Grimberg
1 sibling, 1 reply; 11+ messages in thread
From: Hannes Reinecke @ 2024-06-25 6:05 UTC (permalink / raw)
To: Sagi Grimberg, Christoph Hellwig, Hannes Reinecke; +Cc: Keith Busch, linux-nvme
On 6/24/24 12:02, Sagi Grimberg wrote:
>
>
> On 19/06/2024 18:58, Sagi Grimberg wrote:
>>
>>
>>>> I see how you address multiple controllers falling into the same
>>>> mappings case in your patch.
>>>> You could have selected a different mq_map entry for each controller
>>>> (out of the entries that map to the qid).
>>>>
>>> Looked at it, but hadn't any idea how to figure out the load.
>>> The load is actually per-cpu, but we only have per controller
>>> structures.
>>> So we would need to introduce a per-cpu counter, detailing out the
>>> number of queues scheduled on that CPU.
>>> But that won't help with the CPU oversubscription issue; we still
>>> might have substantially higher number of overall queues than we have
>>> CPUs...
>>
>> I think that it would still be better than what you have right now:
>>
>> IIUC Right now you will have for all controllers (based on your example):
>> queue 1: using cpu 6
>> queue 2: using cpu 9
>> queue 3: using cpu 18
>>
>> But selecting a different mq_map entry can give:
>> ctrl1:
>> queue 1: using cpu 6
>> queue 2: using cpu 9
>> queue 3: using cpu 18
>>
>> ctrl2:
>> queue 1: using cpu 7
>> queue 2: using cpu 10
>> queue 3: using cpu 19
>>
>> ctrl3:
>> queue 1: using cpu 8
>> queue 2: using cpu 11
>> queue 3: using cpu 20
>>
>> ctrl4:
>> queue 1: using cpu 54
>> queue 2: using cpu 57
>> queue 3: using cpu 66
>>
>> and so on...
>
> Hey Hannes,
>
> Did you make progress with this one?
Yeah, just trying to get some performance numbers.
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
* Re: [PATCHv2] nvme-tcp: align I/O cpu with blk-mq mapping
2024-06-24 20:01 ` Kamaljit Singh
@ 2024-06-25 6:49 ` Sagi Grimberg
0 siblings, 0 replies; 11+ messages in thread
From: Sagi Grimberg @ 2024-06-25 6:49 UTC (permalink / raw)
To: Kamaljit Singh, Hannes Reinecke, hch, Hannes Reinecke
Cc: Keith Busch, linux-nvme@lists.infradead.org
On 24/06/2024 23:01, Kamaljit Singh wrote:
> On 19/06/2024 18:58, Sagi Grimberg wrote:
>>>>> I see how you address multiple controllers falling into the same
>>>>> mappings case in your patch.
>>>>> You could have selected a different mq_map entry for each controller
>>>>> (out of the entries that map to the qid).
>>>>>
>>>> Looked at it, but hadn't any idea how to figure out the load.
>>>> The load is actually per-cpu, but we only have per controller
>>>> structures.
>>>> So we would need to introduce a per-cpu counter, detailing out the
>>>> number of queues scheduled on that CPU.
>>>> But that won't help with the CPU oversubscription issue; we still
>>>> might have substantially higher number of overall queues than we have
>>>> CPUs...
>>> I think that it would still be better than what you have right now:
>>>
>>> IIUC Right now you will have for all controllers (based on your example):
>>> queue 1: using cpu 6
>>> queue 2: using cpu 9
>>> queue 3: using cpu 18
>>>
>>> But selecting a different mq_map entry can give:
>>> ctrl1:
>>> queue 1: using cpu 6
>>> queue 2: using cpu 9
>>> queue 3: using cpu 18
>>>
>>> ctrl2:
>>> queue 1: using cpu 7
>>> queue 2: using cpu 10
>>> queue 3: using cpu 19
>>>
>>> ctrl3:
>>> queue 1: using cpu 8
>>> queue 2: using cpu 11
>>> queue 3: using cpu 20
>>>
>>> ctrl4:
>>> queue 1: using cpu 54
>>> queue 2: using cpu 57
>>> queue 3: using cpu 66
>>>
>>> and so on...
>> Hey Hannes,
>>
>> Did you make progress with this one?
> Sagi,
> Looks like this change should also address the other NVMe/TCP issue that I had
> reported for WQ_UNBOUND, agree?
Well, this reduces load from the nvme-tcp io_work, so it makes sense
that it is shorter now.
However it's not clear to me that we are not spending these 10ms in
softirq in your test case.
Is it possible to measure it, with added debug prints or trace events?
> If so, would this also be merged into
> 6.8.xx branch as well?
I don't know if this will be considered a bug fix; the complaint is
really advisory, and does not affect the stability of the driver,
AFAIU.
* Re: [PATCHv2] nvme-tcp: align I/O cpu with blk-mq mapping
2024-06-25 6:05 ` Hannes Reinecke
@ 2024-06-25 6:51 ` Sagi Grimberg
0 siblings, 0 replies; 11+ messages in thread
From: Sagi Grimberg @ 2024-06-25 6:51 UTC (permalink / raw)
To: Hannes Reinecke, Christoph Hellwig, Hannes Reinecke
Cc: Keith Busch, linux-nvme
On 25/06/2024 9:05, Hannes Reinecke wrote:
> On 6/24/24 12:02, Sagi Grimberg wrote:
>>
>>
>> On 19/06/2024 18:58, Sagi Grimberg wrote:
>>>
>>>
>>>>> I see how you address multiple controllers falling into the same
>>>>> mappings case in your patch.
>>>>> You could have selected a different mq_map entry for each
>>>>> controller (out of the entries that map to the qid).
>>>>>
>>>> Looked at it, but hadn't any idea how to figure out the load.
>>>> The load is actually per-cpu, but we only have per controller
>>>> structures.
>>>> So we would need to introduce a per-cpu counter, detailing out the
>>>> number of queues scheduled on that CPU.
>>>> But that won't help with the CPU oversubscription issue; we still
>>>> might have substantially higher number of overall queues than we
>>>> have CPUs...
>>>
>>> I think that it would still be better than what you have right now:
>>>
>>> IIUC Right now you will have for all controllers (based on your
>>> example):
>>> queue 1: using cpu 6
>>> queue 2: using cpu 9
>>> queue 3: using cpu 18
>>>
>>> But selecting a different mq_map entry can give:
>>> ctrl1:
>>> queue 1: using cpu 6
>>> queue 2: using cpu 9
>>> queue 3: using cpu 18
>>>
>>> ctrl2:
>>> queue 1: using cpu 7
>>> queue 2: using cpu 10
>>> queue 3: using cpu 19
>>>
>>> ctrl3:
>>> queue 1: using cpu 8
>>> queue 2: using cpu 11
>>> queue 3: using cpu 20
>>>
>>> ctrl4:
>>> queue 1: using cpu 54
>>> queue 2: using cpu 57
>>> queue 3: using cpu 66
>>>
>>> and so on...
>>
>> Hey Hannes,
>>
>> Did you make progress with this one?
>
> Yeah, just trying to get some performance numbers.
Nice.
I understand that you took my comment above about spreading the
queue->io_cpu assignments to different cpus from the mq_map for
different controllers?
Plus, it would be good to measure without the softirq_rx patch, to see
the performance implications of this patch alone.
Aside from that, I'll also be interested to see the performance
implications of the softirq_rx change.
>
> Cheers,
>
> Hannes
end of thread, other threads:[~2024-06-25 6:52 UTC | newest]
Thread overview: 11+ messages
2024-06-19 14:55 [PATCHv2] nvme-tcp: align I/O cpu with blk-mq mapping Hannes Reinecke
2024-06-19 14:59 ` Christoph Hellwig
2024-06-19 15:23 ` Hannes Reinecke
2024-06-19 15:43 ` Sagi Grimberg
2024-06-19 15:49 ` Hannes Reinecke
2024-06-19 15:58 ` Sagi Grimberg
2024-06-24 10:02 ` Sagi Grimberg
2024-06-24 20:01 ` Kamaljit Singh
2024-06-25 6:49 ` Sagi Grimberg
2024-06-25 6:05 ` Hannes Reinecke
2024-06-25 6:51 ` Sagi Grimberg