public inbox for linux-kernel@vger.kernel.org
From: Pratyush Yadav <ptyadav@amazon.de>
To: Keith Busch <kbusch@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>, Sagi Grimberg <sagi@grimberg.me>,
	"Jens Axboe" <axboe@kernel.dk>, <linux-nvme@lists.infradead.org>,
	<linux-kernel@vger.kernel.org>
Subject: Re: [PATCH] nvme-pci: do not set the NUMA node of device if it has none
Date: Tue, 8 Aug 2023 17:51:01 +0200	[thread overview]
Message-ID: <mafs0y1il5y8q.fsf_-_@amazon.de> (raw)
In-Reply-To: <ZM0XEUw0xKgzXvi+@kbusch-mbp> (Keith Busch's message of "Fri, 4 Aug 2023 09:19:45 -0600")

On Fri, Aug 04 2023, Keith Busch wrote:

> On Fri, Aug 04, 2023 at 04:50:16PM +0200, Pratyush Yadav wrote:
>> With this patch, I get the below affinities:
>
> Something still seems off without effective_affinity set. That attribute
> should always reflect one CPU from the smp_affinity_list.
>
> At least with your patch, the smp_affinity_list looks as expected: every
> CPU is accounted for, and no vector appears to share the resource among
> CPUs in different nodes.
>
>>     $   for i in $(cat /proc/interrupts | grep nvme0 | sed "s/^ *//g" | cut -d":" -f 1); do \
>>     >     cat /proc/irq/$i/{smp,effective}_affinity_list; \
>>     >   done
>>     8
>>     8
>>     16-17,48,65,67,69
>>
>>     18-19,50,71,73,75
>>
>>     20,52,77,79
>>
>>     21,53,81,83
>>
>>     22,54,85,87
>>
>>     23,55,89,91
>>
>>     24,56,93,95
>>
>>     25,57,97,99
>>
>>     26,58,101,103
>>
>>     27,59,105,107
>>
>>     28,60,109,111
>>
>>     29,61,113,115
>>
>>     30,62,117,119
>>
>>     31,63,121,123
>>
>>     49,51,125,127
>>
>>     0,32,64,66
>>
>>     1,33,68,70
>>
>>     2,34,72,74
>>
>>     3,35,76,78
>>
>>     4,36,80,82
>>
>>     5,37,84,86
>>
>>     6,38,88,90
>>
>>     7,39,92,94
>>
>>     8,40,96,98
>>
>>     9,41,100,102
>>
>>     10,42,104,106
>>
>>     11,43,108,110
>>
>>     12,44,112,114
>>
>>     13,45,116,118
>>
>>     14,46,120,122
>>
>>     15,47,124,126
>>
>> The blank lines are because effective_affinity_list is blank for all but the first interrupt.
>>
>> The problem is, even with this I still get the same performance
>> difference when running on Node 1 vs Node 0. I am not sure why. Any
>> pointers?
>
> I suspect effective_affinity isn't getting set and interrupts are
> triggering on unexpected CPUs. If you check /proc/interrupts, can you
> confirm if the interrupts are firing on CPUs within the
> smp_affinity_list or some other CPU?

All interrupts are hitting CPU0.

I did some more digging. Xen sets the effective affinity via
set_affinity_irq(), but only when info->evtchn is valid, and
info->evtchn is only set by startup_pirq(), which is called _after_
set_affinity_irq().

This worked before because irqbalance came along after all this had
happened and set the affinity, causing set_affinity_irq() to be called
again. By that time startup_pirq() had run and info->evtchn was set.

This whole thing needs some refactoring to work properly. I wrote up a
hacky patch (on top of my previous hacky patch) that fixes it. I will
write up a proper patch when I find some more time next week.

With this, I am now seeing good performance on node 1. It is still
slightly slower than node 0, but that might be due to some other
reasons and I do not want to dive into that for now.

Thanks for your help with this :-)

BTW, do you think I should re-send the patch with updated reasoning,
since Christoph thinks we should do this regardless of the performance
reasons [0]?

[0] https://lore.kernel.org/linux-nvme/20230726131408.GA15909@lst.de/

----- 8< -----
diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
index c7715f8bd4522..ed33d33a2f1c9 100644
--- a/drivers/xen/events/events_base.c
+++ b/drivers/xen/events/events_base.c
@@ -108,6 +108,7 @@ struct irq_info {
 	unsigned irq;
 	evtchn_port_t evtchn;   /* event channel */
 	unsigned short cpu;     /* cpu bound */
+	unsigned short suggested_cpu;
 	unsigned short eoi_cpu; /* EOI must happen on this cpu-1 */
 	unsigned int irq_epoch; /* If eoi_cpu valid: irq_epoch of event */
 	u64 eoi_time;           /* Time in jiffies when to EOI. */
@@ -871,6 +873,8 @@ static void mask_ack_pirq(struct irq_data *data)
 	eoi_pirq(data);
 }
 
+static int xen_rebind_evtchn_to_cpu(struct irq_info *info, unsigned int tcpu);
+
 static unsigned int __startup_pirq(unsigned int irq)
 {
 	struct evtchn_bind_pirq bind_pirq;
@@ -903,6 +907,14 @@ static unsigned int __startup_pirq(unsigned int irq)
 	info->evtchn = evtchn;
 	bind_evtchn_to_cpu(evtchn, 0, false);
 
+	if (info->suggested_cpu) {
+		int ret;
+		ret = xen_rebind_evtchn_to_cpu(info, info->suggested_cpu);
+		if (!ret)
+			irq_data_update_effective_affinity(irq_get_irq_data(irq),
+							cpumask_of(info->suggested_cpu));
+	}
+
 	rc = xen_evtchn_port_setup(evtchn);
 	if (rc)
 		goto err;
@@ -1856,12 +1872,17 @@ static unsigned int select_target_cpu(const struct cpumask *dest)
 static int set_affinity_irq(struct irq_data *data, const struct cpumask *dest,
 			    bool force)
 {
+	struct irq_info *info = info_for_irq(data->irq);
 	unsigned int tcpu = select_target_cpu(dest);
 	int ret;
 
-	ret = xen_rebind_evtchn_to_cpu(info_for_irq(data->irq), tcpu);
-	if (!ret)
+	ret = xen_rebind_evtchn_to_cpu(info, tcpu);
+	if (!ret) {
 		irq_data_update_effective_affinity(data, cpumask_of(tcpu));
+	} else {
+		if (info)
+			info->suggested_cpu = tcpu;
+	}
 
 	return ret;
 }


-- 
Regards,
Pratyush Yadav



Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879




Thread overview: 16+ messages
2023-07-25 11:06 [PATCH] nvme-pci: do not set the NUMA node of device if it has none Pratyush Yadav
2023-07-25 14:35 ` Keith Busch
2023-07-26  7:58   ` Sagi Grimberg
2023-07-26 13:14     ` Christoph Hellwig
2023-07-26 15:30       ` Pratyush Yadav
2023-07-26 16:17         ` Keith Busch
2023-07-26 19:32           ` Pratyush Yadav
2023-07-26 22:25             ` Keith Busch
2023-07-28 18:09               ` Pratyush Yadav
2023-07-28 19:34                 ` Keith Busch
2023-08-04 14:50                   ` Pratyush Yadav
2023-08-04 15:19                     ` Keith Busch
2023-08-08 15:51                       ` Pratyush Yadav [this message]
2023-08-08 16:35                         ` Keith Busch
2024-07-23  9:49 ` Maurizio Lombardi
2024-07-23 14:39   ` Christoph Hellwig
