public inbox for linux-kernel@vger.kernel.org
From: Pratyush Yadav <ptyadav@amazon.de>
To: Keith Busch <kbusch@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>, Sagi Grimberg <sagi@grimberg.me>,
	"Jens Axboe" <axboe@kernel.dk>, <linux-nvme@lists.infradead.org>,
	<linux-kernel@vger.kernel.org>
Subject: Re: [PATCH] nvme-pci: do not set the NUMA node of device if it has none
Date: Fri, 28 Jul 2023 20:09:32 +0200	[thread overview]
Message-ID: <mafs0tttn7vs3.fsf_-_@amazon.de> (raw)
In-Reply-To: <ZMGddjINDt10BSvf@kbusch-mbp.dhcp.thefacebook.com> (Keith Busch's message of "Wed, 26 Jul 2023 16:25:58 -0600")


Hi,

On Wed, Jul 26 2023, Keith Busch wrote:

> On Wed, Jul 26, 2023 at 09:32:33PM +0200, Pratyush Yadav wrote:
>> On Wed, Jul 26 2023, Keith Busch wrote:
>> > Could you send the output of:
>> >
>> >   numactl --hardware
>>
>> $ numactl --hardware
>> available: 2 nodes (0-1)
>> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
>> node 0 size: 245847 MB
>> node 0 free: 245211 MB
>> node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
>> node 1 size: 245932 MB
>> node 1 free: 245328 MB
>> node distances:
>> node   0   1
>>   0:  10  21
>>   1:  21  10
>>
>> >
>> > and then with and without your patch:
>> >
>> >   for i in $(cat /proc/interrupts | grep nvme0 | sed "s/^ *//g" | cut -d":" -f 1); do \
>> >     cat /proc/irq/$i/{smp,effective}_affinity_list; \
>> >   done
>>
>> Without my patch:
>>
>>     $   for i in $(cat /proc/interrupts | grep nvme0 | sed "s/^ *//g" | cut -d":" -f 1); do \
>>     >     cat /proc/irq/$i/{smp,effective}_affinity_list; \
>>     >   done
>
> Hm, I wonder if there's something wrong with my script. All the cpu's
> should be accounted for in the smp_affinity_list, assuming it captured
> all the vectors of the nvme device, but both examples are missing half
> the CPUs. It looks like you have 32 vectors. Does that sound right?

Yes, there are 32 vectors, from nvme0q0 to nvme0q31. Should there be one
vector for each CPU? Perhaps the device does not support that many
queues?

FWIW,

    $ sudo nvme get-feature /dev/nvme0n1 -f 7 -H
    get-feature:0x7 (Number of Queues), Current value:0x1e001e
            Number of IO Completion Queues Allocated (NCQA): 31
            Number of IO Submission Queues Allocated (NSQA): 31
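That matches 31 I/O queues plus the admin queue. As a sketch, the number
of hardware queues blk-mq actually created could also be cross-checked
from sysfs (the /sys/block/<dev>/mq path is assumed from the usual
blk-mq layout):

```shell
# count_hw_queues DIR -- count the per-queue subdirectories under a
# blk-mq "mq" sysfs directory, e.g. /sys/block/nvme0n1/mq on a typical
# system (path assumed, not verified on this machine).
count_hw_queues() {
    ls -d "$1"/*/ 2>/dev/null | wc -l
}

# Usage on a live system would be something like:
#   count_hw_queues /sys/block/nvme0n1/mq
```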

>
> This does show the effective affinity is indeed always on node 0 without
> your patch. I don't see why, though: the "group_cpus_evenly()" function
> that spreads the interrupts doesn't know anything about the device the
> resource is being grouped for, so it shouldn't even take its NUMA node
> into consideration. It's just supposed to ensure all CPUs have a shared
> resource, preferring to not share across numa nodes.

I am guessing you are looking at irq_create_affinity_masks(). Yeah, it
does not take into account the NUMA information. In fact, even if it
did, the NUMA node associated with the IRQ is NUMA_NO_NODE
(/proc/$irq/node == -1).
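For completeness, the per-vector node can be read with the same
extraction your script used (a sketch; the /proc/interrupts column
layout with "NN: ... nvme0qX" is assumed):

```shell
# irq_numbers FILE PATTERN -- pull the IRQ numbers for a device out of a
# /proc/interrupts-style listing: strip leading spaces, take the field
# before the first colon.
irq_numbers() {
    grep "$2" "$1" | sed 's/^ *//' | cut -d: -f1
}

# On the live system this would print each vector's NUMA node, e.g.:
#   for i in $(irq_numbers /proc/interrupts nvme0); do
#       cat /proc/irq/$i/node
#   done
```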

I did some more digging over the week to figure out what is going on. It
seems like the kernel _does_ in fact allow all CPUs in the affinity
mask. I
added some prints in set_affinity_irq() in
drivers/xen/events/events_base.c (since that is the irqchip for the
interrupt). I see it being called with mask: ffffffff,ffffffff.

But I later see the function being called again with a different mask:
00000000,00008000. The stack trace shows the call is coming from
ksys_write(). The process doing the write is irqbalance.
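A mask like 00000000,00008000 can be decoded with a small helper (a
sketch using bash arithmetic; /proc cpumasks list the most significant
32-bit word first):

```shell
# mask_to_cpus MASK -- decode a comma-separated hex cpumask, as written
# to /proc/irq/*/smp_affinity, into a space-separated list of CPU
# numbers.
mask_to_cpus() {
    local word bit cpu=0 out=""
    # Reverse the words so that CPU 0 is bit 0 of the first word
    # processed.
    for word in $(echo "$1" | tr ',' '\n' | tac); do
        word=$((16#$word))
        for bit in $(seq 0 31); do
            if (( (word >> bit) & 1 )); then
                out="$out $cpu"
            fi
            cpu=$((cpu + 1))
        done
    done
    echo "${out# }"
}
```

Decoding 00000000,00008000 this way gives CPU 15, which sits on node 0
per the numactl output above, consistent with the effective affinity
always landing on node 0.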

So I think your earlier statement was incorrect. irqbalance does in fact
balance these interrupts and it probably looks at the NUMA information
of the device to make that decision. My original reasoning holds and
irqbalance is the one picking the affinity.

With this explanation, do you think the patch is good to go?

BTW, could you please also add the below when applying? I forgot to add
it when sending the patch.

  Fixes: a4aea5623d4a5 ("NVMe: Convert to blk-mq")

>
> I'll emulate a similar CPU topology with similar nvme vector count and
> see if I can find anything suspicious. I'm a little concerned we may
> have the same problem for devices that have an associated NUMA node that
> your patch isn't addressing.
>
[...]

--
Regards,
Pratyush Yadav



Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879





Thread overview: 16+ messages
2023-07-25 11:06 [PATCH] nvme-pci: do not set the NUMA node of device if it has none Pratyush Yadav
2023-07-25 14:35 ` Keith Busch
2023-07-26  7:58   ` Sagi Grimberg
2023-07-26 13:14     ` Christoph Hellwig
2023-07-26 15:30       ` Pratyush Yadav
2023-07-26 16:17         ` Keith Busch
2023-07-26 19:32           ` Pratyush Yadav
2023-07-26 22:25             ` Keith Busch
2023-07-28 18:09               ` Pratyush Yadav [this message]
2023-07-28 19:34                 ` Keith Busch
2023-08-04 14:50                   ` Pratyush Yadav
2023-08-04 15:19                     ` Keith Busch
2023-08-08 15:51                       ` Pratyush Yadav
2023-08-08 16:35                         ` Keith Busch
2024-07-23  9:49 ` Maurizio Lombardi
2024-07-23 14:39   ` Christoph Hellwig
