From: Pratyush Yadav <ptyadav@amazon.de>
To: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>, Keith Busch <kbusch@kernel.org>,
	Jens Axboe <axboe@kernel.dk>, <linux-nvme@lists.infradead.org>,
	<linux-kernel@vger.kernel.org>
Subject: Re: [PATCH] nvme-pci: do not set the NUMA node of device if it has none
Date: Wed, 26 Jul 2023 17:30:33 +0200
Message-ID: <mafs0cz0e8zc6.fsf_-_@amazon.de>
In-Reply-To: <20230726131408.GA15909@lst.de> (Christoph Hellwig's message of "Wed, 26 Jul 2023 15:14:08 +0200")

On Wed, Jul 26 2023, Christoph Hellwig wrote:

Hi all,

> On Wed, Jul 26, 2023 at 10:58:36AM +0300, Sagi Grimberg wrote:
>>>> For example, AWS EC2's i3.16xlarge instance does not expose NUMA
>>>> information for the NVMe devices. This means all NVMe devices have
>>>> NUMA_NO_NODE by default. Without this patch, random 4k read performance
>>>> measured via fio on CPUs from node 1 (around 165k IOPS) is almost 50%
>>>> less than CPUs from node 0 (around 315k IOPS). With this patch, CPUs on
>>>> both nodes get similar performance (around 315k IOPS).
>>>
>>> irqbalance doesn't work with this driver though: the interrupts are
>>> managed by the kernel. Is there some other reason to explain the perf
>>> difference?

Hmm, I did not know that. I have not looked at the code yet, but I
think the same reasoning should hold, just with s/irqbalance/kernel/. If
the kernel IRQ balancer sees the device is on node 0, it would deliver
its interrupts to CPUs on node 0.

In my tests I can see that, without this patch, the interrupts for the
NVMe queues are sent only to CPUs from node 0. With this patch, CPUs
from both nodes get the interrupts.
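
One way to put numbers on the interrupt distribution described above is
to sum the per-CPU counts of the NVMe queue vectors in /proc/interrupts
by NUMA node. What follows is only a minimal sketch, not the method used
for the tests in this thread. It assumes the queue vectors are named
nvme<ctrl>q<queue> in the last column and that the CPU-to-node mapping
comes from /sys/devices/system/node/node<N>/cpulist. The function names
are hypothetical.

```python
import re
from collections import defaultdict


def parse_cpulist(text):
    """Expand a sysfs cpulist such as '0-3,8-11' into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus


def nvme_irqs_per_node(interrupts_text, node_cpus):
    """Sum interrupt counts of nvme queue vectors per NUMA node.

    interrupts_text: contents of /proc/interrupts
    node_cpus: dict mapping node id -> set of CPU ids on that node
    """
    lines = interrupts_text.splitlines()
    # The header row lists the online CPUs as CPU0, CPU1, ...
    cpu_ids = [int(name[3:]) for name in lines[0].split()]
    cpu_to_node = {cpu: node
                   for node, cpus in node_cpus.items()
                   for cpu in cpus}

    totals = defaultdict(int)
    for line in lines[1:]:
        fields = line.split()
        # Assumption: queue vectors are named nvme<ctrl>q<queue>.
        if not fields or not re.fullmatch(r"nvme\d+q\d+", fields[-1]):
            continue
        # Fields 1..N hold one counter per CPU, in header order.
        counts = fields[1:1 + len(cpu_ids)]
        for cpu, count in zip(cpu_ids, counts):
            # CPUs missing from node_cpus are accumulated under None.
            totals[cpu_to_node.get(cpu)] += int(count)
    return dict(totals)
```

Taking a snapshot before and after a fio run and subtracting the two
results gives the interrupts delivered to each node during the run.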

>>
>> Maybe its because the numa_node goes to the tagset which allocates
>> stuff based on that numa-node ?
>
> Yeah, the only explanation I could come up with is that without this
> the allocations gets spread, and that somehow helps.  All of this
> is a little obscure, but so is the NVMe practice of setting the node id
> to first_memory_node, which no other driver does.  I'd really like to
> understand what's going on here first.  After that this patch probably
> is the right thing, I'd just like to understand why.

See above for my conjecture on why this happens.

More specifically, I discovered this when running an application pinned
to a node 1 CPU and reading from an NVMe device. I noticed it performed
worse than when it was pinned to node 0.

If the process is free to move around it might not see such a large
performance difference since it could move to a node 0 CPU. But if it is
pinned to a CPU in node 1 then the interrupt will always hit a node 0
CPU and create higher latency for the reads.

I have a simple fio test that can reproduce this. Save this [1] as
fio.txt and then run numactl --cpunodebind 1 fio ./fio.txt. You can run
it on any host with an NVMe device that has no NUMA node. I have tested
this on AWS EC2's i3.16xlarge instance type.

[1]
    [global]
    ioengine=libaio
    filename=/dev/nvme0n1
    group_reporting=1
    direct=1
    verify=0
    norandommap=0
    size=10%
    time_based=1
    runtime=30
    ramp_time=0
    randrepeat=0
    log_max_value=1
    unified_rw_reporting=1
    percentile_list=50:99:99.9:99.99:99.999
    bwavgtime=10000

    [4k_randread_qd16_4w]
    stonewall
    bs=4k
    rw=randread
    iodepth=32
    numjobs=1
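
To check whether a host qualifies for the test above, the node assigned
to each NVMe controller can be read from sysfs. This is a minimal
sketch, assuming PCI-attached controllers that expose
/sys/class/nvme/<ctrl>/device/numa_node, where -1 corresponds to
NUMA_NO_NODE. The function names are hypothetical. One caveat, as far as
I can tell: on a kernel without this patch the driver has already set
the node to first_memory_node by the time the controller shows up, so
-1 would only be visible with the patch applied.

```python
import os

NUMA_NO_NODE = -1  # value sysfs reports for a device with no NUMA node


def nvme_numa_nodes(sysfs_root="/sys/class/nvme"):
    """Return {controller name: NUMA node} from <ctrl>/device/numa_node."""
    nodes = {}
    if not os.path.isdir(sysfs_root):
        return nodes
    for ctrl in sorted(os.listdir(sysfs_root)):
        path = os.path.join(sysfs_root, ctrl, "device", "numa_node")
        try:
            with open(path) as f:
                nodes[ctrl] = int(f.read().strip())
        except (OSError, ValueError):
            # Controllers without this attribute are skipped.
            continue
    return nodes


def controllers_without_node(nodes):
    """Names of controllers whose node is NUMA_NO_NODE."""
    return [name for name, node in nodes.items() if node == NUMA_NO_NODE]
```

Calling controllers_without_node(nvme_numa_nodes()) lists the
controllers for which the reproduction is expected to apply.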

-- 
Regards,
Pratyush Yadav







