From: Damien Le Moal <damien.lemoal@opensource.wdc.com>
To: Alexander Shumakovitch <shurik@jhu.edu>,
"linux-nvme@lists.infradead.org" <linux-nvme@lists.infradead.org>
Subject: Re: Read speed for a PCIe NVMe SSD is ridiculously slow on a multi-socket machine.
Date: Fri, 24 Mar 2023 17:43:42 +0900 [thread overview]
Message-ID: <e2df2f18-aaf9-89d5-6fed-aa1fb663f69c@opensource.wdc.com> (raw)
In-Reply-To: <ZB1JgJ2DxyTMVUHB@hornet>
On 3/24/23 15:56, Alexander Shumakovitch wrote:
> [ please copy me on your replies since I'm not subscribed to this list ]
>
> Hello all,
>
> I have an oldish quad socket server (Stratos S400-X44E by Quanta, 512GB RAM,
> 4 x Xeon E5-4620) that I'm trying to upgrade with an NVMe Samsung 970 EVO
> Plus SSD, connected via an adapter card to a PCIe slot, which is wired to
> CPU #0 directly and supports PCIe 3.0 speeds. For some reason, the reading
> speed from this SSD differs by a factor of 10 (ten!), depending on which
> physical CPU hdparm or dd is run on:
>
> # hdparm -t /dev/nvme0n1
It is very unusual to use hdparm, a tool designed mainly for ATA devices, to
benchmark an NVMe device. At the very least, if you really want to measure the
drive's performance, you should add the --direct option (see man hdparm).
A better way to test is to use fio with the io_uring or libaio I/O engine,
running multiple jobs at a high queue depth with --direct=1. That will give you
the maximum performance of your device. Then remove the --direct=1 option to do
buffered I/O, which will expose potential issues with your system's memory
bandwidth.
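Something like the following untested sketch (the device path is an example,
the commands need root, and the parameters are just reasonable starting
points to adjust for your system):

```shell
# Target device -- an assumption; point this at your NVMe namespace.
TARGET=${TARGET:-/dev/nvme0n1}

# hdparm with --direct bypasses the page cache (O_DIRECT reads):
hdparm --direct -t "$TARGET" || echo "hdparm failed; adjust TARGET"

# fio: 4 jobs doing sequential 128k reads at queue depth 32 with
# O_DIRECT, i.e. measuring the drive rather than the page cache.
# --readonly is a safety guard against accidental writes.
fio --name=qd32-read --filename="$TARGET" --readonly \
    --rw=read --ioengine=io_uring --direct=1 \
    --bs=128k --iodepth=32 --numjobs=4 \
    --runtime=30 --time_based --group_reporting \
    || echo "fio failed; adjust TARGET or install fio"

# Drop --direct=1 to repeat the run with buffered reads, which also
# exercises system memory bandwidth (and any NUMA effects).
```

Combining the fio run with taskset, as you did with hdparm, will let you
compare per-socket results with the page cache in and out of the picture.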
>
> /dev/nvme0n1:
> Timing buffered disk reads: 510 MB in 3.01 seconds = 169.28 MB/sec
>
> # taskset -c 0-7 hdparm -t /dev/nvme0n1
>
> /dev/nvme0n1:
> Timing buffered disk reads: 5252 MB in 3.00 seconds = 1750.28 MB/sec
>
> # taskset -c 8-15 hdparm -t /dev/nvme0n1
>
> /dev/nvme0n1:
> Timing buffered disk reads: 496 MB in 3.01 seconds = 164.83 MB/sec
>
> # taskset -c 24-31 hdparm -t /dev/nvme0n1
>
> /dev/nvme0n1:
> Timing buffered disk reads: 520 MB in 3.01 seconds = 172.65 MB/sec
>
> Even more mysteriously, the writing speeds are consistent across all the
> CPUs at about 800MB/sec (see the output of dd attached). Please note that
> I'm not worrying about the fine tuning of the performance at this point,
> and in particular I'm perfectly fine with 1/2 of the theoretical reading
> speed. I just want to understand where 90% of the bandwidth gets lost.
> No error of any kind appears in the syslog.
>
> I don't think this is NUMA related since the QPI interconnect runs as
> specced at 4GB/sec, when measured by Intel's Memory Latency Checker, more
> than enough for NVMe to run at full speed. Also, the CUDA benchmark test
> runs at expected speeds across the QPI.
>
> Just in case, I'm attaching the output of lstopo to this message. Please
> note that this computer has a BIOS bug that doesn't let the kernel populate
> the values of numa_node in /sys/devices/pci0000:* automatically, so I have
> to set them myself after each boot.
>
> I've tried removing all other PCI add-on cards, moving the SSD to another
> slot, changing the number of polling queues for the nvme driver, and even
> setting dm-multipath up. But none of these makes any material difference
> in reading speed.
>
> System info: Debian 11.6 (stable) running Linux 5.19.11 (config file attached)
> Output of "nvme list":
>
> Node SN Model Namespace Usage Format FW Rev
> ---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
> /dev/nvme0n1 S58SNS0R705048H Samsung SSD 970 EVO Plus 500GB 1 0.00 B / 500.11 GB 512 B + 0 B 2B2QEXM7
>
> Output of "nvme list-subsys":
>
> nvme-subsys0 - NQN=nqn.2014.08.org.nvmexpress:144d144dS58SNS0R705048H Samsung SSD 970 EVO Plus 500GB
> \
> +- nvme0 pcie 0000:03:00.0 live
>
> I would be grateful if you could point me in the right direction. I'm
> attaching outputs of the following commands to this message: dmesg,
> "cat /proc/cpuinfo", "lspci -vvv", lstopo, and dd (both for reading from
> and writing to this SSD). Please let me know if you need any other info
> from me.
>
> Thank you,
>
> Alex Shumakovitch
--
Damien Le Moal
Western Digital Research