public inbox for linux-nvme@lists.infradead.org
From: Damien Le Moal <damien.lemoal@opensource.wdc.com>
To: Alexander Shumakovitch <shurik@jhu.edu>,
	"linux-nvme@lists.infradead.org" <linux-nvme@lists.infradead.org>
Subject: Re: Read speed for a PCIe NVMe SSD is ridiculously slow on a multi-socket machine.
Date: Fri, 24 Mar 2023 17:43:42 +0900	[thread overview]
Message-ID: <e2df2f18-aaf9-89d5-6fed-aa1fb663f69c@opensource.wdc.com> (raw)
In-Reply-To: <ZB1JgJ2DxyTMVUHB@hornet>

On 3/24/23 15:56, Alexander Shumakovitch wrote:
> [ please copy me on your replies since I'm not subscribed to this list ]
> 
> Hello all,
> 
> I have an oldish quad socket server (Stratos S400-X44E by Quanta, 512GB RAM,
> 4 x Xeon E5-4620) that I'm trying to upgrade with an NVMe Samsung 970 EVO
> Plus SSD, connected via an adapter card to a PCIe slot, which is wired to
> CPU #0 directly and supports PCIe 3.0 speeds. For some reason, the reading
> speed from this SSD differs by a factor of 10 (ten!), depending on which
> physical CPU hdparm or dd is run on:
>       
>     # hdparm -t /dev/nvme0n1 

It is very unusual to use hdparm, a tool designed mainly for ATA devices, to
benchmark an NVMe device. At the very least, if you really want to measure the
drive's performance, you should add the --direct option (see man hdparm) so
that reads bypass the page cache.
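For example (assuming /dev/nvme0n1 is your drive; needs root):

```shell
# O_DIRECT timed read test, bypassing the page cache
hdparm --direct -t /dev/nvme0n1
```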

But a better way to test would be to use fio with the io_uring or libaio IO
engine, doing multi-job, high-queue-depth --direct=1 IOs. That will give you
the maximum performance of your device. Then remove the --direct=1 option to
do buffered IOs, which will expose potential issues with your system's memory
bandwidth.
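Something along these lines (device path, job count, and block size are
placeholders to adapt to your system; needs root and a reasonably recent fio):

```shell
# Multi-job, high queue depth, direct sequential reads
fio --name=seqread --filename=/dev/nvme0n1 --rw=read --bs=128k \
    --ioengine=io_uring --iodepth=32 --numjobs=4 \
    --direct=1 --runtime=30 --time_based --group_reporting
```

You can combine this with taskset (e.g. "taskset -c 0-7 fio ...") to compare
sockets, and drop --direct=1 for the buffered-IO run.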

>     
>     /dev/nvme0n1:
>      Timing buffered disk reads: 510 MB in  3.01 seconds = 169.28 MB/sec
>     
>     # taskset -c 0-7 hdparm -t /dev/nvme0n1 
>     
>     /dev/nvme0n1:
>      Timing buffered disk reads: 5252 MB in  3.00 seconds = 1750.28 MB/sec
>     
>     # taskset -c 8-15 hdparm -t /dev/nvme0n1 
>     
>     /dev/nvme0n1:
>      Timing buffered disk reads: 496 MB in  3.01 seconds = 164.83 MB/sec
>     
>     # taskset -c 24-31 hdparm -t /dev/nvme0n1 
>     
>     /dev/nvme0n1:
>      Timing buffered disk reads: 520 MB in  3.01 seconds = 172.65 MB/sec
> 
> Even more mysteriously, the writing speeds are consistent across all the
> CPUs at about 800MB/sec (see the output of dd attached). Please note that
> I'm not worrying about the fine tuning of the performance at this point,
> and in particular I'm perfectly fine with 1/2 of the theoretical reading
> speed. I just want to understand where 90% of the bandwidth gets lost.
> No error of any kind appears in the syslog.
> 
> I don't think this is NUMA related since the QPI interconnect runs as
> specced at 4GB/sec, when measured by Intel's Memory Latency Checker, more
> than enough for NVMe to run at full speed. Also, the CUDA benchmark test
> runs at expected speeds across the QPI.
> 
> Just in case, I'm attaching the output of lstopo to this message. Please
> note that this computer has a BIOS bug that doesn't let the kernel populate
> the values of numa_node in /sys/devices/pci0000:* automatically, so I have
> to do this myself after each boot.
> 
> I've tried removing all other PCI add-on cards, moving the SSD to another
> slot, changing the number of polling queues for the nvme driver, and even
> setting dm-multipath up. But none of these makes any material difference
> in reading speed.
> 
> System info: Debian 11.6 (stable) running Linux 5.19.11 (config file attached)
> Output of "nvme list":
> 
>     Node             SN                   Model                                    Namespace Usage                      Format           FW Rev  
>     ---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
>     /dev/nvme0n1     S58SNS0R705048H      Samsung SSD 970 EVO Plus 500GB           1           0.00   B / 500.11  GB    512   B +  0 B   2B2QEXM7
> 
> Output of "nvme list-subsys":
> 
>     nvme-subsys0 - NQN=nqn.2014.08.org.nvmexpress:144d144dS58SNS0R705048H     Samsung SSD 970 EVO Plus 500GB          
>     \
>      +- nvme0 pcie 0000:03:00.0 live 
> 
> I would be grateful if you could point me in the right direction. I'm
> attaching outputs of the following commands to this message: dmesg,
> "cat /proc/cpuinfo", "lspci -vvv", lstopo, and dd (both for reading from
> and writing to this SSD). Please let me know if you need any other info
> from me.
> 
> Thank you,
> 
>    Alex Shumakovitch

-- 
Damien Le Moal
Western Digital Research




Thread overview: 8+ messages
     [not found] <ZB1JgJ2DxyTMVUHB@hornet>
2023-03-24  8:43 ` Damien Le Moal [this message]
2023-03-24 21:19   ` Read speed for a PCIe NVMe SSD is ridiculously slow on a multi-socket machine Alexander Shumakovitch
2023-03-25  1:52     ` Damien Le Moal
2023-03-31  7:53       ` Alexander Shumakovitch
2023-03-25  0:33   ` Alexander Shumakovitch
2023-03-25  1:56     ` Damien Le Moal
2023-03-24 19:34 ` Keith Busch
2023-03-24 21:38   ` Alexander Shumakovitch
