From: Alexander Shumakovitch <shurik@jhu.edu>
To: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Cc: "linux-nvme@lists.infradead.org" <linux-nvme@lists.infradead.org>
Subject: Re: Read speed for a PCIe NVMe SSD is ridiculously slow on a multi-socket machine.
Date: Fri, 24 Mar 2023 21:19:09 +0000 [thread overview]
Message-ID: <ZB4Ty1L8Xa/qMQiV@hornet> (raw)
In-Reply-To: <e2df2f18-aaf9-89d5-6fed-aa1fb663f69c@opensource.wdc.com>
Hi Damien,
Thanks a lot for your thoughtful reply. I used hdparm and dd to benchmark
the performance mainly because they are included with every live distro:
I didn't want to install an OS before confirming that the hardware works
as expected.
Back to the main topic: it didn't occur to me that the --direct option could
have such a profound impact on read speeds, but it does. With this option
enabled, most of the discrepancies in read speeds between the nodes
disappear. The same happens when using dd with "iflag=direct". Does this
imply that the issue lies in accessing the kernel's page cache? On the other
hand, MLC shows completely reasonable latency and bandwidth numbers between
the nodes, see below.
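For concreteness, these are the kinds of invocations being compared (the
device path is the one from the original report; the numactl pinning is my
addition to make the per-node comparison explicit, assuming numactl is
available on the live system):

```shell
# Buffered read: goes through the kernel page cache.
hdparm -t /dev/nvme0n1

# Direct read: O_DIRECT bypasses the page cache entirely.
hdparm --direct -t /dev/nvme0n1

# The dd equivalents, pinned to a specific NUMA node so the node
# dependence of the buffered path can be observed directly.
numactl --cpunodebind=2 --membind=2 \
    dd if=/dev/nvme0n1 of=/dev/null bs=1M count=4096
numactl --cpunodebind=2 --membind=2 \
    dd if=/dev/nvme0n1 of=/dev/null bs=1M count=4096 iflag=direct
```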
So what could be the culprit, and in which direction should I continue
digging? If hdparm and dd have trouble accessing the page cache efficiently,
then so will every other read-intensive program. Could this be caused by a
lack of (correct) NUMA affinity for certain IRQs? I understand that this
question may no longer be NVMe-specific, but I would be grateful for any
pointers.
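To check the IRQ-affinity theory, something along these lines should show
which CPUs service the NVMe completion interrupts and which node the SSD is
attached to (a sketch; the nvme0 name matches the report, but sysfs paths
can vary by kernel version):

```shell
# List the IRQ numbers assigned to the nvme queues.
grep nvme /proc/interrupts

# For each nvme IRQ, print the CPUs that actually service it.
for irq in $(grep nvme /proc/interrupts | awk -F: '{print $1}'); do
    echo "IRQ $irq -> CPUs $(cat /proc/irq/$irq/effective_affinity_list)"
done

# The NUMA node the SSD's PCIe slot is wired to (should be 0 here).
cat /sys/class/nvme/nvme0/device/numa_node
```

If the effective affinities all land on one node while the reader runs on
another, that would at least explain part of the asymmetry.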
Thank you,
--- Alex.
# ./mlc --bandwidth_matrix
Intel(R) Memory Latency Checker - v3.10
Command line parameters: --bandwidth_matrix
Using buffer size of 100.000MiB/thread for reads and an additional 100.000MiB/thread for writes
Measuring Memory Bandwidths between nodes within system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
                Numa node
Numa node            0       1       2       3
       0       25328.8  4131.8  4013.0  4541.0
       1        4180.3 24696.3  4501.2  3996.3
       2        4017.7  4535.5 25746.4  4105.7
       3        4488.1  4024.0  4157.0 25467.7
# ./mlc --latency_matrix
Intel(R) Memory Latency Checker - v3.10
Command line parameters: --latency_matrix
Using buffer size of 200.000MiB
Measuring idle latencies for sequential access (in ns)...
                Numa node
Numa node            0       1       2       3
       0          71.7   245.9   257.5   239.5
       1         156.4    71.8   238.3   256.3
       2         250.6   237.9    71.8   245.1
       3         238.4   252.5   237.9    71.9
On Fri, Mar 24, 2023 at 05:43:42PM +0900, Damien Le Moal wrote:
>
> On 3/24/23 15:56, Alexander Shumakovitch wrote:
> > [ please copy me on your replies since I'm not subscribed to this list ]
> >
> > Hello all,
> >
> > I have an oldish quad socket server (Stratos S400-X44E by Quanta, 512GB RAM,
> > 4 x Xeon E5-4620) that I'm trying to upgrade with an NVMe Samsung 970 EVO
> > Plus SSD, connected via an adapter card to a PCIe slot, which is wired to
> > CPU #0 directly and supports PCIe 3.0 speeds. For some reason, the reading
> > speed from this SSD differs by a factor of 10 (ten!), depending on which
> > physical CPU hdparm or dd is run on:
> >
> > # hdparm -t /dev/nvme0n1
>
> It is very unusual to use hdparm, a tool designed mainly for ATA devices, to
> benchmark an NVMe device. At the very least, if you really want to measure
> the drive's performance, you should add the --direct option (see man hdparm).
>
> But a better way to test would be to use fio with the io_uring or libaio IO
> engine, doing multi-job, high-QD --direct=1 IOs. That will give you the
> maximum performance of your device. Then remove the --direct=1 option to do
> buffered IOs, which will expose potential issues with your system's memory
> bandwidth.
>
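A fio run along the lines suggested above might look like this (a sketch:
the job name, block size, queue depth, and job count are my guesses at
reasonable values, not something from the thread):

```shell
# Multi-job, high-queue-depth direct reads with libaio.
fio --name=randread --filename=/dev/nvme0n1 --rw=randread \
    --bs=128k --ioengine=libaio --iodepth=32 --numjobs=4 \
    --direct=1 --runtime=30 --time_based --group_reporting

# Same job without --direct=1 exercises the page cache; running it
# under numactl --cpunodebind=N reproduces the per-node comparison.
```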
Thread overview: 8+ messages
[not found] <ZB1JgJ2DxyTMVUHB@hornet>
2023-03-24 8:43 ` Read speed for a PCIe NVMe SSD is ridiculously slow on a multi-socket machine Damien Le Moal
2023-03-24 21:19 ` Alexander Shumakovitch [this message]
2023-03-25 1:52 ` Damien Le Moal
2023-03-31 7:53 ` Alexander Shumakovitch
2023-03-25 0:33 ` Alexander Shumakovitch
2023-03-25 1:56 ` Damien Le Moal
2023-03-24 19:34 ` Keith Busch
2023-03-24 21:38 ` Alexander Shumakovitch