public inbox for linux-nvme@lists.infradead.org
From: Hannes Reinecke <hare@kernel.org>
To: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>, Keith Busch <kbusch@kernel.org>,
	linux-nvme@lists.infradead.org, Hannes Reinecke <hare@suse.de>
Subject: [PATCH 0/7] nvme-tcp scalability improvements
Date: Wed, 26 Jun 2024 14:13:40 +0200	[thread overview]
Message-ID: <20240626121347.1116-1-hare@kernel.org> (raw)

From: Hannes Reinecke <hare@suse.de>

Hi all,

we have had reports from partners that nvme-tcp suffers from scalability
problems as the number of controllers grows; they even managed to trigger
request timeouts simply by connecting enough controllers to the host.

Looking into it I have found several issues with the nvme-tcp implementation:
- the 'io_cpu' assignment is static, so the same calculation is done
  for each controller. Thus each queue with the same queue number is
  assigned the same CPU, leading to CPU starvation.
- The blk-mq CPU mapping is not taken into account when calculating
  the 'io_cpu' number, leading to excessive thread bouncing during I/O.
- The socket state is not evaluated, so we keep piling more and more
  requests onto the socket even when it is already full.

This patchset addresses these issues, leading to a better I/O
distribution for several controllers.

Performance for read increases from:
4k seq read: bw=368MiB/s (386MB/s), 11.5MiB/s-12.7MiB/s
  (12.1MB/s-13.3MB/s), io=16.0GiB (17.2GB), run=40444-44468msec
4k rand read: bw=360MiB/s (378MB/s), 11.3MiB/s-12.1MiB/s
  (11.8MB/s-12.7MB/s), io=16.0GiB (17.2GB), run=42310-45502msec
to:
4k seq read: bw=520MiB/s (545MB/s), 16.3MiB/s-21.1MiB/s
  (17.0MB/s-22.2MB/s), io=16.0GiB (17.2GB), run=24208-31505msec
4k rand read: bw=533MiB/s (559MB/s), 16.7MiB/s-22.2MiB/s
  (17.5MB/s-23.3MB/s), io=16.0GiB (17.2GB), run=23014-30731msec

However, peak write performance degrades from:
4k seq write: bw=657MiB/s (689MB/s), 20.5MiB/s-20.7MiB/s
  (21.5MB/s-21.8MB/s), io=16.0GiB (17.2GB), run=24678-24950msec
4k rand write: bw=687MiB/s (720MB/s), 21.5MiB/s-21.7MiB/s
  (22.5MB/s-22.8MB/s), io=16.0GiB (17.2GB), run=23559-23859msec
to:
4k seq write: bw=535MiB/s (561MB/s), 16.7MiB/s-19.9MiB/s
  (17.5MB/s-20.9MB/s), io=16.0GiB (17.2GB), run=25707-30624msec
4k rand write: bw=560MiB/s (587MB/s), 17.5MiB/s-22.3MiB/s
  (18.4MB/s-23.4MB/s), io=16.0GiB (17.2GB), run=22977-29248msec

This is not surprising: the original implementation pushed as many
writes as possible to the workqueue with complete disregard for the
utilisation of the queue, which is precisely the issue being
addressed here.

Hannes Reinecke (5):
  nvme-tcp: align I/O cpu with blk-mq mapping
  nvme-tcp: distribute queue affinity
  nvmet-tcp: add wq_unbound module parameter
  nvme-tcp: SOCK_NOSPACE handling
  nvme-tcp: make softirq_rx the default

Sagi Grimberg (2):
  net: micro-optimize skb_datagram_iter
  nvme-tcp: receive data in softirq

 drivers/nvme/host/tcp.c   | 126 ++++++++++++++++++++++++++++----------
 drivers/nvme/target/tcp.c |  34 +++++++---
 net/core/datagram.c       |   4 +-
 3 files changed, 122 insertions(+), 42 deletions(-)

-- 
2.35.3



             reply	other threads:[~2024-06-26 12:14 UTC|newest]

Thread overview: 15+ messages
2024-06-26 12:13 Hannes Reinecke [this message]
2024-06-26 12:13 ` [PATCH 1/7] nvme-tcp: align I/O cpu with blk-mq mapping Hannes Reinecke
2024-06-26 12:13 ` [PATCH 2/7] nvme-tcp: distribute queue affinity Hannes Reinecke
2024-06-26 13:38   ` Sagi Grimberg
2024-06-26 12:13 ` [PATCH 3/7] net: micro-optimize skb_datagram_iter Hannes Reinecke
2024-06-26 13:38   ` Sagi Grimberg
2024-06-26 12:13 ` [PATCH 4/7] nvme-tcp: receive data in softirq Hannes Reinecke
2024-06-26 12:13 ` [PATCH 5/7] nvmet-tcp: add wq_unbound module parameter Hannes Reinecke
2024-06-26 13:44   ` Sagi Grimberg
2024-06-26 12:13 ` [PATCH 6/7] nvme-tcp: SOCK_NOSPACE handling Hannes Reinecke
2024-06-26 13:45   ` Sagi Grimberg
2024-06-26 12:13 ` [PATCH 7/7] nvme-tcp: make softirq_rx the default Hannes Reinecke
2024-06-26 13:46   ` Sagi Grimberg
2024-06-26 13:37 ` [PATCH 0/7] nvme-tcp scalability improvements Sagi Grimberg
2024-06-26 14:27   ` Hannes Reinecke
