[PATCHv3 0/8] nvme-tcp: improve scalability


From: Hannes Reinecke <hare@kernel.org>
To: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>, Keith Busch <kbusch@kernel.org>,
	linux-nvme@lists.infradead.org, Hannes Reinecke <hare@kernel.org>
Subject: [PATCHv3 0/8] nvme-tcp: improve scalability
Date: Tue, 16 Jul 2024 09:36:08 +0200
Message-ID: <20240716073616.84417-1-hare@kernel.org>


Hi all,

for workloads with a lot of controllers we run into workqueue contention,
where the single workqueue is not able to service requests fast enough,
leading to spurious I/O errors and connection resets under high load.
One culprit was lock contention on the socket callbacks, where we
acquired 'sk_callback_lock' on every callback. As we are dealing
with parallel rx and tx flows this induces quite a lot of contention.
I have also added instrumentation to analyse I/O flows: debug messages
for I/O stalls and debugfs entries that display detailed statistics
for each queue.

All performance numbers are derived from the 'tiobench-example.fio'
sample from the fio sources, running on a 96-core machine with one,
two, or four subsystems and two paths, each path exposing 32 queues.
The backend is nvmet using an Intel DC P3700 NVMe SSD.

write performance:
baseline:
1 subsys, 4k seq:  bw=523MiB/s (548MB/s), 16.3MiB/s-19.0MiB/s (17.1MB/s-20.0MB/s)
1 subsys, 4k rand: bw=502MiB/s (526MB/s), 15.7MiB/s-21.5MiB/s (16.4MB/s-22.5MB/s)
2 subsys, 4k seq:  bw=420MiB/s (440MB/s), 2804KiB/s-4790KiB/s (2871kB/s-4905kB/s)
2 subsys, 4k rand: bw=416MiB/s (436MB/s), 2814KiB/s-5503KiB/s (2881kB/s-5635kB/s)
4 subsys, 4k seq:  bw=409MiB/s (429MB/s), 1990KiB/s-8396KiB/s (2038kB/s-8598kB/s)
4 subsys, 4k rand: bw=386MiB/s (405MB/s), 2024KiB/s-6314KiB/s (2072kB/s-6466kB/s)

patched:
1 subsys, 4k seq:  bw=440MiB/s (461MB/s), 13.7MiB/s-16.1MiB/s (14.4MB/s-16.8MB/s)
1 subsys, 4k rand: bw=427MiB/s (448MB/s), 13.4MiB/s-16.2MiB/s (13.0MB/s-16.0MB/s)
2 subsys, 4k seq:  bw=506MiB/s (531MB/s), 3581KiB/s-4493KiB/s (3667kB/s-4601kB/s)
2 subsys, 4k rand: bw=494MiB/s (518MB/s), 3630KiB/s-4421KiB/s (3717kB/s-4528kB/s)
4 subsys, 4k seq:  bw=457MiB/s (479MB/s), 2564KiB/s-8297KiB/s (2625kB/s-8496kB/s)
4 subsys, 4k rand: bw=424MiB/s (444MB/s), 2509KiB/s-9414KiB/s (2570kB/s-9640kB/s)

read performance:
baseline:
1 subsys, 4k seq:  bw=389MiB/s (408MB/s), 12.2MiB/s-18.1MiB/s (12.7MB/s-18.0MB/s)
1 subsys, 4k rand: bw=430MiB/s (451MB/s), 13.5MiB/s-19.2MiB/s (14.1MB/s-20.2MB/s)
2 subsys, 4k seq:  bw=377MiB/s (395MB/s), 2603KiB/s-3987KiB/s (2666kB/s-4083kB/s)
2 subsys, 4k rand: bw=377MiB/s (395MB/s), 2431KiB/s-5403KiB/s (2489kB/s-5533kB/s)
4 subsys, 4k seq:  bw=139MiB/s (146MB/s), 197KiB/s-11.1MiB/s (202kB/s-11.6MB/s)
4 subsys, 4k rand: bw=352MiB/s (369MB/s), 1360KiB/s-13.9MiB/s (1392kB/s-14.6MB/s)

patched:
1 subsys, 4k seq:  bw=405MiB/s (425MB/s), 12.7MiB/s-14.7MiB/s (13.3MB/s-15.4MB/s)
1 subsys, 4k rand: bw=427MiB/s (447MB/s), 13.3MiB/s-16.1MiB/s (13.0MB/s-16.9MB/s)
2 subsys, 4k seq:  bw=411MiB/s (431MB/s), 2462KiB/s-4523KiB/s (2522kB/s-4632kB/s)
2 subsys, 4k rand: bw=392MiB/s (411MB/s), 2258KiB/s-4220KiB/s (2312kB/s-4321kB/s)
4 subsys, 4k seq:  bw=378MiB/s (397MB/s), 1859KiB/s-8110KiB/s (1904kB/s-8305kB/s)
4 subsys, 4k rand: bw=326MiB/s (342MB/s), 1781KiB/s-4499KiB/s (1823kB/s-4607kB/s)

Keep in mind that there is a lot of fluctuation in the performance
numbers, especially in the baseline.
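
To make the aggregate numbers easier to compare, here is a minimal sketch
that computes the relative change between baseline and patched runs. It
uses only the aggregate 'bw=' values in MiB/s from the summaries above;
the helper names (pct_change, report) are made up for illustration and
are not part of the patchset or of fio.

```python
# Aggregate bandwidth in MiB/s, copied from the fio summaries above.
# Order: 1, 2, and 4 subsystems, each as (4k seq, 4k rand).
RESULTS = {
    "write": {
        "baseline": [523, 502, 420, 416, 409, 386],
        "patched": [440, 427, 506, 494, 457, 424],
    },
    "read": {
        "baseline": [389, 430, 377, 377, 139, 352],
        "patched": [405, 427, 411, 392, 378, 326],
    },
}

LABELS = [
    "1 subsys seq", "1 subsys rand",
    "2 subsys seq", "2 subsys rand",
    "4 subsys seq", "4 subsys rand",
]


def pct_change(baseline, patched):
    """Relative change of the patched value against the baseline, in percent."""
    return (patched - baseline) / baseline * 100.0


def report(results=RESULTS):
    """Return one formatted line per workload with the relative change."""
    lines = []
    for op, runs in results.items():
        for label, base, new in zip(LABELS, runs["baseline"], runs["patched"]):
            delta = pct_change(base, new)
            lines.append(
                f"{op:5s} {label:14s} {base:4d} -> {new:4d} MiB/s ({delta:+.1f}%)"
            )
    return lines


if __name__ == "__main__":
    print("\n".join(report()))
```

These are single runs, so given the fluctuation mentioned above the
percentages should be read as rough indications only.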

Changes to the initial submission:
- Make the changes independent from the 'wq_unbound' parameter
- Drop changes to the workqueue
- Add patch to improve rx/tx fairness

Changes to v2:
- Reworked patchset
- Switch deadline counter to microseconds instead of jiffies
- Add debug message for I/O stall debugging
- Add debugfs entries with I/O statistics
- Reduce callback lock contention
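
As background for the microseconds change: a jiffy lasts 1/HZ seconds, so
the resolution of a jiffies-based deadline depends on the kernel
configuration. The sketch below is plain arithmetic to illustrate that
granularity; the HZ values and the 500 usec deadline are illustrative
assumptions and are not taken from the patches.

```python
USEC_PER_SEC = 1_000_000


def jiffy_usecs(hz):
    """Length of one jiffy in microseconds for a given HZ."""
    return USEC_PER_SEC / hz


def usecs_to_jiffies_ceil(usecs, hz):
    """Smallest whole number of jiffies covering 'usecs' (rounds up)."""
    return -((-usecs * hz) // USEC_PER_SEC)


if __name__ == "__main__":
    # Illustrative HZ settings; the patchset does not name one.
    for hz in (100, 250, 1000):
        print(
            f"HZ={hz:4d}: 1 jiffy = {jiffy_usecs(hz):7.0f} usec, "
            f"500 usec deadline -> {usecs_to_jiffies_ceil(500, hz)} jiffy"
        )
```

With any of these settings a sub-millisecond deadline rounds up to a full
jiffy, which is the kind of coarseness a microsecond counter avoids.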

Hannes Reinecke (8):
  nvme-tcp: switch TX deadline to microseconds and make it configurable
  nvme-tcp: io_work stall debugging
  nvme-tcp: re-init request list entries
  nvme-tcp: improve stall debugging
  nvme-tcp: debugfs entries for latency statistics
  nvme-tcp: reduce callback lock contention
  nvme-tcp: check for SOCK_NOSPACE before sending
  nvme-tcp: align I/O cpu with blk-mq mapping

 drivers/nvme/host/tcp.c | 384 ++++++++++++++++++++++++++++++++++++----
 1 file changed, 351 insertions(+), 33 deletions(-)

-- 
2.35.3
