From: Hannes Reinecke <hare@kernel.org>
To: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>, Keith Busch <kbusch@kernel.org>,
linux-nvme@lists.infradead.org, Hannes Reinecke <hare@kernel.org>
Subject: [PATCHv3 0/8] nvme-tcp: improve scalability
Date: Tue, 16 Jul 2024 09:36:08 +0200 [thread overview]
Message-ID: <20240716073616.84417-1-hare@kernel.org> (raw)
Hi all,
for workloads with a lot of controllers we run into workqueue contention,
where the single workqueue is not able to service requests fast enough,
leading to spurious I/O errors and connect resets during high load.
One culprit here was a lock contention on the callbacks, where we
acquired the 'sk_callback_lock()' on every callback. As we are dealing
with parallel rx and tx flows this induces quite a lot of contention.
I have also added instrumentation to analyse I/O flows, by adding
a I/O stall debug messages and added debugfs entries to display
detailed statistics for each queue.
All performance number are derived from the 'tiobench-example.fio'
sample from the fio sources, running on a 96 core machine with one,
two, or four subsystem and two paths, each path exposing 32 queues.
Backend is nvmet using an Intel DC P3700 NVMe SSD.
write performance:
baseline:
1 subsys, 4k seq: bw=523MiB/s (548MB/s), 16.3MiB/s-19.0MiB/s (17.1MB/s-20.0MB/s)
1 subsys, 4k rand: bw=502MiB/s (526MB/s), 15.7MiB/s-21.5MiB/s (16.4MB/s-22.5MB/s)
2 subsys, 4k seq: bw=420MiB/s (440MB/s), 2804KiB/s-4790KiB/s (2871kB/s-4905kB/s)
2 subsys, 4k rand: bw=416MiB/s (436MB/s), 2814KiB/s-5503KiB/s (2881kB/s-5635kB/s)
4 subsys, 4k seq: bw=409MiB/s (429MB/s), 1990KiB/s-8396KiB/s (2038kB/s-8598kB/s)
4 subsys, 4k rand: bw=386MiB/s (405MB/s), 2024KiB/s-6314KiB/s (2072kB/s-6466kB/s)
patched:
1 subsys, 4k seq: bw=440MiB/s (461MB/s), 13.7MiB/s-16.1MiB/s (14.4MB/s-16.8MB/s)
1 subsys, 4k rand: bw=427MiB/s (448MB/s), 13.4MiB/s-16.2MiB/s (13.0MB/s-16.0MB/s)
2 subsys, 4k seq: bw=506MiB/s (531MB/s), 3581KiB/s-4493KiB/s (3667kB/s-4601kB/s)
2 subsys, 4k rand: bw=494MiB/s (518MB/s), 3630KiB/s-4421KiB/s (3717kB/s-4528kB/s)
4 subsys, 4k seq: bw=457MiB/s (479MB/s), 2564KiB/s-8297KiB/s (2625kB/s-8496kB/s)
4 subsys, 4k rand: bw=424MiB/s (444MB/s), 2509KiB/s-9414KiB/s (2570kB/s-9640kB/s)
read performance:
baseline:
1 subsys, 4k seq: bw=389MiB/s (408MB/s), 12.2MiB/s-18.1MiB/s (12.7MB/s-18.0MB/s)
1 subsys, 4k rand: bw=430MiB/s (451MB/s), 13.5MiB/s-19.2MiB/s (14.1MB/s-20.2MB/s)
2 subsys, 4k seq: bw=377MiB/s (395MB/s), 2603KiB/s-3987KiB/s (2666kB/s-4083kB/s)
2 subsys, 4k rand: bw=377MiB/s (395MB/s), 2431KiB/s-5403KiB/s (2489kB/s-5533kB/s)
4 subsys, 4k seq: bw=139MiB/s (146MB/s), 197KiB/s-11.1MiB/s (202kB/s-11.6MB/s)
4 subsys, 4k rand: bw=352MiB/s (369MB/s), 1360KiB/s-13.9MiB/s (1392kB/s-14.6MB/s)
patched:
1 subsys, 4k seq: bw=405MiB/s (425MB/s), 2.7MiB/s-14.7MiB/s (13.3MB/s-15.4MB/s)
1 subsys, 4k rand: bw=427MiB/s (447MB/s), 13.3MiB/s-16.1MiB/s (13.0MB/s-16.9MB/s)
2 subsys, 4k seq: bw=411MiB/s (431MB/s), 2462KiB/s-4523KiB/s (2522kB/s-4632kB/s)
2 subsys, 4k rand: bw=392MiB/s (411MB/s), 2258KiB/s-4220KiB/s (2312kB/s-4321kB/s)
4 subsys, 4k seq: bw=378MiB/s (397MB/s), 1859KiB/s-8110KiB/s (1904kB/s-8305kB/s)
4 subsys, 4k rand: bw=326MiB/s (342MB/s), 1781KiB/s-4499KiB/s (1823kB/s-4607kB/s)
Keep in mind that there is a lot of fluctuation in the performance
numbers, especially in the baseline.
Changes to the initial submission:
- Make the changes independent from the 'wq_unbound' parameter
- Drop changes to the workqueue
- Add patch to improve rx/tx fairness
Changes to v2:
- Reworked patchset
- Switch deadline counter to microseconds instead of jiffies
- Add debug message for I/O stall debugging
- Add debugfs entries with I/O statistics
- Reduce callback lock contention
Hannes Reinecke (8):
nvme-tcp: switch TX deadline to microseconds and make it configurable
nvme-tcp: io_work stall debugging
nvme-tcp: re-init request list entries
nvme-tcp: improve stall debugging
nvme-tcp: debugfs entries for latency statistics
nvme-tcp: reduce callback lock contention
nvme-tcp: check for SOCK_NOSPACE before sending
nvme-tcp: align I/O cpu with blk-mq mapping
drivers/nvme/host/tcp.c | 384 ++++++++++++++++++++++++++++++++++++----
1 file changed, 351 insertions(+), 33 deletions(-)
--
2.35.3
next reply other threads:[~2024-07-16 7:36 UTC|newest]
Thread overview: 24+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-07-16 7:36 Hannes Reinecke [this message]
2024-07-16 7:36 ` [PATCH 1/8] nvme-tcp: switch TX deadline to microseconds and make it configurable Hannes Reinecke
2024-07-17 21:03 ` Sagi Grimberg
2024-07-18 6:30 ` Hannes Reinecke
2024-07-16 7:36 ` [PATCH 2/8] nvme-tcp: io_work stall debugging Hannes Reinecke
2024-07-17 21:05 ` Sagi Grimberg
2024-07-16 7:36 ` [PATCH 3/8] nvme-tcp: re-init request list entries Hannes Reinecke
2024-07-17 21:23 ` Sagi Grimberg
2024-07-16 7:36 ` [PATCH 4/8] nvme-tcp: improve stall debugging Hannes Reinecke
2024-07-17 21:11 ` Sagi Grimberg
2024-07-16 7:36 ` [PATCH 5/8] nvme-tcp: debugfs entries for latency statistics Hannes Reinecke
2024-07-17 21:14 ` Sagi Grimberg
2024-07-16 7:36 ` [PATCH 6/8] nvme-tcp: reduce callback lock contention Hannes Reinecke
2024-07-17 21:19 ` Sagi Grimberg
2024-07-18 6:42 ` Hannes Reinecke
2024-07-21 11:46 ` Sagi Grimberg
2024-07-16 7:36 ` [PATCH 7/8] nvme-tcp: check for SOCK_NOSPACE before sending Hannes Reinecke
2024-07-17 21:19 ` Sagi Grimberg
2024-07-16 7:36 ` [PATCH 8/8] nvme-tcp: align I/O cpu with blk-mq mapping Hannes Reinecke
2024-07-17 21:34 ` Sagi Grimberg
2024-08-13 19:36 ` Sagi Grimberg
2024-07-17 21:01 ` [PATCHv3 0/8] nvme-tcp: improve scalability Sagi Grimberg
2024-07-18 6:20 ` Hannes Reinecke
2024-07-21 12:05 ` Sagi Grimberg
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20240716073616.84417-1-hare@kernel.org \
--to=hare@kernel.org \
--cc=hch@lst.de \
--cc=kbusch@kernel.org \
--cc=linux-nvme@lists.infradead.org \
--cc=sagi@grimberg.me \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.