* [PATCH 00/11] NVMe over MPTCP
From: Geliang Tang @ 2026-05-28 3:10 UTC (permalink / raw)
To: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
Chaitanya Kulkarni, Matthieu Baerts, Mat Martineau, Geliang Tang,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Shuah Khan
Cc: Geliang Tang, linux-nvme, netdev, mptcp, linux-kselftest,
Hannes Reinecke, John Meneghini, Randy Jennings, Nilay Shroff,
zhenwei pi, Hui Zhu, Gang Yan
From: Geliang Tang <tanggeliang@kylinos.cn>
This series (previously named "MPTCP support to NVMe over TCP") had three
RFC versions sent to Hannes in May 2025, with subsequent revisions based on
his input. Following that, I initiated the process of upstreaming the
dependent "mptcp: implement .read_sock" series, which was merged into the
Linux kernel in February 2026.
After several rounds of iteration on the MPTCP mailing list, this set
addresses all the reviewer comments (including Sashiko's) and fixes the
identified issues.
This topic was presented as a discussion item at LSF/MM/BPF 2026.
During the "NVMe over MPTCP" [1] discussion at the conference, it was
concluded that MPTCP should be treated as a new transport type, rather than
a TCP variant. A request will be submitted to the NVMe working group to
officially allocate a transport value for MPTCP.
This series runs without any user space changes (libnvme, nvme-cli).
Later, MPTCP KTLS support will be added, and a follow-up series will be
sent to enable TLS for NVMe over MPTCP.
Based on NVMe Multipath and Block Multiqueue, each TCP queue is converted
into one MPTCP queue. This is achieved by abstracting six socket helpers
(set_nodelay, set_reuseaddr, no_linger, etc.) into per-transport
structures. Inside each MPTCP queue, multiple subflows using different
IP addresses aggregate multi-NIC bandwidth and provide fail-over
resilience.
Patch 10 demonstrates that with a single NVMe multipath configuration and
four network interfaces, MPTCP achieves four times the bandwidth of TCP.
Patch 11 demonstrates that with four NVMe multipath paths, using the
round-robin I/O policy and a lossy four-interface environment, MPTCP
still achieves four times the bandwidth of TCP.
[1]
https://lore.kernel.org/linux-nvme/a9f115aa5719e1088702a3fdeee766a3166611b1.camel@kernel.org/
Cc: Hannes Reinecke <hare@suse.de>
Cc: John Meneghini <jmeneghi@redhat.com>
Cc: Randy Jennings <randyj@purestorage.com>
Cc: Nilay Shroff <nilay@linux.ibm.com>
Co-developed-by: zhenwei pi <zhenwei.pi@linux.dev>
Signed-off-by: zhenwei pi <zhenwei.pi@linux.dev>
Co-developed-by: Hui Zhu <zhuhui@kylinos.cn>
Signed-off-by: Hui Zhu <zhuhui@kylinos.cn>
Co-developed-by: Gang Yan <yangang@kylinos.cn>
Signed-off-by: Gang Yan <yangang@kylinos.cn>
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Geliang Tang (11):
nvmet-tcp: define accept tcp_proto struct
nvmet-tcp: implement accept mptcp proto
nvmet-tcp: define listen socket ops
nvmet-tcp: register target mptcp transport
nvmet-tcp: implement mptcp listen socket ops
nvme-fabrics: compare transport in ip_options_match
nvme-tcp: define host tcp_proto struct
nvme-tcp: register host mptcp transport
nvme-tcp: implement host mptcp proto
selftests: mptcp: add nvme over mptcp test
selftests: mptcp: nvme: add iopolicy tests
drivers/nvme/host/fabrics.c | 1 +
drivers/nvme/host/tcp.c | 101 ++++-
drivers/nvme/target/configfs.c | 1 +
drivers/nvme/target/tcp.c | 128 +++++-
include/linux/nvme.h | 1 +
include/net/mptcp.h | 31 ++
net/mptcp/sockopt.c | 149 +++++++
tools/testing/selftests/net/mptcp/Makefile | 1 +
tools/testing/selftests/net/mptcp/config | 8 +
.../testing/selftests/net/mptcp/mptcp_lib.sh | 12 +
.../testing/selftests/net/mptcp/mptcp_nvme.sh | 397 ++++++++++++++++++
11 files changed, 813 insertions(+), 17 deletions(-)
create mode 100755 tools/testing/selftests/net/mptcp/mptcp_nvme.sh
--
2.53.0
* [PATCH 01/11] nvmet-tcp: define accept tcp_proto struct
From: Geliang Tang @ 2026-05-28 3:10 UTC (permalink / raw)
To: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
Chaitanya Kulkarni, Matthieu Baerts, Mat Martineau, Geliang Tang,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Shuah Khan
Cc: Geliang Tang, linux-nvme, netdev, mptcp, linux-kselftest,
Hannes Reinecke, John Meneghini, Randy Jennings, Nilay Shroff,
zhenwei pi, Hui Zhu, Gang Yan
From: Geliang Tang <tanggeliang@kylinos.cn>
To handle accepted sockets, this patch adds struct nvmet_tcp_proto to
hold accept socket operations (no_linger, set_priority, set_tos, ops).
A proto field is added to struct nvmet_tcp_queue, which points to the
appropriate protocol structure. A TCP version is defined and assigned
to queue->proto for TCP connections.
Also modify nvmet_tcp_set_queue_sock() and nvmet_tcp_done_recv_pdu()
to use queue->proto for socket operations and fabrics callbacks.
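The dispatch pattern described above can be sketched in user space as follows. This is only an illustration under stated assumptions: struct toy_sock, struct toy_proto and every toy_* function are hypothetical stand-ins for struct sock, struct nvmet_tcp_proto and the kernel helpers, and the protocol numbers are hard-coded rather than taken from kernel headers.

```c
#include <stddef.h>

#define TOY_IPPROTO_TCP   6	/* same value as IPPROTO_TCP */
#define TOY_IPPROTO_MPTCP 262	/* same value as IPPROTO_MPTCP */

/* Hypothetical stand-in for struct sock; it only records what was applied. */
struct toy_sock {
	int protocol;
	int linger_off;
	unsigned int priority;
};

/* Hypothetical analogue of struct nvmet_tcp_proto. */
struct toy_proto {
	void (*no_linger)(struct toy_sock *sk);
	void (*set_priority)(struct toy_sock *sk, unsigned int priority);
};

static void toy_tcp_no_linger(struct toy_sock *sk)
{
	sk->linger_off = 1;
}

static void toy_tcp_set_priority(struct toy_sock *sk, unsigned int priority)
{
	sk->priority = priority;
}

static const struct toy_proto toy_tcp_proto = {
	.no_linger	= toy_tcp_no_linger,
	.set_priority	= toy_tcp_set_priority,
};

/* Pick the ops table from the socket protocol; only TCP is known here. */
static const struct toy_proto *toy_select_proto(const struct toy_sock *sk)
{
	if (sk->protocol == TOY_IPPROTO_TCP)
		return &toy_tcp_proto;
	return NULL;	/* the patch fails queue allocation with -EINVAL */
}

/* Callers go through the table and never name a protocol directly. */
static int toy_setup_queue(struct toy_sock *sk, unsigned int so_priority)
{
	const struct toy_proto *proto = toy_select_proto(sk);

	if (!proto)
		return -1;
	proto->no_linger(sk);
	if (so_priority > 0)
		proto->set_priority(sk, so_priority);
	return 0;
}
```

With only a TCP table defined, a socket of any other protocol is rejected, which matches the state of the driver after this patch and before the MPTCP table is added.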
Cc: Hannes Reinecke <hare@suse.de>
Cc: John Meneghini <jmeneghi@redhat.com>
Cc: Randy Jennings <randyj@purestorage.com>
Cc: Nilay Shroff <nilay@linux.ibm.com>
Co-developed-by: zhenwei pi <zhenwei.pi@linux.dev>
Signed-off-by: zhenwei pi <zhenwei.pi@linux.dev>
Co-developed-by: Hui Zhu <zhuhui@kylinos.cn>
Signed-off-by: Hui Zhu <zhuhui@kylinos.cn>
Co-developed-by: Gang Yan <yangang@kylinos.cn>
Signed-off-by: Gang Yan <yangang@kylinos.cn>
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
---
drivers/nvme/target/tcp.c | 40 +++++++++++++++++++++++++++++++++------
1 file changed, 34 insertions(+), 6 deletions(-)
diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
index 20f150d17a96..01c23fb15b79 100644
--- a/drivers/nvme/target/tcp.c
+++ b/drivers/nvme/target/tcp.c
@@ -145,6 +145,13 @@ enum nvmet_tcp_queue_state {
NVMET_TCP_Q_FAILED,
};
+struct nvmet_tcp_proto {
+ void (*no_linger)(struct sock *sk);
+ void (*set_priority)(struct sock *sk, u32 priority);
+ void (*set_tos)(struct sock *sk);
+ const struct nvmet_fabrics_ops *ops;
+};
+
struct nvmet_tcp_queue {
struct socket *sock;
struct nvmet_tcp_port *port;
@@ -196,6 +203,7 @@ struct nvmet_tcp_queue {
void (*data_ready)(struct sock *);
void (*state_change)(struct sock *);
void (*write_space)(struct sock *);
+ const struct nvmet_tcp_proto *proto;
};
struct nvmet_tcp_port {
@@ -1081,7 +1089,8 @@ static int nvmet_tcp_done_recv_pdu(struct nvmet_tcp_queue *queue)
req = &queue->cmd->req;
memcpy(req->cmd, nvme_cmd, sizeof(*nvme_cmd));
- if (unlikely(!nvmet_req_init(req, &queue->nvme_sq, &nvmet_tcp_ops))) {
+ if (unlikely(!nvmet_req_init(req, &queue->nvme_sq,
+ queue->proto->ops))) {
pr_err("failed cmd %p id %d opcode %d, data_len: %d, status: %04x\n",
req->cmd, req->cmd->common.command_id,
req->cmd->common.opcode,
@@ -1698,7 +1707,6 @@ static void nvmet_tcp_state_change(struct sock *sk)
static int nvmet_tcp_set_queue_sock(struct nvmet_tcp_queue *queue)
{
struct socket *sock = queue->sock;
- struct inet_sock *inet = inet_sk(sock->sk);
int ret;
ret = kernel_getsockname(sock,
@@ -1716,14 +1724,13 @@ static int nvmet_tcp_set_queue_sock(struct nvmet_tcp_queue *queue)
* close. This is done to prevent stale data from being sent should
* the network connection be restored before TCP times out.
*/
- sock_no_linger(sock->sk);
+ queue->proto->no_linger(sock->sk);
if (so_priority > 0)
- sock_set_priority(sock->sk, so_priority);
+ queue->proto->set_priority(sock->sk, so_priority);
/* Set socket type of service */
- if (inet->rcv_tos > 0)
- ip_sock_set_tos(sock->sk, inet->rcv_tos);
+ queue->proto->set_tos(sock->sk);
ret = 0;
write_lock_bh(&sock->sk->sk_callback_lock);
@@ -1906,6 +1913,21 @@ static int nvmet_tcp_tls_handshake(struct nvmet_tcp_queue *queue)
static void nvmet_tcp_tls_handshake_timeout(struct work_struct *w) {}
#endif
+static void tcp_sock_set_tos(struct sock *sk)
+{
+ struct inet_sock *inet = inet_sk(sk);
+
+ if (inet->rcv_tos > 0)
+ ip_sock_set_tos(sk, inet->rcv_tos);
+}
+
+static const struct nvmet_tcp_proto nvmet_tcp_proto = {
+ .no_linger = sock_no_linger,
+ .set_priority = sock_set_priority,
+ .set_tos = tcp_sock_set_tos,
+ .ops = &nvmet_tcp_ops,
+};
+
static void nvmet_tcp_alloc_queue(struct nvmet_tcp_port *port,
struct socket *newsock)
{
@@ -1923,6 +1945,12 @@ static void nvmet_tcp_alloc_queue(struct nvmet_tcp_port *port,
INIT_WORK(&queue->io_work, nvmet_tcp_io_work);
kref_init(&queue->kref);
queue->sock = newsock;
+ if (newsock->sk->sk_protocol == IPPROTO_TCP) {
+ queue->proto = &nvmet_tcp_proto;
+ } else {
+ ret = -EINVAL;
+ goto out_free_queue;
+ }
queue->port = port;
queue->nr_cmds = 0;
spin_lock_init(&queue->state_lock);
--
2.53.0
* [PATCH 02/11] nvmet-tcp: implement accept mptcp proto
From: Geliang Tang @ 2026-05-28 3:10 UTC (permalink / raw)
To: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
Chaitanya Kulkarni, Matthieu Baerts, Mat Martineau, Geliang Tang,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Shuah Khan
Cc: Geliang Tang, linux-nvme, netdev, mptcp, linux-kselftest,
Hannes Reinecke, John Meneghini, Randy Jennings, Nilay Shroff,
zhenwei pi, Hui Zhu, Gang Yan
From: Geliang Tang <tanggeliang@kylinos.cn>
An MPTCP-specific version of struct nvmet_tcp_proto is implemented for
accept sockets. It is assigned to queue->proto when the accepted socket
protocol is IPPROTO_MPTCP.
Dedicated MPTCP helpers are introduced for setting accept socket options.
These helpers (no_linger, set_priority, set_tos) apply the value to the
MPTCP socket and to every existing subflow via mptcp_for_each_subflow().
Subflows created later inherit the values through sync_socket_options().
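The propagation scheme can be pictured with a small user-space model. It is a toy sketch, not the MPTCP implementation: struct toy_msk, struct toy_subflow and the toy_* functions are hypothetical, a fixed array replaces the kernel subflow list, and all locking (lock_sock() and lock_sock_nested() in the real helpers) is left out.

```c
#include <stddef.h>

#define TOY_MAX_SUBFLOWS 8

/* Hypothetical stand-in for one TCP subflow socket. */
struct toy_subflow {
	unsigned int priority;
};

/* Hypothetical stand-in for the MPTCP-level socket. */
struct toy_msk {
	unsigned int priority;		/* value kept on the parent socket */
	unsigned int setsockopt_seq;	/* bumped on every option change */
	size_t nr_subflows;
	struct toy_subflow subflows[TOY_MAX_SUBFLOWS];
};

/* Set the option on the parent, then on every subflow that exists now. */
static void toy_msk_set_priority(struct toy_msk *msk, unsigned int priority)
{
	size_t i;

	msk->setsockopt_seq++;
	msk->priority = priority;
	for (i = 0; i < msk->nr_subflows; i++)
		msk->subflows[i].priority = priority;
}

/* A subflow created later copies the parent's value when it is set up. */
static int toy_msk_add_subflow(struct toy_msk *msk)
{
	struct toy_subflow *sf;

	if (msk->nr_subflows >= TOY_MAX_SUBFLOWS)
		return -1;
	sf = &msk->subflows[msk->nr_subflows++];
	sf->priority = msk->priority;
	return 0;
}
```

Keeping the value on the parent socket is what lets later subflows pick it up without the caller having to repeat the option.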
Cc: Hannes Reinecke <hare@suse.de>
Cc: John Meneghini <jmeneghi@redhat.com>
Cc: Randy Jennings <randyj@purestorage.com>
Cc: Nilay Shroff <nilay@linux.ibm.com>
Co-developed-by: zhenwei pi <zhenwei.pi@linux.dev>
Signed-off-by: zhenwei pi <zhenwei.pi@linux.dev>
Co-developed-by: Hui Zhu <zhuhui@kylinos.cn>
Signed-off-by: Hui Zhu <zhuhui@kylinos.cn>
Co-developed-by: Gang Yan <yangang@kylinos.cn>
Signed-off-by: Gang Yan <yangang@kylinos.cn>
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
---
drivers/nvme/target/tcp.c | 16 ++++++++
include/net/mptcp.h | 12 ++++++
net/mptcp/sockopt.c | 79 +++++++++++++++++++++++++++++++++++++++
3 files changed, 107 insertions(+)
diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
index 01c23fb15b79..16f153a9772b 100644
--- a/drivers/nvme/target/tcp.c
+++ b/drivers/nvme/target/tcp.c
@@ -220,6 +220,9 @@ static DEFINE_MUTEX(nvmet_tcp_queue_mutex);
static struct workqueue_struct *nvmet_tcp_wq;
static const struct nvmet_fabrics_ops nvmet_tcp_ops;
+#ifdef CONFIG_MPTCP
+static const struct nvmet_fabrics_ops nvmet_mptcp_ops;
+#endif
static void nvmet_tcp_free_cmd(struct nvmet_tcp_cmd *c);
static void nvmet_tcp_free_cmd_buffers(struct nvmet_tcp_cmd *cmd);
@@ -1928,6 +1931,15 @@ static const struct nvmet_tcp_proto nvmet_tcp_proto = {
.ops = &nvmet_tcp_ops,
};
+#ifdef CONFIG_MPTCP
+static const struct nvmet_tcp_proto nvmet_mptcp_proto = {
+ .no_linger = mptcp_sock_no_linger,
+ .set_priority = mptcp_sock_set_priority,
+ .set_tos = mptcp_sock_set_tos,
+ .ops = &nvmet_mptcp_ops,
+};
+#endif
+
static void nvmet_tcp_alloc_queue(struct nvmet_tcp_port *port,
struct socket *newsock)
{
@@ -1947,6 +1959,10 @@ static void nvmet_tcp_alloc_queue(struct nvmet_tcp_port *port,
queue->sock = newsock;
if (newsock->sk->sk_protocol == IPPROTO_TCP) {
queue->proto = &nvmet_tcp_proto;
+#ifdef CONFIG_MPTCP
+ } else if (newsock->sk->sk_protocol == IPPROTO_MPTCP) {
+ queue->proto = &nvmet_mptcp_proto;
+#endif
} else {
ret = -EINVAL;
goto out_free_queue;
diff --git a/include/net/mptcp.h b/include/net/mptcp.h
index aef2dbeb847b..bf74dedc578d 100644
--- a/include/net/mptcp.h
+++ b/include/net/mptcp.h
@@ -233,6 +233,12 @@ static inline __be32 mptcp_reset_option(const struct sk_buff *skb)
}
void mptcp_active_detect_blackhole(struct sock *sk, bool expired);
+
+void mptcp_sock_no_linger(struct sock *sk);
+
+void mptcp_sock_set_priority(struct sock *sk, u32 priority);
+
+void mptcp_sock_set_tos(struct sock *sk);
#else
static inline void mptcp_init(void)
@@ -319,6 +325,12 @@ static inline struct request_sock *mptcp_subflow_reqsk_alloc(const struct reques
static inline __be32 mptcp_reset_option(const struct sk_buff *skb) { return htonl(0u); }
static inline void mptcp_active_detect_blackhole(struct sock *sk, bool expired) { }
+
+static inline void mptcp_sock_no_linger(struct sock *sk) { }
+
+static inline void mptcp_sock_set_priority(struct sock *sk, u32 priority) { }
+
+static inline void mptcp_sock_set_tos(struct sock *sk) { }
#endif /* CONFIG_MPTCP */
#if IS_ENABLED(CONFIG_MPTCP_IPV6)
diff --git a/net/mptcp/sockopt.c b/net/mptcp/sockopt.c
index 87b5796d0135..359b1eb2d0a9 100644
--- a/net/mptcp/sockopt.c
+++ b/net/mptcp/sockopt.c
@@ -1662,3 +1662,82 @@ int mptcp_set_rcvlowat(struct sock *sk, int val)
}
return 0;
}
+
+void mptcp_sock_no_linger(struct sock *sk)
+{
+ struct mptcp_sock *msk = mptcp_sk(sk);
+ struct mptcp_subflow_context *subflow;
+ struct sock *ssk;
+
+ lock_sock(sk);
+ sockopt_seq_inc(msk);
+ WRITE_ONCE(sk->sk_lingertime, 0);
+ sock_set_flag(sk, SOCK_LINGER);
+ mptcp_for_each_subflow(msk, subflow) {
+ ssk = mptcp_subflow_tcp_sock(subflow);
+ if (ssk) {
+ lock_sock_nested(ssk, SINGLE_DEPTH_NESTING);
+ WRITE_ONCE(ssk->sk_lingertime, 0);
+ sock_set_flag(ssk, SOCK_LINGER);
+ release_sock(ssk);
+ }
+ }
+ release_sock(sk);
+}
+EXPORT_SYMBOL(mptcp_sock_no_linger);
+
+void mptcp_sock_set_priority(struct sock *sk, u32 priority)
+{
+ struct mptcp_sock *msk = mptcp_sk(sk);
+ struct mptcp_subflow_context *subflow;
+ struct sock *ssk;
+
+ lock_sock(sk);
+ sockopt_seq_inc(msk);
+ sock_set_priority(sk, priority);
+ mptcp_for_each_subflow(msk, subflow) {
+ ssk = mptcp_subflow_tcp_sock(subflow);
+ if (ssk) {
+ lock_sock_nested(ssk, SINGLE_DEPTH_NESTING);
+ sock_set_priority(ssk, priority);
+ release_sock(ssk);
+ }
+ }
+ release_sock(sk);
+}
+EXPORT_SYMBOL(mptcp_sock_set_priority);
+
+static void __mptcp_sock_set_tos(struct sock *sk, int val)
+{
+ struct mptcp_sock *msk = mptcp_sk(sk);
+ struct mptcp_subflow_context *subflow;
+ struct sock *ssk;
+
+ lock_sock(sk);
+ sockopt_seq_inc(msk);
+ __ip_sock_set_tos(sk, val);
+ mptcp_for_each_subflow(msk, subflow) {
+ ssk = mptcp_subflow_tcp_sock(subflow);
+ if (ssk) {
+ lock_sock_nested(ssk, SINGLE_DEPTH_NESTING);
+ __ip_sock_set_tos(ssk, val);
+ release_sock(ssk);
+ }
+ }
+ release_sock(sk);
+}
+
+void mptcp_sock_set_tos(struct sock *sk)
+{
+ struct mptcp_sock *msk = mptcp_sk(sk);
+ int val = 0;
+
+ lock_sock(sk);
+ if (msk->first)
+ val = inet_sk(msk->first)->rcv_tos;
+ release_sock(sk);
+
+ if (val > 0)
+ __mptcp_sock_set_tos(sk, val);
+}
+EXPORT_SYMBOL(mptcp_sock_set_tos);
--
2.53.0
* [PATCH 03/11] nvmet-tcp: define listen socket ops
From: Geliang Tang @ 2026-05-28 3:10 UTC (permalink / raw)
To: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
Chaitanya Kulkarni, Matthieu Baerts, Mat Martineau, Geliang Tang,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Shuah Khan
Cc: Geliang Tang, linux-nvme, netdev, mptcp, linux-kselftest,
Hannes Reinecke, John Meneghini, Randy Jennings, Nilay Shroff,
zhenwei pi, Hui Zhu, Gang Yan
From: Geliang Tang <tanggeliang@kylinos.cn>
To support MPTCP on the target side, the listen socket needs to pass
IPPROTO_MPTCP to sock_create() for MPTCP ports, and use MPTCP-specific
setsockopt functions.
This patch adds struct nvmet_tcp_proto_ops to hold listen socket protocol
operations (protocol, set_reuseaddr, set_nodelay, set_priority). A TCP
version is defined and used for TCP ports.
v2:
- use trtype instead of tsas (Hannes).
v3:
- check mptcp protocol from disc_addr.trtype instead of passing a
parameter (Hannes).
v4:
- check CONFIG_MPTCP.
v5:
- define nvmet_tcp_proto struct.
- add a pointer to this structure in nvmet_tcp_port.
v6:
- split nvmet_tcp_proto struct into two structs, nvmet_tcp_proto and
nvmet_tcp_proto_ops, one for the accept socket, the other for the listen
socket.
- add a pointer to nvmet_tcp_proto struct in nvmet_tcp_queue.
Cc: Hannes Reinecke <hare@suse.de>
Cc: John Meneghini <jmeneghi@redhat.com>
Cc: Randy Jennings <randyj@purestorage.com>
Cc: Nilay Shroff <nilay@linux.ibm.com>
Co-developed-by: zhenwei pi <zhenwei.pi@linux.dev>
Signed-off-by: zhenwei pi <zhenwei.pi@linux.dev>
Co-developed-by: Hui Zhu <zhuhui@kylinos.cn>
Signed-off-by: Hui Zhu <zhuhui@kylinos.cn>
Co-developed-by: Gang Yan <yangang@kylinos.cn>
Signed-off-by: Gang Yan <yangang@kylinos.cn>
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
---
drivers/nvme/target/tcp.c | 30 ++++++++++++++++++++++++++----
1 file changed, 26 insertions(+), 4 deletions(-)
diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
index 16f153a9772b..83fe001fc619 100644
--- a/drivers/nvme/target/tcp.c
+++ b/drivers/nvme/target/tcp.c
@@ -2087,8 +2087,23 @@ static void nvmet_tcp_listen_data_ready(struct sock *sk)
read_unlock_bh(&sk->sk_callback_lock);
}
+struct nvmet_tcp_proto_ops {
+ int protocol;
+ void (*set_reuseaddr)(struct sock *sk);
+ void (*set_nodelay)(struct sock *sk);
+ void (*set_priority)(struct sock *sk, u32 priority);
+};
+
+static const struct nvmet_tcp_proto_ops nvmet_tcp_proto_ops = {
+ .protocol = IPPROTO_TCP,
+ .set_reuseaddr = sock_set_reuseaddr,
+ .set_nodelay = tcp_sock_set_nodelay,
+ .set_priority = sock_set_priority,
+};
+
static int nvmet_tcp_add_port(struct nvmet_port *nport)
{
+ const struct nvmet_tcp_proto_ops *ops;
struct nvmet_tcp_port *port;
__kernel_sa_family_t af;
int ret;
@@ -2111,6 +2126,13 @@ static int nvmet_tcp_add_port(struct nvmet_port *nport)
goto err_port;
}
+ if (nport->disc_addr.trtype == NVMF_TRTYPE_TCP) {
+ ops = &nvmet_tcp_proto_ops;
+ } else {
+ ret = -EINVAL;
+ goto err_port;
+ }
+
ret = inet_pton_with_scope(&init_net, af, nport->disc_addr.traddr,
nport->disc_addr.trsvcid, &port->addr);
if (ret) {
@@ -2125,7 +2147,7 @@ static int nvmet_tcp_add_port(struct nvmet_port *nport)
port->nport->inline_data_size = NVMET_TCP_DEF_INLINE_DATA_SIZE;
ret = sock_create(port->addr.ss_family, SOCK_STREAM,
- IPPROTO_TCP, &port->sock);
+ ops->protocol, &port->sock);
if (ret) {
pr_err("failed to create a socket\n");
goto err_port;
@@ -2134,10 +2156,10 @@ static int nvmet_tcp_add_port(struct nvmet_port *nport)
port->sock->sk->sk_user_data = port;
port->data_ready = port->sock->sk->sk_data_ready;
port->sock->sk->sk_data_ready = nvmet_tcp_listen_data_ready;
- sock_set_reuseaddr(port->sock->sk);
- tcp_sock_set_nodelay(port->sock->sk);
+ ops->set_reuseaddr(port->sock->sk);
+ ops->set_nodelay(port->sock->sk);
if (so_priority > 0)
- sock_set_priority(port->sock->sk, so_priority);
+ ops->set_priority(port->sock->sk, so_priority);
ret = kernel_bind(port->sock, (struct sockaddr_unsized *)&port->addr,
sizeof(port->addr));
--
2.53.0
* [PATCH 04/11] nvmet-tcp: register target mptcp transport
From: Geliang Tang @ 2026-05-28 3:10 UTC (permalink / raw)
To: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
Chaitanya Kulkarni, Matthieu Baerts, Mat Martineau, Geliang Tang,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Shuah Khan
Cc: Geliang Tang, linux-nvme, netdev, mptcp, linux-kselftest,
Hannes Reinecke, John Meneghini, Randy Jennings, Nilay Shroff,
zhenwei pi, Hui Zhu, Gang Yan
From: Geliang Tang <tanggeliang@kylinos.cn>
This patch adds a new NVMe target transport type, NVMF_TRTYPE_MPTCP, and
defines a matching nvmet_fabrics_ops, nvmet_mptcp_ops, which mirrors
nvmet_tcp_ops except for .type. It is registered in nvmet_tcp_init() and
unregistered in nvmet_tcp_exit(). A failure to register the MPTCP
transport is not fatal: the nvmet_mptcp_registered flag records whether
it has to be unregistered on exit.
A MODULE_ALIAS for "nvmet-transport-4" is also added.
Note: NVMF_TRTYPE_MPTCP is temporarily assigned the value 4, which is
currently reserved in the NVMe over Fabrics specification. During the
"NVMe over MPTCP" discussion at the LSF/MM/BPF 2026 conference, it was
concluded that MPTCP should be treated as a new transport type, rather
than a TCP variant. A request will be submitted to the NVMe working
group to officially allocate this value for MPTCP.
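How the provisional value ties the configfs name and the module alias together can be sketched in user space. This is a hedged illustration: the toy_* names are hypothetical, the table lists only the entries whose values appear in this patch, and the real lookup lives in drivers/nvme/target/configfs.c.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Values as they appear in the patch; 4 is the provisional MPTCP value. */
enum toy_trtype {
	TOY_TRTYPE_RDMA		= 1,
	TOY_TRTYPE_FC		= 2,
	TOY_TRTYPE_TCP		= 3,
	TOY_TRTYPE_MPTCP	= 4,
	TOY_TRTYPE_LOOP		= 254,
};

struct toy_type_name {
	int type;
	const char *name;
};

/* Hypothetical analogue of the nvmet_transport[] name map. */
static const struct toy_type_name toy_transport[] = {
	{ TOY_TRTYPE_RDMA,	"rdma" },
	{ TOY_TRTYPE_FC,	"fc" },
	{ TOY_TRTYPE_TCP,	"tcp" },
	{ TOY_TRTYPE_MPTCP,	"mptcp" },
	{ TOY_TRTYPE_LOOP,	"loop" },
};

/* Translate a transport name into its numeric type, or -1 if unknown. */
static int toy_trtype_from_name(const char *name)
{
	size_t i;

	for (i = 0; i < sizeof(toy_transport) / sizeof(toy_transport[0]); i++)
		if (!strcmp(toy_transport[i].name, name))
			return toy_transport[i].type;
	return -1;
}

/* Build the alias string that the MODULE_ALIAS line encodes for a type. */
static int toy_module_alias(int type, char *buf, size_t len)
{
	return snprintf(buf, len, "nvmet-transport-%d", type);
}
```

If the NVMe working group allocates a different value, only the enum entry and the alias comment would need to change in a model like this one.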
Cc: Hannes Reinecke <hare@suse.de>
Cc: John Meneghini <jmeneghi@redhat.com>
Cc: Randy Jennings <randyj@purestorage.com>
Cc: Nilay Shroff <nilay@linux.ibm.com>
Co-developed-by: zhenwei pi <zhenwei.pi@linux.dev>
Signed-off-by: zhenwei pi <zhenwei.pi@linux.dev>
Co-developed-by: Hui Zhu <zhuhui@kylinos.cn>
Signed-off-by: Hui Zhu <zhuhui@kylinos.cn>
Co-developed-by: Gang Yan <yangang@kylinos.cn>
Signed-off-by: Gang Yan <yangang@kylinos.cn>
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
---
drivers/nvme/target/configfs.c | 1 +
drivers/nvme/target/tcp.c | 29 +++++++++++++++++++++++++++++
include/linux/nvme.h | 1 +
3 files changed, 31 insertions(+)
diff --git a/drivers/nvme/target/configfs.c b/drivers/nvme/target/configfs.c
index b88f897f06e2..51fc0f4d0c32 100644
--- a/drivers/nvme/target/configfs.c
+++ b/drivers/nvme/target/configfs.c
@@ -37,6 +37,7 @@ static struct nvmet_type_name_map nvmet_transport[] = {
{ NVMF_TRTYPE_RDMA, "rdma" },
{ NVMF_TRTYPE_FC, "fc" },
{ NVMF_TRTYPE_TCP, "tcp" },
+ { NVMF_TRTYPE_MPTCP, "mptcp" },
{ NVMF_TRTYPE_PCI, "pci" },
{ NVMF_TRTYPE_LOOP, "loop" },
};
diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
index 83fe001fc619..e2f3de364c2b 100644
--- a/drivers/nvme/target/tcp.c
+++ b/drivers/nvme/target/tcp.c
@@ -2299,6 +2299,23 @@ static const struct nvmet_fabrics_ops nvmet_tcp_ops = {
.host_traddr = nvmet_tcp_host_port_addr,
};
+#ifdef CONFIG_MPTCP
+static bool nvmet_mptcp_registered;
+
+static const struct nvmet_fabrics_ops nvmet_mptcp_ops = {
+ .owner = THIS_MODULE,
+ .type = NVMF_TRTYPE_MPTCP,
+ .msdbd = 1,
+ .add_port = nvmet_tcp_add_port,
+ .remove_port = nvmet_tcp_remove_port,
+ .queue_response = nvmet_tcp_queue_response,
+ .delete_ctrl = nvmet_tcp_delete_ctrl,
+ .install_queue = nvmet_tcp_install_queue,
+ .disc_traddr = nvmet_tcp_disc_port_addr,
+ .host_traddr = nvmet_tcp_host_port_addr,
+};
+#endif
+
static int __init nvmet_tcp_init(void)
{
int ret;
@@ -2312,6 +2329,11 @@ static int __init nvmet_tcp_init(void)
if (ret)
goto err;
+#ifdef CONFIG_MPTCP
+ if (!nvmet_register_transport(&nvmet_mptcp_ops))
+ nvmet_mptcp_registered = true;
+#endif
+
return 0;
err:
destroy_workqueue(nvmet_tcp_wq);
@@ -2322,6 +2344,10 @@ static void __exit nvmet_tcp_exit(void)
{
struct nvmet_tcp_queue *queue;
+#ifdef CONFIG_MPTCP
+ if (nvmet_mptcp_registered)
+ nvmet_unregister_transport(&nvmet_mptcp_ops);
+#endif
nvmet_unregister_transport(&nvmet_tcp_ops);
flush_workqueue(nvmet_wq);
@@ -2341,3 +2367,6 @@ module_exit(nvmet_tcp_exit);
MODULE_DESCRIPTION("NVMe target TCP transport driver");
MODULE_LICENSE("GPL v2");
MODULE_ALIAS("nvmet-transport-3"); /* 3 == NVMF_TRTYPE_TCP */
+#ifdef CONFIG_MPTCP
+MODULE_ALIAS("nvmet-transport-4"); /* 4 == NVMF_TRTYPE_MPTCP */
+#endif
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index 041f30931a90..0eada1e0c652 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -68,6 +68,7 @@ enum {
NVMF_TRTYPE_RDMA = 1, /* RDMA */
NVMF_TRTYPE_FC = 2, /* Fibre Channel */
NVMF_TRTYPE_TCP = 3, /* TCP/IP */
+ NVMF_TRTYPE_MPTCP = 4, /* Multipath TCP */
NVMF_TRTYPE_LOOP = 254, /* Reserved for host usage */
NVMF_TRTYPE_MAX,
};
--
2.53.0
* [PATCH 05/11] nvmet-tcp: implement mptcp listen socket ops
From: Geliang Tang @ 2026-05-28 3:10 UTC (permalink / raw)
To: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
Chaitanya Kulkarni, Matthieu Baerts, Mat Martineau, Geliang Tang,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Shuah Khan
Cc: Geliang Tang, linux-nvme, netdev, mptcp, linux-kselftest,
Hannes Reinecke, John Meneghini, Randy Jennings, Nilay Shroff,
zhenwei pi, Hui Zhu, Gang Yan
From: Geliang Tang <tanggeliang@kylinos.cn>
An MPTCP-specific version of struct nvmet_tcp_proto_ops is implemented
for listen sockets. It is selected in nvmet_tcp_add_port() when the
transport type is MPTCP.
Dedicated MPTCP helpers are introduced for setting listen socket options.
The set_nodelay and set_priority helpers set the values on all existing
subflows using mptcp_for_each_subflow(). The set_reuseaddr helper only
applies to the first subflow. Subflows created later inherit the values
through sync_socket_options(), which now also copies sk_reuse.
Cc: Hannes Reinecke <hare@suse.de>
Cc: John Meneghini <jmeneghi@redhat.com>
Cc: Randy Jennings <randyj@purestorage.com>
Cc: Nilay Shroff <nilay@linux.ibm.com>
Co-developed-by: zhenwei pi <zhenwei.pi@linux.dev>
Signed-off-by: zhenwei pi <zhenwei.pi@linux.dev>
Co-developed-by: Hui Zhu <zhuhui@kylinos.cn>
Signed-off-by: Hui Zhu <zhuhui@kylinos.cn>
Co-developed-by: Gang Yan <yangang@kylinos.cn>
Signed-off-by: Gang Yan <yangang@kylinos.cn>
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
---
drivers/nvme/target/tcp.c | 13 ++++++++++++
include/net/mptcp.h | 8 ++++++++
net/mptcp/sockopt.c | 42 +++++++++++++++++++++++++++++++++++++++
3 files changed, 63 insertions(+)
diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
index e2f3de364c2b..8c2dc4bcbcd3 100644
--- a/drivers/nvme/target/tcp.c
+++ b/drivers/nvme/target/tcp.c
@@ -2101,6 +2101,15 @@ static const struct nvmet_tcp_proto_ops nvmet_tcp_proto_ops = {
.set_priority = sock_set_priority,
};
+#ifdef CONFIG_MPTCP
+static const struct nvmet_tcp_proto_ops nvmet_mptcp_proto_ops = {
+ .protocol = IPPROTO_MPTCP,
+ .set_reuseaddr = mptcp_sock_set_reuseaddr,
+ .set_nodelay = mptcp_sock_set_nodelay,
+ .set_priority = mptcp_sock_set_priority,
+};
+#endif
+
static int nvmet_tcp_add_port(struct nvmet_port *nport)
{
const struct nvmet_tcp_proto_ops *ops;
@@ -2128,6 +2137,10 @@ static int nvmet_tcp_add_port(struct nvmet_port *nport)
if (nport->disc_addr.trtype == NVMF_TRTYPE_TCP) {
ops = &nvmet_tcp_proto_ops;
+#ifdef CONFIG_MPTCP
+ } else if (nport->disc_addr.trtype == NVMF_TRTYPE_MPTCP) {
+ ops = &nvmet_mptcp_proto_ops;
+#endif
} else {
ret = -EINVAL;
goto err_port;
diff --git a/include/net/mptcp.h b/include/net/mptcp.h
index bf74dedc578d..b8ab214a7890 100644
--- a/include/net/mptcp.h
+++ b/include/net/mptcp.h
@@ -239,6 +239,10 @@ void mptcp_sock_no_linger(struct sock *sk);
void mptcp_sock_set_priority(struct sock *sk, u32 priority);
void mptcp_sock_set_tos(struct sock *sk);
+
+void mptcp_sock_set_reuseaddr(struct sock *sk);
+
+void mptcp_sock_set_nodelay(struct sock *sk);
#else
static inline void mptcp_init(void)
@@ -331,6 +335,10 @@ static inline void mptcp_sock_no_linger(struct sock *sk) { }
static inline void mptcp_sock_set_priority(struct sock *sk, u32 priority) { }
static inline void mptcp_sock_set_tos(struct sock *sk) { }
+
+static inline void mptcp_sock_set_reuseaddr(struct sock *sk) { }
+
+static inline void mptcp_sock_set_nodelay(struct sock *sk) { }
#endif /* CONFIG_MPTCP */
#if IS_ENABLED(CONFIG_MPTCP_IPV6)
diff --git a/net/mptcp/sockopt.c b/net/mptcp/sockopt.c
index 359b1eb2d0a9..0adbbe568f6e 100644
--- a/net/mptcp/sockopt.c
+++ b/net/mptcp/sockopt.c
@@ -1596,6 +1596,8 @@ static void sync_socket_options(struct mptcp_sock *msk, struct sock *ssk)
inet_assign_bit(FREEBIND, ssk, inet_test_bit(FREEBIND, sk));
inet_assign_bit(BIND_ADDRESS_NO_PORT, ssk, inet_test_bit(BIND_ADDRESS_NO_PORT, sk));
WRITE_ONCE(inet_sk(ssk)->local_port_range, READ_ONCE(inet_sk(sk)->local_port_range));
+
+ ssk->sk_reuse = sk->sk_reuse;
}
void mptcp_sockopt_sync_locked(struct mptcp_sock *msk, struct sock *ssk)
@@ -1741,3 +1743,43 @@ void mptcp_sock_set_tos(struct sock *sk)
__mptcp_sock_set_tos(sk, val);
}
EXPORT_SYMBOL(mptcp_sock_set_tos);
+
+void mptcp_sock_set_reuseaddr(struct sock *sk)
+{
+ struct mptcp_sock *msk = mptcp_sk(sk);
+ struct sock *ssk;
+
+ lock_sock(sk);
+ sockopt_seq_inc(msk);
+ sk->sk_reuse = SK_CAN_REUSE;
+ ssk = __mptcp_nmpc_sk(msk);
+ if (IS_ERR(ssk))
+ goto unlock;
+ lock_sock_nested(ssk, SINGLE_DEPTH_NESTING);
+ ssk->sk_reuse = SK_CAN_REUSE;
+ release_sock(ssk);
+unlock:
+ release_sock(sk);
+}
+EXPORT_SYMBOL(mptcp_sock_set_reuseaddr);
+
+void mptcp_sock_set_nodelay(struct sock *sk)
+{
+ struct mptcp_sock *msk = mptcp_sk(sk);
+ struct mptcp_subflow_context *subflow;
+ struct sock *ssk;
+
+ lock_sock(sk);
+ sockopt_seq_inc(msk);
+ msk->nodelay = true;
+ mptcp_for_each_subflow(msk, subflow) {
+ ssk = mptcp_subflow_tcp_sock(subflow);
+ if (ssk) {
+ lock_sock_nested(ssk, SINGLE_DEPTH_NESTING);
+ __tcp_sock_set_nodelay(ssk, true);
+ release_sock(ssk);
+ }
+ }
+ release_sock(sk);
+}
+EXPORT_SYMBOL(mptcp_sock_set_nodelay);
--
2.53.0
* [PATCH 06/11] nvme-fabrics: compare transport in ip_options_match
From: Geliang Tang @ 2026-05-28 3:10 UTC (permalink / raw)
To: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
Chaitanya Kulkarni, Matthieu Baerts, Mat Martineau, Geliang Tang,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Shuah Khan
Cc: Geliang Tang, linux-nvme, netdev, mptcp, linux-kselftest,
Hannes Reinecke, John Meneghini, Randy Jennings, Nilay Shroff,
zhenwei pi, Hui Zhu, Gang Yan
From: Geliang Tang <tanggeliang@kylinos.cn>
When checking for an existing controller, nvmf_ip_options_match() does
not compare the transport type. This can cause a TCP connection request
to incorrectly match an existing MPTCP controller, or an MPTCP connection
request to match an existing TCP controller, resulting in a false
-EALREADY error.
Fix this by adding strcmp(opts->transport, ctrl->opts->transport) to the
matching condition.
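The effect of the extra comparison can be shown with a reduced user-space model. It is a sketch under assumptions: struct toy_opts and toy_ip_options_match() are hypothetical, and the base-option check and the remaining checks of the real nvmf_ip_options_match() are omitted.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical subset of struct nvmf_ctrl_options. */
struct toy_opts {
	const char *transport;
	const char *traddr;
	const char *trsvcid;
};

/* Two option sets describe the same controller only if all three agree. */
static bool toy_ip_options_match(const struct toy_opts *existing,
				 const struct toy_opts *requested)
{
	if (strcmp(requested->transport, existing->transport) ||
	    strcmp(requested->traddr, existing->traddr) ||
	    strcmp(requested->trsvcid, existing->trsvcid))
		return false;
	return true;
}
```

Without the transport comparison, a "tcp" request and an "mptcp" request to the same address and port would be treated as duplicates of each other.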
Cc: Hannes Reinecke <hare@suse.de>
Cc: John Meneghini <jmeneghi@redhat.com>
Cc: Randy Jennings <randyj@purestorage.com>
Cc: Nilay Shroff <nilay@linux.ibm.com>
Co-developed-by: zhenwei pi <zhenwei.pi@linux.dev>
Signed-off-by: zhenwei pi <zhenwei.pi@linux.dev>
Co-developed-by: Hui Zhu <zhuhui@kylinos.cn>
Signed-off-by: Hui Zhu <zhuhui@kylinos.cn>
Co-developed-by: Gang Yan <yangang@kylinos.cn>
Signed-off-by: Gang Yan <yangang@kylinos.cn>
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
---
drivers/nvme/host/fabrics.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
index ac3d4f400601..e086e61e8f94 100644
--- a/drivers/nvme/host/fabrics.c
+++ b/drivers/nvme/host/fabrics.c
@@ -1220,6 +1220,7 @@ bool nvmf_ip_options_match(struct nvme_ctrl *ctrl,
struct nvmf_ctrl_options *opts)
{
if (!nvmf_ctlr_matches_baseopts(ctrl, opts) ||
+ strcmp(opts->transport, ctrl->opts->transport) ||
strcmp(opts->traddr, ctrl->opts->traddr) ||
strcmp(opts->trsvcid, ctrl->opts->trsvcid))
return false;
--
2.53.0
* [PATCH 07/11] nvme-tcp: define host tcp_proto struct
From: Geliang Tang @ 2026-05-28 3:10 UTC (permalink / raw)
To: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
Chaitanya Kulkarni, Matthieu Baerts, Mat Martineau, Geliang Tang,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Shuah Khan
Cc: Geliang Tang, linux-nvme, netdev, mptcp, linux-kselftest,
Hannes Reinecke, John Meneghini, Randy Jennings, Nilay Shroff,
zhenwei pi, Hui Zhu, Gang Yan
From: Geliang Tang <tanggeliang@kylinos.cn>
To add MPTCP support in "NVMe over TCP", the host side needs to pass
IPPROTO_MPTCP to sock_create_kern() instead of IPPROTO_TCP to create an
MPTCP socket.
Similar to the target-side nvmet_tcp_proto, this patch defines the
host-side nvme_tcp_proto structure, which contains the protocol of the
socket and a set of function pointers for socket operations. The only
difference is that it defines .set_syncnt instead of .set_reuseaddr.
A TCP-specific version of this structure is defined, and a proto field is
added to nvme_tcp_ctrl. When the transport string is "tcp", it is assigned
to ctrl->proto.
All locations that previously called TCP setsockopt functions are updated
to call the corresponding function pointers in the nvme_tcp_proto
structure. The controller's proto pointer is set during initialization and
remains valid throughout the controller's lifetime.
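The indirection described above can be sketched in user space as follows.
This is an illustration only: sock_sketch, proto_sketch and
proto_lookup_sketch() are hypothetical names, and the table is reduced to
a protocol number and a single helper.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a socket; it only tracks one option. */
struct sock_sketch {
	int nodelay;
};

/* Reduced analogue of nvme_tcp_proto: a protocol number plus one helper. */
struct proto_sketch {
	int protocol;
	void (*set_nodelay)(struct sock_sketch *sk);
};

static void tcp_set_nodelay_sketch(struct sock_sketch *sk)
{
	sk->nodelay = 1;
}

static const struct proto_sketch tcp_proto_sketch = {
	.protocol	= 6,	/* numeric value of IPPROTO_TCP */
	.set_nodelay	= tcp_set_nodelay_sketch,
};

/* Map a transport string to its table; NULL means "unsupported". */
static const struct proto_sketch *proto_lookup_sketch(const char *transport)
{
	assert(transport);

	if (!strcmp(transport, "tcp"))
		return &tcp_proto_sketch;
	return NULL;
}
```

Callers then go through the table (p->set_nodelay(&sk)) instead of naming
the TCP helper directly, so a second table can be added later without
touching the call sites.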
v2:
- use 'trtype' instead of '--mptcp' (Hannes)
v3:
- check mptcp protocol from opts->transport instead of passing a
parameter (Hannes).
v4:
- check CONFIG_MPTCP.
v5:
- define nvme_tcp_proto struct.
- add a pointer to this structure in nvme_tcp_ctrl.
Cc: Hannes Reinecke <hare@suse.de>
Cc: John Meneghini <jmeneghi@redhat.com>
Cc: Randy Jennings <randyj@purestorage.com>
Cc: Nilay Shroff <nilay@linux.ibm.com>
Co-developed-by: zhenwei pi <zhenwei.pi@linux.dev>
Signed-off-by: zhenwei pi <zhenwei.pi@linux.dev>
Co-developed-by: Hui Zhu <zhuhui@kylinos.cn>
Signed-off-by: Hui Zhu <zhuhui@kylinos.cn>
Co-developed-by: Gang Yan <yangang@kylinos.cn>
Signed-off-by: Gang Yan <yangang@kylinos.cn>
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
---
drivers/nvme/host/tcp.c | 44 ++++++++++++++++++++++++++++++++++-------
1 file changed, 37 insertions(+), 7 deletions(-)
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 15d36d6a728e..13a5240623ef 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -182,6 +182,16 @@ struct nvme_tcp_queue {
void (*write_space)(struct sock *);
};
+struct nvme_tcp_proto {
+ int protocol;
+ int (*set_syncnt)(struct sock *sk, int val);
+ void (*set_nodelay)(struct sock *sk);
+ void (*no_linger)(struct sock *sk);
+ void (*set_priority)(struct sock *sk, u32 priority);
+ void (*set_tos)(struct sock *sk, int val);
+ const struct nvme_ctrl_ops *ops;
+};
+
struct nvme_tcp_ctrl {
/* read only in the hot path */
struct nvme_tcp_queue *queues;
@@ -198,6 +208,8 @@ struct nvme_tcp_ctrl {
struct delayed_work connect_work;
struct nvme_tcp_request async_req;
u32 io_queues[HCTX_MAX_TYPES];
+
+ const struct nvme_tcp_proto *proto;
};
static LIST_HEAD(nvme_tcp_ctrl_list);
@@ -1799,7 +1811,7 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl, int qid,
ret = sock_create_kern(current->nsproxy->net_ns,
ctrl->addr.ss_family, SOCK_STREAM,
- IPPROTO_TCP, &queue->sock);
+ ctrl->proto->protocol, &queue->sock);
if (ret) {
dev_err(nctrl->device,
"failed to create socket: %d\n", ret);
@@ -1816,24 +1828,24 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl, int qid,
nvme_tcp_reclassify_socket(queue->sock);
/* Single syn retry */
- tcp_sock_set_syncnt(queue->sock->sk, 1);
+ ctrl->proto->set_syncnt(queue->sock->sk, 1);
/* Set TCP no delay */
- tcp_sock_set_nodelay(queue->sock->sk);
+ ctrl->proto->set_nodelay(queue->sock->sk);
/*
* Cleanup whatever is sitting in the TCP transmit queue on socket
* close. This is done to prevent stale data from being sent should
* the network connection be restored before TCP times out.
*/
- sock_no_linger(queue->sock->sk);
+ ctrl->proto->no_linger(queue->sock->sk);
if (so_priority > 0)
- sock_set_priority(queue->sock->sk, so_priority);
+ ctrl->proto->set_priority(queue->sock->sk, so_priority);
/* Set socket type of service */
if (nctrl->opts->tos >= 0)
- ip_sock_set_tos(queue->sock->sk, nctrl->opts->tos);
+ ctrl->proto->set_tos(queue->sock->sk, nctrl->opts->tos);
/* Set 10 seconds timeout for icresp recvmsg */
queue->sock->sk->sk_rcvtimeo = 10 * HZ;
@@ -2900,6 +2912,17 @@ nvme_tcp_existing_controller(struct nvmf_ctrl_options *opts)
return found;
}
+static const struct nvme_tcp_proto nvme_tcp_proto = {
+ .protocol = IPPROTO_TCP,
+ .set_syncnt = tcp_sock_set_syncnt,
+ .set_nodelay = tcp_sock_set_nodelay,
+ .no_linger = sock_no_linger,
+ .set_priority = sock_set_priority,
+ .set_tos = ip_sock_set_tos,
+ .ops = &nvme_tcp_ctrl_ops,
+
+};
+
static struct nvme_tcp_ctrl *nvme_tcp_alloc_ctrl(struct device *dev,
struct nvmf_ctrl_options *opts)
{
@@ -2964,13 +2987,20 @@ static struct nvme_tcp_ctrl *nvme_tcp_alloc_ctrl(struct device *dev,
goto out_free_ctrl;
}
+ if (!strcmp(ctrl->ctrl.opts->transport, "tcp")) {
+ ctrl->proto = &nvme_tcp_proto;
+ } else {
+ ret = -EINVAL;
+ goto out_free_ctrl;
+ }
+
ctrl->queues = kzalloc_objs(*ctrl->queues, ctrl->ctrl.queue_count);
if (!ctrl->queues) {
ret = -ENOMEM;
goto out_free_ctrl;
}
- ret = nvme_init_ctrl(&ctrl->ctrl, dev, &nvme_tcp_ctrl_ops, 0);
+ ret = nvme_init_ctrl(&ctrl->ctrl, dev, ctrl->proto->ops, 0);
if (ret)
goto out_kfree_queues;
--
2.53.0
* [PATCH 08/11] nvme-tcp: register host mptcp transport
From: Geliang Tang @ 2026-05-28 3:10 UTC (permalink / raw)
To: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
Chaitanya Kulkarni, Matthieu Baerts, Mat Martineau, Geliang Tang,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Shuah Khan
Cc: Geliang Tang, linux-nvme, netdev, mptcp, linux-kselftest,
Hannes Reinecke, John Meneghini, Randy Jennings, Nilay Shroff,
zhenwei pi, Hui Zhu, Gang Yan
From: Geliang Tang <tanggeliang@kylinos.cn>
This patch defines a new nvmf_transport_ops named nvme_mptcp_transport,
which is almost the same as nvme_tcp_transport except .name and
.allowed_opts.
MPTCP currently does not support TLS. The four TLS-related options
(NVMF_OPT_TLS, NVMF_OPT_KEYRING, NVMF_OPT_TLS_KEY, and NVMF_OPT_CONCAT)
have been removed from allowed_opts. They will be added back once MPTCP
TLS is supported.
It is registered in nvme_tcp_init_module() and unregistered in
nvme_tcp_cleanup_module().
A MODULE_ALIAS("nvme-mptcp") declaration is added at the end of the file.
Cc: Hannes Reinecke <hare@suse.de>
Cc: John Meneghini <jmeneghi@redhat.com>
Cc: Randy Jennings <randyj@purestorage.com>
Cc: Nilay Shroff <nilay@linux.ibm.com>
Co-developed-by: zhenwei pi <zhenwei.pi@linux.dev>
Signed-off-by: zhenwei pi <zhenwei.pi@linux.dev>
Co-developed-by: Hui Zhu <zhuhui@kylinos.cn>
Signed-off-by: Hui Zhu <zhuhui@kylinos.cn>
Co-developed-by: Gang Yan <yangang@kylinos.cn>
Signed-off-by: Gang Yan <yangang@kylinos.cn>
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
---
drivers/nvme/host/tcp.c | 23 +++++++++++++++++++++++
1 file changed, 23 insertions(+)
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 13a5240623ef..305624d59c50 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -3067,6 +3067,20 @@ static struct nvmf_transport_ops nvme_tcp_transport = {
.create_ctrl = nvme_tcp_create_ctrl,
};
+#ifdef CONFIG_MPTCP
+static struct nvmf_transport_ops nvme_mptcp_transport = {
+ .name = "mptcp",
+ .module = THIS_MODULE,
+ .required_opts = NVMF_OPT_TRADDR,
+ .allowed_opts = NVMF_OPT_TRSVCID | NVMF_OPT_RECONNECT_DELAY |
+ NVMF_OPT_HOST_TRADDR | NVMF_OPT_CTRL_LOSS_TMO |
+ NVMF_OPT_HDR_DIGEST | NVMF_OPT_DATA_DIGEST |
+ NVMF_OPT_NR_WRITE_QUEUES | NVMF_OPT_NR_POLL_QUEUES |
+ NVMF_OPT_TOS | NVMF_OPT_HOST_IFACE,
+ .create_ctrl = nvme_tcp_create_ctrl,
+};
+#endif
+
static int __init nvme_tcp_init_module(void)
{
unsigned int wq_flags = WQ_MEM_RECLAIM | WQ_HIGHPRI | WQ_SYSFS;
@@ -3092,6 +3106,9 @@ static int __init nvme_tcp_init_module(void)
atomic_set(&nvme_tcp_cpu_queues[cpu], 0);
nvmf_register_transport(&nvme_tcp_transport);
+#ifdef CONFIG_MPTCP
+ nvmf_register_transport(&nvme_mptcp_transport);
+#endif
return 0;
}
@@ -3099,6 +3116,9 @@ static void __exit nvme_tcp_cleanup_module(void)
{
struct nvme_tcp_ctrl *ctrl;
+#ifdef CONFIG_MPTCP
+ nvmf_unregister_transport(&nvme_mptcp_transport);
+#endif
nvmf_unregister_transport(&nvme_tcp_transport);
mutex_lock(&nvme_tcp_ctrl_mutex);
@@ -3116,3 +3136,6 @@ module_exit(nvme_tcp_cleanup_module);
MODULE_DESCRIPTION("NVMe host TCP transport driver");
MODULE_LICENSE("GPL v2");
MODULE_ALIAS("nvme-tcp");
+#ifdef CONFIG_MPTCP
+MODULE_ALIAS("nvme-mptcp");
+#endif
--
2.53.0
* [PATCH 09/11] nvme-tcp: implement host mptcp proto
From: Geliang Tang @ 2026-05-28 3:10 UTC (permalink / raw)
To: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
Chaitanya Kulkarni, Matthieu Baerts, Mat Martineau, Geliang Tang,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Shuah Khan
Cc: Geliang Tang, linux-nvme, netdev, mptcp, linux-kselftest,
Hannes Reinecke, John Meneghini, Randy Jennings, Nilay Shroff,
zhenwei pi, Hui Zhu, Gang Yan
From: Geliang Tang <tanggeliang@kylinos.cn>
An MPTCP-specific version of struct nvme_tcp_proto is implemented,
and it is assigned to ctrl->proto when the transport string is "mptcp".
The socket option setting logic is similar to the target side, except that
mptcp_sock_set_syncnt is newly defined for the host side.
These helpers set the values on all existing subflows of an MPTCP
connection, except for set_reuseaddr which only applies to the first
subflow. The values are then synchronized to other newly created
subflows in sync_socket_options().
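The "set on existing subflows, then sync to later ones" behaviour can be
modelled in user space as below. This is a sketch under stated
assumptions, not the kernel implementation: msk_sketch and both functions
are hypothetical, and the upper bound of 127 is an assumed stand-in for
MAX_TCP_SYNCNT.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MAX_SUBFLOWS_SKETCH	8
#define MAX_SYNCNT_SKETCH	127	/* assumed stand-in for MAX_TCP_SYNCNT */

struct msk_sketch {
	int syn_retries;	/* value remembered on the parent socket */
	int subflow_syn_retries[MAX_SUBFLOWS_SKETCH];
	size_t nr_subflows;
};

/* Record val on the parent and apply it to every existing subflow. */
static int msk_set_syncnt_sketch(struct msk_sketch *msk, int val)
{
	size_t i;

	assert(msk);
	if (val < 1 || val > MAX_SYNCNT_SKETCH)
		return -EINVAL;

	msk->syn_retries = val;
	for (i = 0; i < msk->nr_subflows; i++)
		msk->subflow_syn_retries[i] = val;
	return 0;
}

/* A subflow created later inherits the remembered value, if one was set. */
static int msk_add_subflow_sketch(struct msk_sketch *msk)
{
	size_t idx;

	assert(msk);
	if (msk->nr_subflows == MAX_SUBFLOWS_SKETCH)
		return -ENOSPC;

	idx = msk->nr_subflows++;
	msk->subflow_syn_retries[idx] = 0;
	if (msk->syn_retries > 0)
		msk->subflow_syn_retries[idx] = msk->syn_retries;
	return (int)idx;
}
```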
A separate nvme_mptcp_ctrl_ops structure with .name = "mptcp" is defined
and used for MPTCP controllers.
"mptcp" is planned to be introduced as a new NVMe transport type into the
NVMe Base Specification in the future.
The Discovery Log output does not yet recognize trtype=4 (MPTCP) and
shows "trtype: unrecognized" for such entries:
=====Discovery Log Entry 0======
trtype: unrecognized
adrfam: ipv4
subtype: current discovery subsystem
treq: not specified, sq flow control disable supported
portid: 23106
trsvcid: 23601
subnqn: nqn.2014-08.org.nvmexpress.discovery
traddr: 10.1.1.1
eflags: none
Cc: Hannes Reinecke <hare@suse.de>
Cc: John Meneghini <jmeneghi@redhat.com>
Cc: Randy Jennings <randyj@purestorage.com>
Cc: Nilay Shroff <nilay@linux.ibm.com>
Co-developed-by: zhenwei pi <zhenwei.pi@linux.dev>
Signed-off-by: zhenwei pi <zhenwei.pi@linux.dev>
Co-developed-by: Hui Zhu <zhuhui@kylinos.cn>
Signed-off-by: Hui Zhu <zhuhui@kylinos.cn>
Co-developed-by: Gang Yan <yangang@kylinos.cn>
Signed-off-by: Gang Yan <yangang@kylinos.cn>
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
---
drivers/nvme/host/tcp.c | 34 ++++++++++++++++++++++++++++++++++
include/net/mptcp.h | 11 +++++++++++
net/mptcp/sockopt.c | 30 +++++++++++++++++++++++++++++-
3 files changed, 74 insertions(+), 1 deletion(-)
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 305624d59c50..2388a8c443cc 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -2895,6 +2895,24 @@ static const struct nvme_ctrl_ops nvme_tcp_ctrl_ops = {
.get_virt_boundary = nvmf_get_virt_boundary,
};
+#ifdef CONFIG_MPTCP
+static const struct nvme_ctrl_ops nvme_mptcp_ctrl_ops = {
+ .name = "mptcp",
+ .module = THIS_MODULE,
+ .flags = NVME_F_FABRICS | NVME_F_BLOCKING,
+ .reg_read32 = nvmf_reg_read32,
+ .reg_read64 = nvmf_reg_read64,
+ .reg_write32 = nvmf_reg_write32,
+ .subsystem_reset = nvmf_subsystem_reset,
+ .free_ctrl = nvme_tcp_free_ctrl,
+ .submit_async_event = nvme_tcp_submit_async_event,
+ .delete_ctrl = nvme_tcp_delete_ctrl,
+ .get_address = nvme_tcp_get_address,
+ .stop_ctrl = nvme_tcp_stop_ctrl,
+ .get_virt_boundary = nvmf_get_virt_boundary,
+};
+#endif
+
static bool
nvme_tcp_existing_controller(struct nvmf_ctrl_options *opts)
{
@@ -2923,6 +2941,18 @@ static const struct nvme_tcp_proto nvme_tcp_proto = {
};
+#ifdef CONFIG_MPTCP
+static const struct nvme_tcp_proto nvme_mptcp_proto = {
+ .protocol = IPPROTO_MPTCP,
+ .set_syncnt = mptcp_sock_set_syncnt,
+ .set_nodelay = mptcp_sock_set_nodelay,
+ .no_linger = mptcp_sock_no_linger,
+ .set_priority = mptcp_sock_set_priority,
+ .set_tos = __mptcp_sock_set_tos,
+ .ops = &nvme_mptcp_ctrl_ops,
+};
+#endif
+
static struct nvme_tcp_ctrl *nvme_tcp_alloc_ctrl(struct device *dev,
struct nvmf_ctrl_options *opts)
{
@@ -2989,6 +3019,10 @@ static struct nvme_tcp_ctrl *nvme_tcp_alloc_ctrl(struct device *dev,
if (!strcmp(ctrl->ctrl.opts->transport, "tcp")) {
ctrl->proto = &nvme_tcp_proto;
+#ifdef CONFIG_MPTCP
+ } else if (!strcmp(ctrl->ctrl.opts->transport, "mptcp")) {
+ ctrl->proto = &nvme_mptcp_proto;
+#endif
} else {
ret = -EINVAL;
goto out_free_ctrl;
diff --git a/include/net/mptcp.h b/include/net/mptcp.h
index b8ab214a7890..160267e35b13 100644
--- a/include/net/mptcp.h
+++ b/include/net/mptcp.h
@@ -238,11 +238,15 @@ void mptcp_sock_no_linger(struct sock *sk);
void mptcp_sock_set_priority(struct sock *sk, u32 priority);
+void __mptcp_sock_set_tos(struct sock *sk, int val);
+
void mptcp_sock_set_tos(struct sock *sk);
void mptcp_sock_set_reuseaddr(struct sock *sk);
void mptcp_sock_set_nodelay(struct sock *sk);
+
+int mptcp_sock_set_syncnt(struct sock *sk, int val);
#else
static inline void mptcp_init(void)
@@ -334,11 +338,18 @@ static inline void mptcp_sock_no_linger(struct sock *sk) { }
static inline void mptcp_sock_set_priority(struct sock *sk, u32 priority) { }
+static inline void __mptcp_sock_set_tos(struct sock *sk, int val) { }
+
static inline void mptcp_sock_set_tos(struct sock *sk) { }
static inline void mptcp_sock_set_reuseaddr(struct sock *sk) { }
static inline void mptcp_sock_set_nodelay(struct sock *sk) { }
+
+static inline int mptcp_sock_set_syncnt(struct sock *sk, int val)
+{
+ return 0;
+}
#endif /* CONFIG_MPTCP */
#if IS_ENABLED(CONFIG_MPTCP_IPV6)
diff --git a/net/mptcp/sockopt.c b/net/mptcp/sockopt.c
index 0adbbe568f6e..7857dac62afc 100644
--- a/net/mptcp/sockopt.c
+++ b/net/mptcp/sockopt.c
@@ -1598,6 +1598,8 @@ static void sync_socket_options(struct mptcp_sock *msk, struct sock *ssk)
WRITE_ONCE(inet_sk(ssk)->local_port_range, READ_ONCE(inet_sk(sk)->local_port_range));
ssk->sk_reuse = sk->sk_reuse;
+ if (inet_csk(sk)->icsk_syn_retries > 0)
+ tcp_sock_set_syncnt(ssk, inet_csk(sk)->icsk_syn_retries);
}
void mptcp_sockopt_sync_locked(struct mptcp_sock *msk, struct sock *ssk)
@@ -1709,7 +1711,7 @@ void mptcp_sock_set_priority(struct sock *sk, u32 priority)
}
EXPORT_SYMBOL(mptcp_sock_set_priority);
-static void __mptcp_sock_set_tos(struct sock *sk, int val)
+void __mptcp_sock_set_tos(struct sock *sk, int val)
{
struct mptcp_sock *msk = mptcp_sk(sk);
struct mptcp_subflow_context *subflow;
@@ -1728,6 +1730,7 @@ static void __mptcp_sock_set_tos(struct sock *sk, int val)
}
release_sock(sk);
}
+EXPORT_SYMBOL(__mptcp_sock_set_tos);
void mptcp_sock_set_tos(struct sock *sk)
{
@@ -1783,3 +1786,28 @@ void mptcp_sock_set_nodelay(struct sock *sk)
release_sock(sk);
}
EXPORT_SYMBOL(mptcp_sock_set_nodelay);
+
+int mptcp_sock_set_syncnt(struct sock *sk, int val)
+{
+ struct mptcp_sock *msk = mptcp_sk(sk);
+ struct mptcp_subflow_context *subflow;
+ struct sock *ssk;
+
+ if (val < 1 || val > MAX_TCP_SYNCNT)
+ return -EINVAL;
+
+ lock_sock(sk);
+ sockopt_seq_inc(msk);
+ inet_csk(sk)->icsk_syn_retries = val;
+ mptcp_for_each_subflow(msk, subflow) {
+ ssk = mptcp_subflow_tcp_sock(subflow);
+ if (ssk) {
+ lock_sock_nested(ssk, SINGLE_DEPTH_NESTING);
+ tcp_sock_set_syncnt(ssk, val);
+ release_sock(ssk);
+ }
+ }
+ release_sock(sk);
+ return 0;
+}
+EXPORT_SYMBOL(mptcp_sock_set_syncnt);
--
2.53.0
* [PATCH 10/11] selftests: mptcp: add nvme over mptcp test
From: Geliang Tang @ 2026-05-28 3:10 UTC (permalink / raw)
To: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
Chaitanya Kulkarni, Matthieu Baerts, Mat Martineau, Geliang Tang,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Shuah Khan
Cc: Geliang Tang, linux-nvme, netdev, mptcp, linux-kselftest,
Hannes Reinecke, John Meneghini, Randy Jennings, Nilay Shroff,
zhenwei pi, Hui Zhu, Gang Yan
From: Geliang Tang <tanggeliang@kylinos.cn>
A test case for NVMe over MPTCP has been implemented. It verifies the
proper functionality of nvme discover and connect commands to establish
NVMe over MPTCP connections. The test then evaluates read/write
performance using fio, and ensures proper cleanup with nvme disconnect.
This script accepts two positional parameters:
trtype - Transport type (mptcp|tcp). Default: mptcp
path - Number of NVMe multipath paths (1-4). Default: 1
This test simulates four NICs on both target and host sides, each limited
to 125MB/s. With a single NVMe multipath path, 'NVMe over MPTCP' delivers
roughly 3.5 to 3.8 times the bandwidth of standard TCP:
# ./mptcp_nvme.sh tcp
READ: bw=112MiB/s (118MB/s), 112MiB/s-112MiB/s (118MB/s-118MB/s),
io=1123MiB (1177MB), run=10018-10018msec
WRITE: bw=112MiB/s (117MB/s), 112MiB/s-112MiB/s (117MB/s-117MB/s),
io=1118MiB (1173MB), run=10018-10018msec
# ./mptcp_nvme.sh mptcp
READ: bw=427MiB/s (448MB/s), 427MiB/s-427MiB/s (448MB/s-448MB/s),
io=4286MiB (4494MB), run=10039-10039msec
WRITE: bw=387MiB/s (406MB/s), 387MiB/s-387MiB/s (406MB/s-406MB/s),
io=3885MiB (4073MB), run=10043-10043msec
This shows that MPTCP provides the same multi-interface bandwidth
aggregation capability as NVMe multipath.
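The figures above can be checked with a little arithmetic. The helpers
below are hypothetical and only illustrate the unit conversion and the
ratio; they are not part of the selftest.

```c
#include <assert.h>

/* Convert a link rate in Mbit/s to MiB/s (1 Mbit = 10^6 bits, 1 MiB = 2^20 bytes). */
static double mbit_to_mib_per_s(double mbit)
{
	assert(mbit >= 0.0);
	return mbit * 1e6 / 8.0 / (1024.0 * 1024.0);
}

/* Ratio of the MPTCP result to the TCP result, both in the same unit. */
static double speedup(double mptcp_bw, double tcp_bw)
{
	assert(tcp_bw > 0.0);
	return mptcp_bw / tcp_bw;
}
```

A 1000 Mbit/s limit is 125 MB/s, or about 119 MiB/s, so the 112 MiB/s TCP
read result is close to the capacity of one link, and speedup(427, 112)
gives about 3.8 for the MPTCP read result.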
Cc: Hannes Reinecke <hare@suse.de>
Cc: John Meneghini <jmeneghi@redhat.com>
Cc: Randy Jennings <randyj@purestorage.com>
Cc: Nilay Shroff <nilay@linux.ibm.com>
Co-developed-by: zhenwei pi <zhenwei.pi@linux.dev>
Signed-off-by: zhenwei pi <zhenwei.pi@linux.dev>
Co-developed-by: Hui Zhu <zhuhui@kylinos.cn>
Signed-off-by: Hui Zhu <zhuhui@kylinos.cn>
Co-developed-by: Gang Yan <yangang@kylinos.cn>
Signed-off-by: Gang Yan <yangang@kylinos.cn>
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
---
tools/testing/selftests/net/mptcp/Makefile | 1 +
tools/testing/selftests/net/mptcp/config | 8 +
.../testing/selftests/net/mptcp/mptcp_lib.sh | 12 +
.../testing/selftests/net/mptcp/mptcp_nvme.sh | 329 ++++++++++++++++++
4 files changed, 350 insertions(+)
create mode 100755 tools/testing/selftests/net/mptcp/mptcp_nvme.sh
diff --git a/tools/testing/selftests/net/mptcp/Makefile b/tools/testing/selftests/net/mptcp/Makefile
index 22ba0da2adb8..7b308447a58b 100644
--- a/tools/testing/selftests/net/mptcp/Makefile
+++ b/tools/testing/selftests/net/mptcp/Makefile
@@ -13,6 +13,7 @@ TEST_PROGS := \
mptcp_connect_sendfile.sh \
mptcp_connect_splice.sh \
mptcp_join.sh \
+ mptcp_nvme.sh \
mptcp_sockopt.sh \
pm_netlink.sh \
simult_flows.sh \
diff --git a/tools/testing/selftests/net/mptcp/config b/tools/testing/selftests/net/mptcp/config
index 59051ee2a986..e59cf7398f19 100644
--- a/tools/testing/selftests/net/mptcp/config
+++ b/tools/testing/selftests/net/mptcp/config
@@ -34,3 +34,11 @@ CONFIG_NFT_SOCKET=m
CONFIG_NFT_TPROXY=m
CONFIG_SYN_COOKIES=y
CONFIG_VETH=y
+CONFIG_BLK_DEV_LOOP=y
+CONFIG_CONFIGFS_FS=y
+CONFIG_NVME_CORE=y
+CONFIG_NVME_FABRICS=y
+CONFIG_NVME_TCP=y
+CONFIG_NVME_TARGET=y
+CONFIG_NVME_TARGET_TCP=y
+CONFIG_NVME_MULTIPATH=y
diff --git a/tools/testing/selftests/net/mptcp/mptcp_lib.sh b/tools/testing/selftests/net/mptcp/mptcp_lib.sh
index 5ef6033775c8..e08854ba42bd 100644
--- a/tools/testing/selftests/net/mptcp/mptcp_lib.sh
+++ b/tools/testing/selftests/net/mptcp/mptcp_lib.sh
@@ -530,6 +530,18 @@ mptcp_lib_check_tools() {
exit ${KSFT_SKIP}
fi
;;
+ "nvme")
+ if ! nvme --version &> /dev/null; then
+ mptcp_lib_pr_skip "nvme tool not found"
+ exit ${KSFT_SKIP}
+ fi
+ ;;
+ "fio")
+ if ! fio -h &> /dev/null; then
+ mptcp_lib_pr_skip "fio tool not found"
+ exit ${KSFT_SKIP}
+ fi
+ ;;
*)
mptcp_lib_pr_fail "Internal error: unsupported tool: ${tool}"
exit ${KSFT_FAIL}
diff --git a/tools/testing/selftests/net/mptcp/mptcp_nvme.sh b/tools/testing/selftests/net/mptcp/mptcp_nvme.sh
new file mode 100755
index 000000000000..5b1133dbc2d5
--- /dev/null
+++ b/tools/testing/selftests/net/mptcp/mptcp_nvme.sh
@@ -0,0 +1,329 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+. "$(dirname "$0")/mptcp_lib.sh"
+
+ret=0
+trtype="${1:-mptcp}"
+path="${2:-1}"
+nqn="nqn.2014-08.org.nvmexpress.${trtype}dev.$$.${RANDOM}"
+ns=1
+port=$((RANDOM % 10000 + 20000))
+trsvcid=$((RANDOM % 64512 + 1024))
+ns1=""
+ns2=""
+temp_file=""
+loop_dev=""
+
+export trtype path nqn ns port trsvcid
+export loop_dev temp_file
+
+usage()
+{
+ cat << EOF
+
+Usage:
+
+ $(basename "$0") [trtype] [path]
+
+ trtype Transport type (tcp|mptcp) - default: mptcp
+ path Number of multipath (1-4) - default: 1
+
+EOF
+exit ${KSFT_FAIL}
+}
+
+validate_params()
+{
+ if [[ ! "${trtype}" =~ ^(tcp|mptcp)$ ]]; then
+ echo "Invalid trtype ${trtype}. Must be tcp or mptcp"
+ usage
+ fi
+
+ if [[ ! "${path}" =~ ^[1-4]$ ]]; then
+ echo "Invalid path count ${path}. Must be between 1 and 4"
+ usage
+ fi
+}
+
+# This function is invoked indirectly
+#shellcheck disable=SC2317,SC2329
+ns1_cleanup()
+{
+ pushd /sys/kernel/config/nvmet || exit 1
+
+ for i in $(seq 1 "${path}"); do
+ local portdir=$((port + i))
+
+ rm -rf "ports/${portdir}/subsystems/${nqn}"
+ rmdir "ports/${portdir}"
+ done
+
+ echo 0 > "subsystems/${nqn}/namespaces/${ns}/enable"
+ rmdir "subsystems/${nqn}/namespaces/${ns}"
+ rmdir "subsystems/${nqn}"
+
+ popd || exit 1
+}
+
+# This function is invoked indirectly
+#shellcheck disable=SC2317,SC2329
+ns2_cleanup()
+{
+ nvme disconnect -n "${nqn}" || true
+}
+
+# This function is used in the cleanup trap
+#shellcheck disable=SC2317,SC2329
+cleanup()
+{
+ if ! ip netns exec "$ns2" bash <<- EOF
+ $(declare -f ns2_cleanup)
+ ns2_cleanup
+ EOF
+ then
+ echo "ns2_cleanup failed" >&2
+ fi
+
+ sleep 1
+
+ if ! ip netns exec "$ns1" unshare -m bash <<- EOF
+ mount -t configfs none /sys/kernel/config
+ $(declare -f ns1_cleanup)
+ ns1_cleanup
+ EOF
+ then
+ echo "ns1_cleanup failed" >&2
+ fi
+
+ if [ -n "${loop_dev}" ] && [ -b "${loop_dev}" ]; then
+ losetup -d "${loop_dev}" 2>/dev/null || true
+ fi
+ rm -rf "${temp_file}"
+
+ mptcp_lib_ns_exit "$ns1" "$ns2"
+
+ unset -v trtype path nqn ns port trsvcid
+ unset -v loop_dev temp_file
+}
+
+# $tc_args needs word splitting to pass multiple arguments to netem
+# shellcheck disable=SC2086
+init()
+{
+ local tc_args="rate 1000mbit"
+
+ mptcp_lib_ns_init ns1 ns2
+
+ # ns1 ns2
+ # 10.1.1.1 10.1.1.2
+ # 10.1.2.1 10.1.2.2
+ # 10.1.3.1 10.1.3.2
+ # 10.1.4.1 10.1.4.2
+ for i in {1..4}; do
+ ip link add ns1eth"$i" netns "$ns1" type veth peer \
+ name ns2eth"$i" netns "$ns2"
+ ip -net "$ns1" addr add 10.1."$i".1/24 dev ns1eth"$i"
+ ip -net "$ns1" addr add dead:beef:"$i"::1/64 \
+ dev ns1eth"$i" nodad
+ ip -net "$ns1" link set ns1eth"$i" up
+ ip -net "$ns2" addr add 10.1."$i".2/24 dev ns2eth"$i"
+ ip -net "$ns2" addr add dead:beef:"$i"::2/64 \
+ dev ns2eth"$i" nodad
+ ip -net "$ns2" link set ns2eth"$i" up
+ ip -net "$ns2" route add default via 10.1."$i".1 \
+ dev ns2eth"$i" metric 10"$i"
+ ip -net "$ns2" route add default via dead:beef:"$i"::1 \
+ dev ns2eth"$i" metric 10"$i"
+
+ # Add tc qdisc to both namespaces for bandwidth limiting
+ tc -n "$ns1" qdisc add dev ns1eth"$i" root netem $tc_args
+ tc -n "$ns2" qdisc add dev ns2eth"$i" root netem $tc_args
+
+ tc -n "$ns1" qdisc show dev ns1eth"$i"
+ tc -n "$ns2" qdisc show dev ns2eth"$i"
+ done
+
+ mptcp_lib_pm_nl_set_limits "${ns1}" 8 8
+
+ mptcp_lib_pm_nl_add_endpoint "$ns1" 10.1.1.1 flags signal
+ mptcp_lib_pm_nl_add_endpoint "$ns1" 10.1.2.1 flags signal
+ mptcp_lib_pm_nl_add_endpoint "$ns1" 10.1.3.1 flags signal
+ mptcp_lib_pm_nl_add_endpoint "$ns1" 10.1.4.1 flags signal
+
+ mptcp_lib_pm_nl_set_limits "${ns2}" 8 8
+
+ mptcp_lib_pm_nl_add_endpoint "$ns2" 10.1.1.2 flags subflow
+ mptcp_lib_pm_nl_add_endpoint "$ns2" 10.1.2.2 flags subflow
+ mptcp_lib_pm_nl_add_endpoint "$ns2" 10.1.3.2 flags subflow
+ mptcp_lib_pm_nl_add_endpoint "$ns2" 10.1.4.2 flags subflow
+}
+
+# This function is invoked indirectly
+#shellcheck disable=SC2317,SC2329
+run_target()
+{
+ cd /sys/kernel/config/nvmet/subsystems || exit
+ mkdir -p "${nqn}"
+ cd "${nqn}" || exit
+ echo 1 > attr_allow_any_host
+ mkdir -p namespaces/"${ns}"
+ echo "${loop_dev}" > namespaces/"${ns}"/device_path
+ echo 1 > namespaces/"${ns}"/enable
+
+ # Create ${path} ports, each on a different IP address
+ for i in $(seq 1 "${path}"); do
+ local portdir=$((port + i))
+
+ cd /sys/kernel/config/nvmet/ports || exit
+ mkdir -p "${portdir}"
+ cd "${portdir}" || exit 1
+ echo "${trtype}" > addr_trtype
+ echo ipv4 > addr_adrfam
+ if [ "${path}" -eq 1 ]; then
+ echo "0.0.0.0" > addr_traddr
+ else
+ echo "10.1.${i}.1" > addr_traddr
+ fi
+ echo "${trsvcid}" > addr_trsvcid
+
+ mkdir -p subsystems
+ ln -sf "../../subsystems/${nqn}" "subsystems/${nqn}"
+ cd - >/dev/null || exit
+ done
+}
+
+# This function is invoked indirectly
+#shellcheck disable=SC2317,SC2329
+run_host()
+{
+ local traddr=10.1.1.1
+ local devname
+
+ echo "nvme discover -a ${traddr}"
+ if ! nvme discover -t "${trtype}" -a "${traddr}" \
+ -s "${trsvcid}"; then
+ echo "Failed to discover ${traddr}"
+ return 1
+ fi
+
+ for i in $(seq 1 "${path}"); do
+ traddr=10.1.${i}.1
+ echo "Connecting to ${traddr}:${trsvcid}"
+ if ! nvme connect -t "${trtype}" -a "${traddr}" \
+ -s "${trsvcid}" -n "${nqn}"; then
+ echo "Failed to connect to ${traddr}"
+ return 1
+ fi
+ done
+
+ for i in $(seq 1 10); do
+ for dev in /dev/nvme*n1; do
+ if [ -b "$dev" ] 2>/dev/null; then
+ if nvme id-ctrl "$dev" 2>/dev/null |
+ grep -q "${nqn}"; then
+ devname=$(basename "$dev")
+ break 2
+ fi
+ fi
+ done 2>/dev/null
+ [ -n "$devname" ] && break
+ sleep 1
+ done
+
+ if [ -z "$devname" ]; then
+ echo "No block device found for NQN ${nqn}" >&2
+ return 1
+ fi
+
+ echo "nvme list"
+ if ! nvme list; then
+ echo "nvme list failed" >&2
+ return 1
+ fi
+
+ sleep 1
+
+ echo "fio randread /dev/${devname}"
+ if ! fio --name=global --direct=1 --norandommap --randrepeat=0 \
+ --ioengine=libaio --thread=1 --blocksize=128k --runtime=10 \
+ --time_based --rw=randread --numjobs=4 --iodepth=256 \
+ --group_reporting --size=100% \
+ --name=libaio_4_256_128k_randread \
+ --filename="/dev/${devname}"; then
+ echo "fio randread failed"
+ return 1
+ fi
+
+ sleep 1
+
+ echo "fio randwrite /dev/${devname}"
+ if ! fio --name=global --direct=1 --norandommap --randrepeat=0 \
+ --ioengine=libaio --thread=1 --blocksize=128k --runtime=10 \
+ --time_based --rw=randwrite --numjobs=4 --iodepth=256 \
+ --group_reporting --size=100% \
+ --name=libaio_4_256_128k_randwrite \
+ --filename="/dev/${devname}"; then
+ echo "fio randwrite failed"
+ return 1
+ fi
+
+ nvme flush "/dev/${devname}"
+}
+
+mptcp_lib_check_tools nvme fio
+validate_params
+
+if ! temp_file=$(mktemp --suffix=.raw /tmp/nvme_test.XXXXXX); then
+ echo "Failed to create temp file"
+ exit 1
+fi
+
+trap cleanup EXIT
+
+if ! dd if=/dev/zero of="${temp_file}" bs=1M count=0 seek=512; then
+ echo "Failed to create backing file" >&2
+ exit 1
+fi
+
+if ! loop_dev=$(losetup -f --show "${temp_file}"); then
+ echo "Failed to create loop device" >&2
+ exit 1
+fi
+
+init
+
+run_test()
+{
+ if ! ip netns exec "$ns1" unshare -m bash <<- EOF
+ mount -t configfs none /sys/kernel/config
+ $(declare -f run_target)
+ run_target
+ exit \$?
+ EOF
+ then
+ ret="${KSFT_FAIL}"
+ fi
+
+ if ! ip netns exec "$ns2" bash <<- EOF
+ $(declare -f run_host)
+ run_host
+ exit \$?
+ EOF
+ then
+ ret="${KSFT_FAIL}"
+ fi
+
+ sleep 1
+}
+
+run_test "$@"
+
+if [ "${ret}" -eq 0 ]; then
+ mptcp_lib_result_pass "nvme over ${trtype} test"
+else
+ mptcp_lib_result_fail "nvme over ${trtype} test"
+fi
+
+mptcp_lib_result_print_all_tap
+exit "$ret"
--
2.53.0
* [PATCH 11/11] selftests: mptcp: nvme: add iopolicy tests
From: Geliang Tang @ 2026-05-28 3:10 UTC (permalink / raw)
To: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
Chaitanya Kulkarni, Matthieu Baerts, Mat Martineau, Geliang Tang,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Shuah Khan
Cc: Geliang Tang, linux-nvme, netdev, mptcp, linux-kselftest,
Hannes Reinecke, John Meneghini, Randy Jennings, Nilay Shroff,
zhenwei pi, Hui Zhu, Gang Yan
From: Geliang Tang <tanggeliang@kylinos.cn>
Add NVMe iopolicy testing to mptcp_nvme.sh, with the default set to
"numa". It can be set to "round-robin" or "queue-depth".
Test results with 4 NVMe multipath paths and round-robin iopolicy show
that TCP and MPTCP achieve similar bandwidth:
# ./mptcp_nvme.sh tcp 4 round-robin
READ: bw=455MiB/s (478MB/s), 455MiB/s-455MiB/s (478MB/s-478MB/s),
io=4665MiB (4891MB), run=10242-10242msec
WRITE: bw=455MiB/s (477MB/s), 455MiB/s-455MiB/s (477MB/s-477MB/s),
io=4633MiB (4858MB), run=10184-10184msec
# ./mptcp_nvme.sh mptcp 4 round-robin
READ: bw=445MiB/s (466MB/s), 445MiB/s-445MiB/s (466MB/s-466MB/s),
io=4575MiB (4797MB), run=10287-10287msec
WRITE: bw=445MiB/s (467MB/s), 445MiB/s-445MiB/s (467MB/s-467MB/s),
io=4572MiB (4794MB), run=10267-10267msec
A "loss" argument is added to simulate network packet loss. When loss=1,
"delay 5ms loss 0.5%" is appended to the netem arguments applied to each
veth interface via tc qdisc. Under these conditions, TCP throughput is
roughly 3x (read) to 4x (write) lower than MPTCP:
# ./mptcp_nvme.sh tcp 4 round-robin 1
READ: bw=144MiB/s (151MB/s), 144MiB/s-144MiB/s (151MB/s-151MB/s),
io=1909MiB (2001MB), run=13231-13231msec
WRITE: bw=100.0MiB/s (105MB/s), 100.0MiB/s-100.0MiB/s (105MB/s-105MB/s),
io=1397MiB (1465MB), run=13980-13980msec
# ./mptcp_nvme.sh mptcp 4 round-robin 1
READ: bw=428MiB/s (449MB/s), 428MiB/s-428MiB/s (449MB/s-449MB/s),
io=4524MiB (4743MB), run=10564-10564msec
WRITE: bw=431MiB/s (452MB/s), 431MiB/s-431MiB/s (452MB/s-452MB/s),
io=4513MiB (4732MB), run=10481-10481msec
These results indicate that MPTCP is more resilient to packet loss than
TCP in this setup, as it can leverage multiple subflows to mitigate
network degradation.
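The gap can be quantified directly from the fio summary lines quoted
above. The following is a minimal sketch, assuming fio reports bandwidth
in MiB/s as it does in these runs; the helper names fio_bw and ratio are
hypothetical and not part of the test script.

```shell
#!/bin/bash
# Extract the MiB/s figure from a fio summary line such as
# "READ: bw=455MiB/s (478MB/s), ...". Assumes the unit is MiB/s.
fio_bw()
{
	sed -n 's/.*bw=\([0-9.]*\)MiB\/s.*/\1/p'
}

# Print a / b rounded to two decimals.
ratio()
{
	awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

# Read bandwidth with loss=1, taken from the results in this message.
tcp_read=$(echo "READ: bw=144MiB/s (151MB/s), 144MiB/s-144MiB/s" | fio_bw)
mptcp_read=$(echo "READ: bw=428MiB/s (449MB/s), 428MiB/s-428MiB/s" | fio_bw)

echo "read, with loss: MPTCP/TCP = $(ratio "${mptcp_read}" "${tcp_read}")"
```

Applied to the figures above, this gives about 2.97 for reads (428 vs 144
MiB/s) and 4.31 for writes (431 vs 100 MiB/s) with loss enabled, against
about 0.98 for both directions without loss (445 vs 455 MiB/s).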
Cc: Hannes Reinecke <hare@suse.de>
Cc: John Meneghini <jmeneghi@redhat.com>
Cc: Randy Jennings <randyj@purestorage.com>
Cc: Nilay Shroff <nilay@linux.ibm.com>
Co-developed-by: zhenwei pi <zhenwei.pi@linux.dev>
Signed-off-by: zhenwei pi <zhenwei.pi@linux.dev>
Co-developed-by: Hui Zhu <zhuhui@kylinos.cn>
Signed-off-by: Hui Zhu <zhuhui@kylinos.cn>
Co-developed-by: Gang Yan <yangang@kylinos.cn>
Signed-off-by: Gang Yan <yangang@kylinos.cn>
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
---
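As a note on the loss option: the sketch below shows how the netem
argument string is assembled, following the init() hunk in this patch,
and prints the resulting tc command as a dry run. The helper name
netem_args, the interface name veth0 and the exact tc invocation are
assumptions for illustration; the script's real tc call is not part of
this diff.

```shell
#!/bin/bash
# Build the netem argument string the way init() does in this patch:
# a fixed rate limit, plus delay and loss when loss=1.
netem_args()
{
	local loss="${1:-0}"
	local tc_args="rate 1000mbit"

	if [ "${loss}" -eq 1 ]; then
		tc_args+=" delay 5ms loss 0.5%"
	fi
	echo "${tc_args}"
}

# Dry run: print the command instead of executing it ("veth0" is assumed).
# $(netem_args 1) is left unquoted on purpose so that it splits into
# separate arguments, as the script's own comment on $tc_args requires.
# shellcheck disable=SC2046
echo tc qdisc add dev veth0 root netem $(netem_args 1)
```

Keeping the rate limit in both cases means the loss=0 and loss=1 runs
differ only in the added delay and loss, which makes the bandwidth
comparison in the commit message easier to interpret.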
.../testing/selftests/net/mptcp/mptcp_nvme.sh | 70 ++++++++++++++++++-
1 file changed, 69 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/net/mptcp/mptcp_nvme.sh b/tools/testing/selftests/net/mptcp/mptcp_nvme.sh
index 5b1133dbc2d5..3ab04be05dff 100755
--- a/tools/testing/selftests/net/mptcp/mptcp_nvme.sh
+++ b/tools/testing/selftests/net/mptcp/mptcp_nvme.sh
@@ -6,6 +6,8 @@
ret=0
trtype="${1:-mptcp}"
path="${2:-1}"
+iopolicy=${3:-"numa"} # round-robin, queue-depth
+loss=${4:-0}
nqn="nqn.2014-08.org.nvmexpress.${trtype}dev.$$.${RANDOM}"
ns=1
port=$((RANDOM % 10000 + 20000))
@@ -17,6 +19,7 @@ loop_dev=""
export trtype path nqn ns port trsvcid
export loop_dev temp_file
+export iopolicy loss
usage()
{
@@ -24,10 +27,12 @@ usage()
Usage:
- $(basename "$0") [trtype] [path]
+ $(basename "$0") [trtype] [path] [iopolicy] [loss]
trtype Transport type (tcp|mptcp) - default: mptcp
path Number of multipath (1-4) - default: 1
+ iopolicy I/O policy (numa|round-robin|queue-depth) - default: numa
+ loss Enable packet loss (0|1) - default: 0
EOF
exit ${KSFT_FAIL}
@@ -44,6 +49,16 @@ validate_params()
echo "Invalid path count ${path}. Must be between 1 and 4"
usage
fi
+
+ if [[ ! "${iopolicy}" =~ ^(numa|round-robin|queue-depth)$ ]]; then
+ echo "Invalid iopolicy ${iopolicy}."
+ usage
+ fi
+
+ if [[ ! "${loss}" =~ ^[01]$ ]]; then
+ echo "Invalid loss value ${loss}. Must be 0 or 1"
+ usage
+ fi
}
# This function is invoked indirectly
@@ -105,6 +120,7 @@ cleanup()
unset -v trtype path nqn ns port trsvcid
unset -v loop_dev temp_file
+ unset -v iopolicy loss
}
# $tc_args needs word splitting to pass multiple arguments to netem
@@ -113,6 +129,10 @@ init()
{
local tc_args="rate 1000mbit"
+ if [ "${loss}" -eq 1 ]; then
+ tc_args+=" delay 5ms loss 0.5%"
+ fi
+
mptcp_lib_ns_init ns1 ns2
# ns1 ns2
@@ -193,6 +213,48 @@ run_target()
done
}
+# This function is invoked indirectly
+#shellcheck disable=SC2317,SC2329
+set_io_policy()
+{
+ local nqn="$1"
+ local iopolicy="$2"
+ local subname
+ local policy
+ local current
+
+ subname=$(nvme list-subsys 2>/dev/null | grep "${nqn}" |
+ grep -o 'nvme-subsys[0-9]*' | head -1)
+ if [ -z "$subname" ]; then
+ return 1
+ fi
+
+ policy="/sys/class/nvme-subsystem/${subname}/iopolicy"
+ if [ ! -e "$policy" ]; then
+ # NVMe multipath not supported, skip iopolicy setting
+ return 0
+ fi
+
+ if [ ! -w "$policy" ]; then
+ return 1
+ fi
+
+ if ! echo "${iopolicy}" > "$policy" 2>/dev/null; then
+ return 1
+ fi
+
+ current=$(cat "$policy" 2>/dev/null)
+ if [ -z "$current" ]; then
+ return 1
+ fi
+
+ if [[ "$current" != *"${iopolicy}"* ]]; then
+ return 1
+ fi
+
+ return 0
+}
+
# This function is invoked indirectly
#shellcheck disable=SC2317,SC2329
run_host()
@@ -242,6 +304,11 @@ run_host()
return 1
fi
+ if ! set_io_policy "${nqn}" "${iopolicy}"; then
+ echo "Failed to set I/O policy to ${iopolicy}"
+ return 1
+ fi
+
sleep 1
echo "fio randread /dev/${devname}"
@@ -306,6 +373,7 @@ run_test()
fi
if ! ip netns exec "$ns2" bash <<- EOF
+ $(declare -f set_io_policy)
$(declare -f run_host)
run_host
exit \$?
--
2.53.0