public inbox for linux-nvme@lists.infradead.org
* [LSF/MM/BPF TOPIC] NVMe over MPTCP: Multi-Fold Acceleration for NVMe over TCP in Multi-NIC Environments
@ 2026-01-29  4:13 Geliang Tang
  2026-02-25  5:57 ` Ming Lei
  2026-02-25 15:07 ` Nilay Shroff
  0 siblings, 2 replies; 6+ messages in thread
From: Geliang Tang @ 2026-01-29  4:13 UTC (permalink / raw)
  To: lsf-pc, linux-nvme
  Cc: mptcp, Matthieu Baerts, Mat Martineau, Paolo Abeni,
	Hannes Reinecke

As one of the MPTCP upstream developers, I have recently been working
on adding MPTCP support to 'NVMe over TCP'. In multi-NIC environments,
this approach achieves a multi-fold performance improvement over
standard TCP. The implementation and testing phases are largely
complete. The code is currently at the RFC stage and has undergone
several rounds of discussion and iteration on the MPTCP mailing list
[1]. It will be sent to the NVMe mailing list shortly.

1. Introduction to MPTCP

Multipath TCP (MPTCP), standardized in RFC 8684, represents a major
evolution of the TCP protocol. It enables a single transport connection
to utilize multiple network paths simultaneously, providing benefits in
redundancy, resilience, and bandwidth aggregation. Since its
introduction in Linux kernel v5.6, it has become a key technology for
modern networking, particularly in multi-NIC environments.

On a supported system such as Linux, an MPTCP socket is created by
specifying the IPPROTO_MPTCP protocol in the socket() system call:

	int fd = socket(AF_INET, SOCK_STREAM, IPPROTO_MPTCP);

This creates a socket that appears as a standard TCP socket to the
application but uses the MPTCP protocol stack underneath.

For more details, please visit the project website: https://mptcp.dev.

2. Implementation

'NVMe over TCP' establishes multiple TCP connections between the target
and host for data transfer. This includes one admin queue connection
for management traffic and multiple I/O queue connections for data
traffic, with the number typically scaling with available CPU cores.
While these multiple TCP connections (using the same IP address but
different ports) help distribute computational load across CPUs, all
data traffic still flows through a single network interface card (NIC),
even in multi-NIC environments.

The 'NVMe over MPTCP' solution enhances 'NVMe over TCP' by replacing
the multiple TCP connections with multiple MPTCP connections, leaving
other mechanisms unchanged. Internally, each MPTCP connection can
establish multiple subflows based on the number of configured NICs.
This distributes data traffic across all available NICs, thereby
increasing aggregate transmission speed.

Therefore, the primary change required is to switch the protocol
parameter from IPPROTO_TCP to IPPROTO_MPTCP at the existing socket
creation sites on both the target and host sides:

	Target side:

	sock_create(port->addr.ss_family, SOCK_STREAM,
			IPPROTO_TCP, &port->sock);

	Host side:

	sock_create_kern(current->nsproxy->net_ns,
			ctrl->addr.ss_family, SOCK_STREAM,
			IPPROTO_TCP, &queue->sock);

A new NVMe transport type, named NVMF_TRTYPE_MPTCP (suggested by Hannes
Reinecke), has been introduced to determine whether to create a TCP or
MPTCP socket:

	Target side:

	if (nport->disc_addr.trtype == NVMF_TRTYPE_MPTCP)
		proto = IPPROTO_MPTCP;

	Host side:

	if (!strcmp(ctrl->ctrl.opts->transport, "mptcp"))
		proto = IPPROTO_MPTCP;

3. Performance Benefits

This new feature has been evaluated in different environments:

I conducted 'NVMe over MPTCP' tests between two PCs, each equipped with
two Gigabit NICs and directly connected via Ethernet cables. Using
'NVMe over TCP', the fio benchmark showed approximately 100 MiB/s,
roughly the usable line rate of a single Gigabit link. In contrast,
'NVMe over MPTCP' achieved about 200 MiB/s with fio, doubling the
throughput by using both links.

In a virtual machine test environment simulating four NICs on both
sides, 'NVMe over MPTCP' delivered bandwidth up to four times that of
standard TCP.
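The exact fio options used were not given in the thread; a
representative sequential-read job for this kind of link-saturation
test might look like the following (the device path and all option
values are illustrative, not the actual test configuration):

```shell
fio --name=seqread --filename=/dev/nvme1n1 --rw=read \
    --bs=128k --iodepth=32 --ioengine=libaio --direct=1 \
    --runtime=30 --time_based --group_reporting
```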

4. Configuration

To achieve the described multi-fold acceleration benefits, both the
target and host sides must be deployed in multi-NIC environments with
properly configured MPTCP endpoints. The target side should use the
'signal' flag for its endpoints, while the host side should use the
'subflow' flag.

	Target side:

	# ip mptcp endpoint add 192.168.1.2 id 2 dev enp3s0f1 signal
	# echo mptcp > /sys/kernel/config/nvmet/ports/1234/addr_trtype

	Host side:

	# ip mptcp endpoint add 192.168.1.4 id 2 dev enp1s0f1 subflow
	# nvme discover -t mptcp ...
	# nvme connect -t mptcp ...
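Not part of the original mail, but assuming a standard iproute2 setup,
the path-manager limits may also need to be raised to match the number
of extra NICs, and the resulting subflows can be checked with commands
along these lines:

```shell
# Allow up to 2 additional subflows per connection (run on both
# sides); adjust to the number of extra NICs in use.
ip mptcp limits set subflows 2 add_addr_accepted 2

# Verify the configuration and the established MPTCP connections:
ip mptcp endpoint show      # configured endpoints and their flags
ip mptcp limits show        # current per-connection limits
ss -M                       # MPTCP sockets, one line per connection
```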

5. Dependencies

The modifications in the NVMe subsystem are minimal. Most of the code
changes are on the MPTCP side, implementing interfaces for using MPTCP
from kernel space, similar to what is done today with TCP.

'NVMe over TCP' uses the read_sock interface to receive data.
Consequently, a read_sock implementation for MPTCP has been added (the
'implement mptcp read_sock' series [2]), which is under review on the
MPTCP mailing list.

As 'NVMe over TCP' can use TLS for encryption, KTLS support for MPTCP
has also been added (the 'MPTCP KTLS support' series [3]), which is
currently at the RFC stage. TLS mode for 'NVMe over MPTCP' has been
validated successfully.

Corresponding updates are also required in user-space libraries and
tools, including libnvme [4], nvme-cli [5], and ktls-utils [6], to add
MPTCP support.

6. Discussion

The current approach defines a new transport type, NVMF_TRTYPE_MPTCP,
but I understand this will require registering a new transport type
number with NVMexpress.org. Hannes Reinecke also suggested declaring
MPTCP as a TCP 'variant', but I found some drawbacks with that
approach. I would like to discuss them and explore possible solutions.

I also seek guidance on how to incorporate MPTCP support into the NVMe
protocol specifications. I lack experience with the specification
process and would appreciate assistance from the NVMe community.

Thanks,
-Geliang

[1]
NVME over MPTCP
https://patchwork.kernel.org/project/mptcp/cover/cover.1764152990.git.tanggeliang@kylinos.cn/
[2]
implement mptcp read_sock
https://patchwork.kernel.org/project/mptcp/cover/cover.1765023923.git.tanggeliang@kylinos.cn/
[3]
MPTCP KTLS support
https://patchwork.kernel.org/project/mptcp/cover/cover.1768294706.git.tanggeliang@kylinos.cn/
[4]
libnvme: add mptcp trtype
https://patchwork.kernel.org/project/mptcp/patch/99f6e63b5c9677f29a9bc8cdd87b2064b258435f.1764206766.git.tanggeliang@kylinos.cn/
[5]
fabrics: add mptcp support
https://github.com/linux-nvme/nvme-cli/commit/f468531d0592ad22b71760d883409363b1f8a9d6
[6]
add mptcp support
https://github.com/oracle/ktls-utils/commit/4a45e486c65be986ef349ed10b0fc9bd5dbf107d



^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [LSF/MM/BPF TOPIC] NVMe over MPTCP: Multi-Fold Acceleration for NVMe over TCP in Multi-NIC Environments
  2026-01-29  4:13 [LSF/MM/BPF TOPIC] NVMe over MPTCP: Multi-Fold Acceleration for NVMe over TCP in Multi-NIC Environments Geliang Tang
@ 2026-02-25  5:57 ` Ming Lei
  2026-02-26  9:44   ` Geliang Tang
  2026-02-25 15:07 ` Nilay Shroff
  1 sibling, 1 reply; 6+ messages in thread
From: Ming Lei @ 2026-02-25  5:57 UTC (permalink / raw)
  To: Geliang Tang
  Cc: lsf-pc, linux-nvme, mptcp, Matthieu Baerts, Mat Martineau,
	Paolo Abeni, Hannes Reinecke

Hi Geliang,

Looks like an interesting topic!

On Thu, Jan 29, 2026 at 12:13:25PM +0800, Geliang Tang wrote:
> As one of the MPTCP upstream developers, I'm recently working on adding
> MPTCP support to 'NVMe over TCP'. This approach achieves a multi-fold
> performance improvement over using standard TCP. The implementation and
> testing phases are largely complete. The code is currently in the RFC
> stage and has undergone several rounds of discussion and iteration on
> the MPTCP mailing list [1]. It will be sent to the NVMe mailing list
> shortly.
> 
> 1. Introduction to MPTCP
> 
> Multipath TCP (MPTCP), standardized in RFC 8684, represents a major
> evolution of the TCP protocol. It enables a single transport connection
> to utilize multiple network paths simultaneously, providing benefits in
> redundancy, resilience, and bandwidth aggregation. Since its
> introduction in Linux kernel v5.6, it has become a key technology for
> modern networking, particularly in multi-NIC environments.
> 
> On a supported system such as Linux, an MPTCP socket is created by
> specifying the IPPROTO_MPTCP protocol in the socket() system call:
> 
> 	int fd = socket(AF_INET, SOCK_STREAM, IPPROTO_MPTCP);
> 
> This creates a socket that appears as a standard TCP socket to the
> application but uses the MPTCP protocol stack underneath.
> 
> For more details, please visit the project website: https://mptcp.dev.
> 
> 2. Implementation
> 
> 'NVMe over TCP' establishes multiple TCP connections between the target
> and host for data transfer. This includes one admin queue connection
> for management traffic and multiple I/O queue connections for data
> traffic, with the number typically scaling with available CPU cores.
> While these multiple TCP connections (using the same IP address but
> different ports) help distribute computational load across CPUs, all
> data traffic still flows through a single network interface card (NIC),
> even in multi-NIC environments.
> 
> The 'NVMe over MPTCP' solution enhances 'NVMe over TCP' by replacing
> the multiple TCP connections with multiple MPTCP connections, leaving
> other mechanisms unchanged. Internally, each MPTCP connection can
> establish multiple subflows based on the number of configured NICs.
> This distributes data traffic across all available NICs, thereby
> increasing aggregate transmission speed.

NVMe supports multipath, which can apply load balancing or similar
algorithms to maximize network link bandwidth too.

Maybe you can compare MPTCP with NVMe multipath from this viewpoint.



Thanks,
Ming




* Re: [LSF/MM/BPF TOPIC] NVMe over MPTCP: Multi-Fold Acceleration for NVMe over TCP in Multi-NIC Environments
  2026-01-29  4:13 [LSF/MM/BPF TOPIC] NVMe over MPTCP: Multi-Fold Acceleration for NVMe over TCP in Multi-NIC Environments Geliang Tang
  2026-02-25  5:57 ` Ming Lei
@ 2026-02-25 15:07 ` Nilay Shroff
  2026-02-26  9:54   ` Geliang Tang
  1 sibling, 1 reply; 6+ messages in thread
From: Nilay Shroff @ 2026-02-25 15:07 UTC (permalink / raw)
  To: Geliang Tang, lsf-pc, linux-nvme
  Cc: mptcp, Matthieu Baerts, Mat Martineau, Paolo Abeni,
	Hannes Reinecke



On 1/29/26 9:43 AM, Geliang Tang wrote:
> 3. Performance Benefits
> 
> This new feature has been evaluated in different environments:
> 
> I conducted 'NVMe over MPTCP' tests between two PCs, each equipped with
> two Gigabit NICs and directly connected via Ethernet cables. Using
> 'NVMe over TCP', the fio benchmark showed a speed of approximately 100
> MiB/s. In contrast, 'NVMe over MPTCP' achieved about 200 MiB/s with
> fio, doubling the throughput.
> 
> In a virtual machine test environment simulating four NICs on both
> sides, 'NVMe over MPTCP' delivered bandwidth up to four times that of
> standard TCP.

This is interesting. Did you try using an NVMe multipath iopolicy other
than the default numa policy? Assuming both the host and target are multihomed,
configuring round-robin or queue-depth may provide performance comparable
to what you are seeing with MPTCP.

I think MPTCP will distribute traffic using transport-level metrics such as
RTT, cwnd, and packet loss, whereas the NVMe multipath layer makes decisions
based on ANA state, queue depth, and NUMA locality. In a setup with multiple
active paths, switching the iopolicy from numa to round-robin or queue-depth
could improve load distribution across controllers and thus improve performance.

IMO, it would be useful to test with those policies and compare the results
against the MPTCP setup.

Thanks,
--Nilay



* Re: [LSF/MM/BPF TOPIC] NVMe over MPTCP: Multi-Fold Acceleration for NVMe over TCP in Multi-NIC Environments
  2026-02-25  5:57 ` Ming Lei
@ 2026-02-26  9:44   ` Geliang Tang
  0 siblings, 0 replies; 6+ messages in thread
From: Geliang Tang @ 2026-02-26  9:44 UTC (permalink / raw)
  To: Ming Lei
  Cc: lsf-pc, linux-nvme, mptcp, Matthieu Baerts, Mat Martineau,
	Paolo Abeni, Hannes Reinecke

Hi Ming,

On Wed, 2026-02-25 at 13:57 +0800, Ming Lei wrote:
> Hi Geliang,
> 
> Looks one interesting topic!

Thanks for your reply.

> 
> On Thu, Jan 29, 2026 at 12:13:25PM +0800, Geliang Tang wrote:
> > As one of the MPTCP upstream developers, I'm recently working on
> > adding
> > MPTCP support to 'NVMe over TCP'. This approach achieves a multi-
> > fold
> > performance improvement over using standard TCP. The implementation
> > and
> > testing phases are largely complete. The code is currently in the
> > RFC
> > stage and has undergone several rounds of discussion and iteration
> > on
> > the MPTCP mailing list [1]. It will be sent to the NVMe mailing
> > list
> > shortly.
> > 
> > 1. Introduction to MPTCP
> > 
> > Multipath TCP (MPTCP), standardized in RFC 8684, represents a major
> > evolution of the TCP protocol. It enables a single transport
> > connection
> > to utilize multiple network paths simultaneously, providing
> > benefits in
> > redundancy, resilience, and bandwidth aggregation. Since its
> > introduction in Linux kernel v5.6, it has become a key technology
> > for
> > modern networking, particularly in multi-NIC environments.
> > 
> > On a supported system such as Linux, an MPTCP socket is created by
> > specifying the IPPROTO_MPTCP protocol in the socket() system call:
> > 
> > 	int fd = socket(AF_INET, SOCK_STREAM, IPPROTO_MPTCP);
> > 
> > This creates a socket that appears as a standard TCP socket to the
> > application but uses the MPTCP protocol stack underneath.
> > 
> > For more details, please visit the project website:
> > https://mptcp.dev.
> > 
> > 2. Implementation
> > 
> > 'NVMe over TCP' establishes multiple TCP connections between the
> > target
> > and host for data transfer. This includes one admin queue
> > connection
> > for management traffic and multiple I/O queue connections for data
> > traffic, with the number typically scaling with available CPU
> > cores.
> > While these multiple TCP connections (using the same IP address but
> > different ports) help distribute computational load across CPUs,
> > all
> > data traffic still flows through a single network interface card
> > (NIC),
> > even in multi-NIC environments.
> > 
> > The 'NVMe over MPTCP' solution enhances 'NVMe over TCP' by
> > replacing
> > the multiple TCP connections with multiple MPTCP connections,
> > leaving
> > other mechanisms unchanged. Internally, each MPTCP connection can
> > establish multiple subflows based on the number of configured NICs.
> > This distributes data traffic across all available NICs, thereby
> > increasing aggregate transmission speed.
> 
> NVMe supports multipath, which can apply load balance or sort of
> algorithm
> to maximize network link/bandwidth too.
> 
> Maybe you can compare mptcp with multipath in this viewpoint.

Indeed worth comparing. Although they work at different layers, their
goals share similarities. I'll compare them in a follow-up and get back
to you.

Thanks,
-Geliang

> 
> 
> 
> Thanks,
> Ming
> 



* Re: [LSF/MM/BPF TOPIC] NVMe over MPTCP: Multi-Fold Acceleration for NVMe over TCP in Multi-NIC Environments
  2026-02-25 15:07 ` Nilay Shroff
@ 2026-02-26  9:54   ` Geliang Tang
  2026-03-05  4:30     ` Geliang Tang
  0 siblings, 1 reply; 6+ messages in thread
From: Geliang Tang @ 2026-02-26  9:54 UTC (permalink / raw)
  To: Nilay Shroff, lsf-pc, linux-nvme
  Cc: mptcp, Matthieu Baerts, Mat Martineau, Paolo Abeni,
	Hannes Reinecke

Hi Nilay,

Thanks for your reply.

On Wed, 2026-02-25 at 20:37 +0530, Nilay Shroff wrote:
> 
> 
> On 1/29/26 9:43 AM, Geliang Tang wrote:
> > 3. Performance Benefits
> > 
> > This new feature has been evaluated in different environments:
> > 
> > I conducted 'NVMe over MPTCP' tests between two PCs, each equipped
> > with
> > two Gigabit NICs and directly connected via Ethernet cables. Using
> > 'NVMe over TCP', the fio benchmark showed a speed of approximately
> > 100
> > MiB/s. In contrast, 'NVMe over MPTCP' achieved about 200 MiB/s with
> > fio, doubling the throughput.
> > 
> > In a virtual machine test environment simulating four NICs on both
> > sides, 'NVMe over MPTCP' delivered bandwidth up to four times that
> > of
> > standard TCP.
> 
> This is interesting. Did you try using an NVMe multipath iopolicy
> other
> than the default numa policy? Assuming both the host and target are
> multihomed,
> configuring round-robin or queue-depth may provide performance
> comparable
> to what you are seeing with MPTCP.
> 
> I think MPTCP shall distribute traffic using transport-level metrics
> such as
> RTT, cwnd, and packet loss, whereas the NVMe multipath layer makes
> decisions
> based on ANA state, queue depth, and NUMA locality. In a setup with
> multiple
> active paths, switching the iopolicy from numa to round-robin or
> queue-depth
> could improve load distribution across controllers and thus improve
> performance.
> 
> IMO, it would be useful to test with those policies and compare the
> results
> against the MPTCP setup.

Ming Lei also made a similar comment. In my experiments, I didn't set
the multipath iopolicy, so I was using the default numa policy. In the
follow-up, I'll adjust it to round-robin or queue-depth and rerun the
experiments. I'll share the results in this email thread.

Thanks,
-Geliang

> 
> Thanks,
> --Nilay



* Re: [LSF/MM/BPF TOPIC] NVMe over MPTCP: Multi-Fold Acceleration for NVMe over TCP in Multi-NIC Environments
  2026-02-26  9:54   ` Geliang Tang
@ 2026-03-05  4:30     ` Geliang Tang
  0 siblings, 0 replies; 6+ messages in thread
From: Geliang Tang @ 2026-03-05  4:30 UTC (permalink / raw)
  To: Nilay Shroff, lsf-pc, linux-nvme, Ming Lei, Javier González
  Cc: mptcp, Matthieu Baerts, Mat Martineau, Paolo Abeni,
	Hannes Reinecke

Hi Nilay, Ming,

Thank you again for your interest in NVMe over MPTCP.

On Thu, 2026-02-26 at 17:54 +0800, Geliang Tang wrote:
> Hi Nilay,
> 
> Thanks for your reply.
> 
> On Wed, 2026-02-25 at 20:37 +0530, Nilay Shroff wrote:
> > 
> > 
> > On 1/29/26 9:43 AM, Geliang Tang wrote:
> > > 3. Performance Benefits
> > > 
> > > This new feature has been evaluated in different environments:
> > > 
> > > I conducted 'NVMe over MPTCP' tests between two PCs, each
> > > equipped
> > > with
> > > two Gigabit NICs and directly connected via Ethernet cables.
> > > Using
> > > 'NVMe over TCP', the fio benchmark showed a speed of
> > > approximately
> > > 100
> > > MiB/s. In contrast, 'NVMe over MPTCP' achieved about 200 MiB/s
> > > with
> > > fio, doubling the throughput.
> > > 
> > > In a virtual machine test environment simulating four NICs on
> > > both
> > > sides, 'NVMe over MPTCP' delivered bandwidth up to four times
> > > that
> > > of
> > > standard TCP.
> > 
> > This is interesting. Did you try using an NVMe multipath iopolicy
> > other
> > than the default numa policy? Assuming both the host and target are
> > multihomed,
> > configuring round-robin or queue-depth may provide performance
> > comparable
> > to what you are seeing with MPTCP.
> > 
> > I think MPTCP shall distribute traffic using transport-level
> > metrics
> > such as
> > RTT, cwnd, and packet loss, whereas the NVMe multipath layer makes
> > decisions
> > based on ANA state, queue depth, and NUMA locality. In a setup with
> > multiple
> > active paths, switching the iopolicy from numa to round-robin or
> > queue-depth
> > could improve load distribution across controllers and thus improve
> > performance.
> > 
> > IMO, it would be useful to test with those policies and compare the
> > results
> > against the MPTCP setup.
> 
> Ming Lei also made a similar comment. In my experiments, I didn't set
> the multipath iopolicy, so I was using the default numa policy. In
> the
> follow-up, I'll adjust it to round-robin or queue-depth and rerun the
> experiments. I'll share the results in this email thread.

Based on your feedback, I have added iopolicy support to the NVMe over
MPTCP selftest script (see patch 8 in [1]). We can set the iopolicy to
round-robin like this:

 # ./mptcp_nvme.sh mptcp round-robin

This demonstrates that "NVMe over MPTCP" and "NVMe multipath" can work
simultaneously without conflict.

Using this test script, I compared three I/O policies: numa,
round-robin, and queue-depth. The fio results were very similar across
the three. It's possible that this test environment doesn't fully
reflect the differences between the I/O policies. I will continue to
follow up with further tests.

Thanks,
-Geliang

[1]
NVME over MPTCP, v4
https://patchwork.kernel.org/project/mptcp/cover/cover.1772683110.git.tanggeliang@kylinos.cn/

> 
> Thanks,
> -Geliang
> 
> > 
> > Thanks,
> > --Nilay


