[PATCH mptcp-next] mptcp: record subflows in RPS table

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH mptcp-next] mptcp: record subflows in RPS table
@ 2025-08-07 18:14 ` Christoph Paasch via B4 Relay
  0 siblings, 0 replies; 8+ messages in thread
From: Christoph Paasch @ 2025-08-07 18:14 UTC (permalink / raw)
  To: MPTCP Upstream, Matthieu Baerts, Mat Martineau, Geliang Tang
  Cc: Christoph Paasch

Accelerated Receive Flow Steering (aRFS) relies on sockets recording
their RX flow hash into the rps_sock_flow_table so that incoming packets
are steered to the CPU where the application runs.

With MPTCP, the application interacts with the parent MPTCP socket while
data is carried over per-subflow TCP sockets. Without recording these
subflows, aRFS cannot steer interrupts and RX processing for the flows
to the desired CPU.

Record all subflows in the RPS table by calling sock_rps_record_flow()
for each subflow at the start of mptcp_sendmsg() and mptcp_recvmsg().

It does not by itself improve throughput, but ensures that IRQ and RX
processing are directed to the right CPU, which is a
prerequisite for effective aRFS.

Signed-off-by: Christoph Paasch <cpaasch@openai.com>
---
 net/mptcp/protocol.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 4b510e04724f5c8be08a00a1cc03093bcd031905..86f1fc5be47bd5f6dc8bf40a488de32f253c1beb 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -12,6 +12,7 @@
 #include <linux/sched/signal.h>
 #include <linux/atomic.h>
 #include <net/aligned_data.h>
+#include <net/rps.h>
 #include <net/sock.h>
 #include <net/inet_common.h>
 #include <net/inet_hashtables.h>
@@ -1740,6 +1741,17 @@ static u32 mptcp_send_limit(const struct sock *sk)
 	return limit - not_sent;
 }
 
+static void mptcp_rps_record_subflows(const struct mptcp_sock *msk)
+{
+	struct mptcp_subflow_context *subflow;
+
+	mptcp_for_each_subflow(msk, subflow) {
+		struct sock *ssk = mptcp_subflow_tcp_sock(subflow);
+
+		sock_rps_record_flow(ssk);
+	}
+}
+
 static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 {
 	struct mptcp_sock *msk = mptcp_sk(sk);
@@ -1753,6 +1765,8 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 
 	lock_sock(sk);
 
+	mptcp_rps_record_subflows(msk);
+
 	if (unlikely(inet_test_bit(DEFER_CONNECT, sk) ||
 		     msg->msg_flags & MSG_FASTOPEN)) {
 		int copied_syn = 0;
@@ -2131,6 +2145,8 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 		goto out_err;
 	}
 
+	mptcp_rps_record_subflows(msk);
+
 	timeo = sock_rcvtimeo(sk, flags & MSG_DONTWAIT);
 
 	len = min_t(size_t, len, INT_MAX);

---
base-commit: f756e5bb29fa59378c111a2401fc64e44cd53442
change-id: 20250731-b4-mptcp_perf-02912dd4f65b

Best regards,
-- 
Christoph Paasch <cpaasch@openai.com>


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH mptcp-next] mptcp: record subflows in RPS table
@ 2025-08-07 18:14 ` Christoph Paasch via B4 Relay
  0 siblings, 0 replies; 8+ messages in thread
From: Christoph Paasch via B4 Relay @ 2025-08-07 18:14 UTC (permalink / raw)
  To: MPTCP Upstream, Matthieu Baerts, Mat Martineau, Geliang Tang
  Cc: Christoph Paasch

From: Christoph Paasch <cpaasch@openai.com>

Accelerated Receive Flow Steering (aRFS) relies on sockets recording
their RX flow hash into the rps_sock_flow_table so that incoming packets
are steered to the CPU where the application runs.

With MPTCP, the application interacts with the parent MPTCP socket while
data is carried over per-subflow TCP sockets. Without recording these
subflows, aRFS cannot steer interrupts and RX processing for the flows
to the desired CPU.

Record all subflows in the RPS table by calling sock_rps_record_flow()
for each subflow at the start of mptcp_sendmsg() and mptcp_recvmsg().

It does not by itself improve throughput, but ensures that IRQ and RX
processing are directed to the right CPU, which is a
prerequisite for effective aRFS.

Signed-off-by: Christoph Paasch <cpaasch@openai.com>
---
 net/mptcp/protocol.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 4b510e04724f5c8be08a00a1cc03093bcd031905..86f1fc5be47bd5f6dc8bf40a488de32f253c1beb 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -12,6 +12,7 @@
 #include <linux/sched/signal.h>
 #include <linux/atomic.h>
 #include <net/aligned_data.h>
+#include <net/rps.h>
 #include <net/sock.h>
 #include <net/inet_common.h>
 #include <net/inet_hashtables.h>
@@ -1740,6 +1741,17 @@ static u32 mptcp_send_limit(const struct sock *sk)
 	return limit - not_sent;
 }
 
+static void mptcp_rps_record_subflows(const struct mptcp_sock *msk)
+{
+	struct mptcp_subflow_context *subflow;
+
+	mptcp_for_each_subflow(msk, subflow) {
+		struct sock *ssk = mptcp_subflow_tcp_sock(subflow);
+
+		sock_rps_record_flow(ssk);
+	}
+}
+
 static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 {
 	struct mptcp_sock *msk = mptcp_sk(sk);
@@ -1753,6 +1765,8 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 
 	lock_sock(sk);
 
+	mptcp_rps_record_subflows(msk);
+
 	if (unlikely(inet_test_bit(DEFER_CONNECT, sk) ||
 		     msg->msg_flags & MSG_FASTOPEN)) {
 		int copied_syn = 0;
@@ -2131,6 +2145,8 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 		goto out_err;
 	}
 
+	mptcp_rps_record_subflows(msk);
+
 	timeo = sock_rcvtimeo(sk, flags & MSG_DONTWAIT);
 
 	len = min_t(size_t, len, INT_MAX);

---
base-commit: f756e5bb29fa59378c111a2401fc64e44cd53442
change-id: 20250731-b4-mptcp_perf-02912dd4f65b

Best regards,
-- 
Christoph Paasch <cpaasch@openai.com>



^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH mptcp-next] mptcp: record subflows in RPS table
  2025-08-07 18:14 ` Christoph Paasch via B4 Relay
  (?)
@ 2025-08-07 19:46 ` MPTCP CI
  -1 siblings, 0 replies; 8+ messages in thread
From: MPTCP CI @ 2025-08-07 19:46 UTC (permalink / raw)
  To: Christoph Paasch; +Cc: mptcp

Hi Christoph,

Thank you for your modifications, that's great!

Our CI did some validations and here is its report:

- KVM Validation: normal: Success! ✅
- KVM Validation: debug: Success! ✅
- KVM Validation: btf-normal (only bpftest_all): Success! ✅
- KVM Validation: btf-debug (only bpftest_all): Success! ✅
- Task: https://github.com/multipath-tcp/mptcp_net-next/actions/runs/16812537653

Initiator: Patchew Applier
Commits: https://github.com/multipath-tcp/mptcp_net-next/commits/8452b3707158
Patchwork: https://patchwork.kernel.org/project/mptcp/list/?series=989208

If there are some issues, you can reproduce them using the same environment as
the one used by the CI thanks to a docker image, e.g.:

    $ cd [kernel source code]
    $ docker run -v "${PWD}:${PWD}:rw" -w "${PWD}" --privileged --rm -it \
        --pull always mptcp/mptcp-upstream-virtme-docker:latest \
        auto-normal

For more details:

    https://github.com/multipath-tcp/mptcp-upstream-virtme-docker

Please note that despite all the efforts that have been already done to have a
stable tests suite when executed on a public CI like here, it is possible some
reported issues are not due to your modifications. Still, do not hesitate to
help us improve that ;-)

Cheers,
MPTCP GH Action bot
Bot operated by Matthieu Baerts (NGI0 Core)

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH mptcp-next] mptcp: record subflows in RPS table
  2025-08-07 18:14 ` Christoph Paasch via B4 Relay
  (?)
  (?)
@ 2025-08-13 14:53 ` Matthieu Baerts
  2025-08-14  3:31   ` Christoph Paasch
  -1 siblings, 1 reply; 8+ messages in thread
From: Matthieu Baerts @ 2025-08-13 14:53 UTC (permalink / raw)
  To: cpaasch; +Cc: MPTCP Upstream, Geliang Tang, Mat Martineau

Hi Christoph,

On 07/08/2025 20:14, Christoph Paasch via B4 Relay wrote:
> From: Christoph Paasch <cpaasch@openai.com>
> 
> Accelerated Receive Flow Steering (aRFS) relies on sockets recording
> their RX flow hash into the rps_sock_flow_table so that incoming packets
> are steered to the CPU where the application runs.
> 
> With MPTCP, the application interacts with the parent MPTCP socket while
> data is carried over per-subflow TCP sockets. Without recording these
> subflows, aRFS cannot steer interrupts and RX processing for the flows
> to the desired CPU.
> 
> Record all subflows in the RPS table by calling sock_rps_record_flow()
> for each subflow at the start of mptcp_sendmsg() and mptcp_recvmsg().
> 
> It does not by itself improve throughput, but ensures that IRQ and RX
> processing are directed to the right CPU, which is a
> prerequisite for effective aRFS.

Thank you for looking at that!

> Signed-off-by: Christoph Paasch <cpaasch@openai.com>
> ---
>  net/mptcp/protocol.c | 16 ++++++++++++++++
>  1 file changed, 16 insertions(+)
> 
> diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
> index 4b510e04724f5c8be08a00a1cc03093bcd031905..86f1fc5be47bd5f6dc8bf40a488de32f253c1beb 100644
> --- a/net/mptcp/protocol.c
> +++ b/net/mptcp/protocol.c
> @@ -12,6 +12,7 @@
>  #include <linux/sched/signal.h>
>  #include <linux/atomic.h>
>  #include <net/aligned_data.h>
> +#include <net/rps.h>
>  #include <net/sock.h>
>  #include <net/inet_common.h>
>  #include <net/inet_hashtables.h>
> @@ -1740,6 +1741,17 @@ static u32 mptcp_send_limit(const struct sock *sk)
>  	return limit - not_sent;
>  }
>  
> +static void mptcp_rps_record_subflows(const struct mptcp_sock *msk)
> +{
> +	struct mptcp_subflow_context *subflow;
> +
> +	mptcp_for_each_subflow(msk, subflow) {
> +		struct sock *ssk = mptcp_subflow_tcp_sock(subflow);
> +
> +		sock_rps_record_flow(ssk);

Should we eventually open code this RPS function and directly call
sock_rps_record_flow_hash()? If RPS is not needed, no need to go through
the subflow list. So we could add: #ifdef CONFIG_RPS + &rfs_needed
before the for_each, then check the state for each subflow, no?

Or maybe such optimisation is not worth it?

While at it, I see that sock_rps_record_flow() is called in
__inet_accept(), inet(6)_sendmsg and inet(6)_recvmsg, but with the msk.
I guess that's OK, right?
Also, is sock_rps_record_flow() called when a subflow is "accept()ed",
or is there something else to do there for this subflow to copy TCP's
behaviour?

Cheers,
Matt
-- 
Sponsored by the NGI0 Core fund.


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH mptcp-next] mptcp: record subflows in RPS table
  2025-08-13 14:53 ` Matthieu Baerts
@ 2025-08-14  3:31   ` Christoph Paasch
  2025-08-14 11:20     ` Matthieu Baerts
  0 siblings, 1 reply; 8+ messages in thread
From: Christoph Paasch @ 2025-08-14  3:31 UTC (permalink / raw)
  To: Matthieu Baerts; +Cc: MPTCP Upstream, Geliang Tang, Mat Martineau

On Wed, Aug 13, 2025 at 7:54 AM Matthieu Baerts <matttbe@kernel.org> wrote:
>
> Hi Christoph,
>
> On 07/08/2025 20:14, Christoph Paasch via B4 Relay wrote:
> > From: Christoph Paasch <cpaasch@openai.com>
> >
> > Accelerated Receive Flow Steering (aRFS) relies on sockets recording
> > their RX flow hash into the rps_sock_flow_table so that incoming packets
> > are steered to the CPU where the application runs.
> >
> > With MPTCP, the application interacts with the parent MPTCP socket while
> > data is carried over per-subflow TCP sockets. Without recording these
> > subflows, aRFS cannot steer interrupts and RX processing for the flows
> > to the desired CPU.
> >
> > Record all subflows in the RPS table by calling sock_rps_record_flow()
> > for each subflow at the start of mptcp_sendmsg() and mptcp_recvmsg().
> >
> > It does not by itself improve throughput, but ensures that IRQ and RX
> > processing are directed to the right CPU, which is a
> > prerequisite for effective aRFS.
>
> Thank you for looking at that!
>
> > Signed-off-by: Christoph Paasch <cpaasch@openai.com>
> > ---
> >  net/mptcp/protocol.c | 16 ++++++++++++++++
> >  1 file changed, 16 insertions(+)
> >
> > diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
> > index 4b510e04724f5c8be08a00a1cc03093bcd031905..86f1fc5be47bd5f6dc8bf40a488de32f253c1beb 100644
> > --- a/net/mptcp/protocol.c
> > +++ b/net/mptcp/protocol.c
> > @@ -12,6 +12,7 @@
> >  #include <linux/sched/signal.h>
> >  #include <linux/atomic.h>
> >  #include <net/aligned_data.h>
> > +#include <net/rps.h>
> >  #include <net/sock.h>
> >  #include <net/inet_common.h>
> >  #include <net/inet_hashtables.h>
> > @@ -1740,6 +1741,17 @@ static u32 mptcp_send_limit(const struct sock *sk)
> >       return limit - not_sent;
> >  }
> >
> > +static void mptcp_rps_record_subflows(const struct mptcp_sock *msk)
> > +{
> > +     struct mptcp_subflow_context *subflow;
> > +
> > +     mptcp_for_each_subflow(msk, subflow) {
> > +             struct sock *ssk = mptcp_subflow_tcp_sock(subflow);
> > +
> > +             sock_rps_record_flow(ssk);
>
> Should we eventually open code this RPS function and directly call
> sock_rps_record_flow_hash()? If RPS is not needed, no need to go through
> the subflow list. So we could add: #ifdef CONFIG_RPS + &rfs_needed
> before the for_each, then check the state for each subflow, no?
>
> Or maybe such optimisation is not worth it?

IMO it's not worth it. Because it would also mean that we pull in more
of non-MPTCP things over to MPTCP files.

> While at it, I see that sock_rps_record_flow() is called in
> __inet_accept(), inet(6)_sendmsg and inet(6)_recvmsg, but with the msk.
> I guess that's OK, right?

AFAICS, that should be alright. sk_rxhash should not be set on an msk.
Even on the passive opener side, we clone msk from the ssk, but at
that point sk_rxhash is not set yet. (I may be wrong of course, I'm
still not super familiar with the source-code...)

> Also, is sock_rps_record_flow() called when a subflow is "accept()ed",
> or is there something else to do there for this subflow to copy TCP's
> behaviour?

For new subflows, I don't think we can easily do that. Because
sock_rps_record_flow() records the CPU on which it is running. From
what I see, when new subflows get accepted, that is not happening in
the application's context necessarily. So, we have to wait for a
recvmsg/sendmsg.


Christoph

>
> Cheers,
> Matt
> --
> Sponsored by the NGI0 Core fund.
>

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH mptcp-next] mptcp: record subflows in RPS table
  2025-08-14  3:31   ` Christoph Paasch
@ 2025-08-14 11:20     ` Matthieu Baerts
  2025-08-14 11:47       ` Matthieu Baerts
  0 siblings, 1 reply; 8+ messages in thread
From: Matthieu Baerts @ 2025-08-14 11:20 UTC (permalink / raw)
  To: Christoph Paasch; +Cc: MPTCP Upstream, Geliang Tang, Mat Martineau

Hi Christoph,

On 14/08/2025 05:31, Christoph Paasch wrote:
> On Wed, Aug 13, 2025 at 7:54 AM Matthieu Baerts <matttbe@kernel.org> wrote:
>>
>> Hi Christoph,
>>
>> On 07/08/2025 20:14, Christoph Paasch via B4 Relay wrote:
>>> From: Christoph Paasch <cpaasch@openai.com>
>>>
>>> Accelerated Receive Flow Steering (aRFS) relies on sockets recording
>>> their RX flow hash into the rps_sock_flow_table so that incoming packets
>>> are steered to the CPU where the application runs.
>>>
>>> With MPTCP, the application interacts with the parent MPTCP socket while
>>> data is carried over per-subflow TCP sockets. Without recording these
>>> subflows, aRFS cannot steer interrupts and RX processing for the flows
>>> to the desired CPU.
>>>
>>> Record all subflows in the RPS table by calling sock_rps_record_flow()
>>> for each subflow at the start of mptcp_sendmsg() and mptcp_recvmsg().
>>>
>>> It does not by itself improve throughput, but ensures that IRQ and RX
>>> processing are directed to the right CPU, which is a
>>> prerequisite for effective aRFS.
>>
>> Thank you for looking at that!
>>
>>> Signed-off-by: Christoph Paasch <cpaasch@openai.com>
>>> ---
>>>  net/mptcp/protocol.c | 16 ++++++++++++++++
>>>  1 file changed, 16 insertions(+)
>>>
>>> diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
>>> index 4b510e04724f5c8be08a00a1cc03093bcd031905..86f1fc5be47bd5f6dc8bf40a488de32f253c1beb 100644
>>> --- a/net/mptcp/protocol.c
>>> +++ b/net/mptcp/protocol.c
>>> @@ -12,6 +12,7 @@
>>>  #include <linux/sched/signal.h>
>>>  #include <linux/atomic.h>
>>>  #include <net/aligned_data.h>
>>> +#include <net/rps.h>
>>>  #include <net/sock.h>
>>>  #include <net/inet_common.h>
>>>  #include <net/inet_hashtables.h>
>>> @@ -1740,6 +1741,17 @@ static u32 mptcp_send_limit(const struct sock *sk)
>>>       return limit - not_sent;
>>>  }
>>>
>>> +static void mptcp_rps_record_subflows(const struct mptcp_sock *msk)
>>> +{
>>> +     struct mptcp_subflow_context *subflow;
>>> +
>>> +     mptcp_for_each_subflow(msk, subflow) {
>>> +             struct sock *ssk = mptcp_subflow_tcp_sock(subflow);
>>> +
>>> +             sock_rps_record_flow(ssk);
>>
>> Should we eventually open code this RPS function and directly call
>> sock_rps_record_flow_hash()? If RPS is not needed, no need to go through
>> the subflow list. So we could add: #ifdef CONFIG_RPS + &rfs_needed
>> before the for_each, then check the state for each subflow, no?
>>
>> Or maybe such optimisation is not worth it?
> 
> IMO it's not worth it. Because it would also mean that we pull in more
> of non-MPTCP things over to MPTCP files.

Good point for the maintenance!

But then, what about adding a small helper in include/net/rps.h? Here we
could have this at the top of mptcp_rps_record_subflows():

  if (!sock_rps_record_needed())   /* with likely()? */
      return;

I'm suggesting this because rfs is disabled by default, so it sounds
better to check that before iterating over different subflows for
nothing, no?

>> While at it, I see that sock_rps_record_flow() is called in
>> __inet_accept(), inet(6)_sendmsg and inet(6)_recvmsg, but with the msk.
>> I guess that's OK, right?
> 
> AFAICS, that should be alright. sk_rxhash should not be set on an msk.
> Even on the passive opener side, we clone msk from the ssk, but at
> that point sk_rxhash is not set yet. (I may be wrong of course, I'm
> still not super familiar with the source-code...)

Thank you for having looked! I guess you would have seen issues when
testing if there was a problem with that, so no need to investigate more.

>> Also, is sock_rps_record_flow() called when a subflow is "accept()ed",
>> or is there something else to do there for this subflow to copy TCP's
>> behaviour?
> 
> For new subflows, I don't think we can easily do that. Because
> sock_rps_record_flow() records the CPU on which it is running. From
> what I see, when new subflows get accepted, that is not happening in
> the application's context necessarily. So, we have to wait for a
> recvmsg/sendmsg.

Ah yes, of course, we should not do it there for additional subflows!
And for the first subflow, I think that it will be done because when an
accept() is triggered from the userspace, mptcp_stream_accept(msk, arg)
will call inet_csk_accept(ssk, arg) that will end up doing that I think.

Cheers,
Matt
-- 
Sponsored by the NGI0 Core fund.


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH mptcp-next] mptcp: record subflows in RPS table
  2025-08-14 11:20     ` Matthieu Baerts
@ 2025-08-14 11:47       ` Matthieu Baerts
  2025-08-26  4:10         ` Christoph Paasch
  0 siblings, 1 reply; 8+ messages in thread
From: Matthieu Baerts @ 2025-08-14 11:47 UTC (permalink / raw)
  To: Christoph Paasch; +Cc: MPTCP Upstream, Geliang Tang, Mat Martineau

Hi Christoph,

On 14/08/2025 13:20, Matthieu Baerts wrote:
> Hi Christoph,
> 
> On 14/08/2025 05:31, Christoph Paasch wrote:
>> On Wed, Aug 13, 2025 at 7:54 AM Matthieu Baerts <matttbe@kernel.org> wrote:
>>>
>>> Hi Christoph,
>>>
>>> On 07/08/2025 20:14, Christoph Paasch via B4 Relay wrote:
>>>> From: Christoph Paasch <cpaasch@openai.com>
>>>>
>>>> Accelerated Receive Flow Steering (aRFS) relies on sockets recording
>>>> their RX flow hash into the rps_sock_flow_table so that incoming packets
>>>> are steered to the CPU where the application runs.
>>>>
>>>> With MPTCP, the application interacts with the parent MPTCP socket while
>>>> data is carried over per-subflow TCP sockets. Without recording these
>>>> subflows, aRFS cannot steer interrupts and RX processing for the flows
>>>> to the desired CPU.
>>>>
>>>> Record all subflows in the RPS table by calling sock_rps_record_flow()
>>>> for each subflow at the start of mptcp_sendmsg() and mptcp_recvmsg().
>>>>
>>>> It does not by itself improve throughput, but ensures that IRQ and RX
>>>> processing are directed to the right CPU, which is a
>>>> prerequisite for effective aRFS.
>>>
>>> Thank you for looking at that!
>>>
>>>> Signed-off-by: Christoph Paasch <cpaasch@openai.com>
>>>> ---
>>>>  net/mptcp/protocol.c | 16 ++++++++++++++++
>>>>  1 file changed, 16 insertions(+)
>>>>
>>>> diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
>>>> index 4b510e04724f5c8be08a00a1cc03093bcd031905..86f1fc5be47bd5f6dc8bf40a488de32f253c1beb 100644
>>>> --- a/net/mptcp/protocol.c
>>>> +++ b/net/mptcp/protocol.c
>>>> @@ -12,6 +12,7 @@
>>>>  #include <linux/sched/signal.h>
>>>>  #include <linux/atomic.h>
>>>>  #include <net/aligned_data.h>
>>>> +#include <net/rps.h>
>>>>  #include <net/sock.h>
>>>>  #include <net/inet_common.h>
>>>>  #include <net/inet_hashtables.h>
>>>> @@ -1740,6 +1741,17 @@ static u32 mptcp_send_limit(const struct sock *sk)
>>>>       return limit - not_sent;
>>>>  }
>>>>
>>>> +static void mptcp_rps_record_subflows(const struct mptcp_sock *msk)
>>>> +{
>>>> +     struct mptcp_subflow_context *subflow;
>>>> +
>>>> +     mptcp_for_each_subflow(msk, subflow) {
>>>> +             struct sock *ssk = mptcp_subflow_tcp_sock(subflow);
>>>> +
>>>> +             sock_rps_record_flow(ssk);
>>>
>>> Should we eventually open code this RPS function and directly call
>>> sock_rps_record_flow_hash()? If RPS is not needed, no need to go through
>>> the subflow list. So we could add: #ifdef CONFIG_RPS + &rfs_needed
>>> before the for_each, then check the state for each subflow, no?
>>>
>>> Or maybe such optimisation is not worth it?
>>
>> IMO it's not worth it. Because it would also mean that we pull in more
>> of non-MPTCP things over to MPTCP files.
> 
> Good point for the maintenance!
> 
> But then, what about adding a small helper in include/net/rps.h? Here we
> could have this at the top of mptcp_rps_record_subflows():
> 
>   if (!sock_rps_record_needed())   /* with likely()? */
>       return;
> 
> I'm suggesting this because rfs is disabled by default, so it sounds
> better to check that before iterating over different subflows for
> nothing, no?
> 
>>> While at it, I see that sock_rps_record_flow() is called in
>>> __inet_accept(), inet(6)_sendmsg and inet(6)_recvmsg, but with the msk.
>>> I guess that's OK, right?
>>
>> AFAICS, that should be alright. sk_rxhash should not be set on an msk.
>> Even on the passive opener side, we clone msk from the ssk, but at
>> that point sk_rxhash is not set yet. (I may be wrong of course, I'm
>> still not super familiar with the source-code...)
> 
> Thank you for having looked! I guess you would have seen issues when
> testing if there was a problem with that, so no need to investigate more.
> 
>>> Also, is sock_rps_record_flow() called when a subflow is "accept()ed",
>>> or is there something else to do there for this subflow to copy TCP's
>>> behaviour?
>>
>> For new subflows, I don't think we can easily do that. Because
>> sock_rps_record_flow() records the CPU on which it is running. From
>> what I see, when new subflows get accepted, that is not happening in
>> the application's context necessarily. So, we have to wait for a
>> recvmsg/sendmsg.
> 
> Ah yes, of course, we should not do it there for additional subflows!
> And for the first subflow, I think that it will be done because when an
> accept() is triggered from the userspace, mptcp_stream_accept(msk, arg)
> will call inet_csk_accept(ssk, arg) that will end up doing that I think.

My bad, I think __inet_accept will not be called from inet_csk_accept().
For TCP, I think the path is:

  inet_accept()
    --> inet_csk_accept()
    --> __inet_accept()

But with MPTCP, mptcp_stream_accept is called instead of inet_accept(),
and I don't think __inet_accept() will be called on a subflow. (I didn't
validate that.) Then, maybe we should call sock_rps_record_flow() on the
newsk in mptcp_stream_accept, no?

Cheers,
Matt
-- 
Sponsored by the NGI0 Core fund.


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH mptcp-next] mptcp: record subflows in RPS table
  2025-08-14 11:47       ` Matthieu Baerts
@ 2025-08-26  4:10         ` Christoph Paasch
  0 siblings, 0 replies; 8+ messages in thread
From: Christoph Paasch @ 2025-08-26  4:10 UTC (permalink / raw)
  To: Matthieu Baerts; +Cc: MPTCP Upstream, Geliang Tang, Mat Martineau

On Thu, Aug 14, 2025 at 4:47 AM Matthieu Baerts <matttbe@kernel.org> wrote:
>
> Hi Christoph,
>
> On 14/08/2025 13:20, Matthieu Baerts wrote:
> > Hi Christoph,
> >
> > On 14/08/2025 05:31, Christoph Paasch wrote:
> >> On Wed, Aug 13, 2025 at 7:54 AM Matthieu Baerts <matttbe@kernel.org> wrote:
> >>>
> >>> Hi Christoph,
> >>>
> >>> On 07/08/2025 20:14, Christoph Paasch via B4 Relay wrote:
> >>>> From: Christoph Paasch <cpaasch@openai.com>
> >>>>
> >>>> Accelerated Receive Flow Steering (aRFS) relies on sockets recording
> >>>> their RX flow hash into the rps_sock_flow_table so that incoming packets
> >>>> are steered to the CPU where the application runs.
> >>>>
> >>>> With MPTCP, the application interacts with the parent MPTCP socket while
> >>>> data is carried over per-subflow TCP sockets. Without recording these
> >>>> subflows, aRFS cannot steer interrupts and RX processing for the flows
> >>>> to the desired CPU.
> >>>>
> >>>> Record all subflows in the RPS table by calling sock_rps_record_flow()
> >>>> for each subflow at the start of mptcp_sendmsg() and mptcp_recvmsg().
> >>>>
> >>>> It does not by itself improve throughput, but ensures that IRQ and RX
> >>>> processing are directed to the right CPU, which is a
> >>>> prerequisite for effective aRFS.
> >>>
> >>> Thank you for looking at that!
> >>>
> >>>> Signed-off-by: Christoph Paasch <cpaasch@openai.com>
> >>>> ---
> >>>>  net/mptcp/protocol.c | 16 ++++++++++++++++
> >>>>  1 file changed, 16 insertions(+)
> >>>>
> >>>> diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
> >>>> index 4b510e04724f5c8be08a00a1cc03093bcd031905..86f1fc5be47bd5f6dc8bf40a488de32f253c1beb 100644
> >>>> --- a/net/mptcp/protocol.c
> >>>> +++ b/net/mptcp/protocol.c
> >>>> @@ -12,6 +12,7 @@
> >>>>  #include <linux/sched/signal.h>
> >>>>  #include <linux/atomic.h>
> >>>>  #include <net/aligned_data.h>
> >>>> +#include <net/rps.h>
> >>>>  #include <net/sock.h>
> >>>>  #include <net/inet_common.h>
> >>>>  #include <net/inet_hashtables.h>
> >>>> @@ -1740,6 +1741,17 @@ static u32 mptcp_send_limit(const struct sock *sk)
> >>>>       return limit - not_sent;
> >>>>  }
> >>>>
> >>>> +static void mptcp_rps_record_subflows(const struct mptcp_sock *msk)
> >>>> +{
> >>>> +     struct mptcp_subflow_context *subflow;
> >>>> +
> >>>> +     mptcp_for_each_subflow(msk, subflow) {
> >>>> +             struct sock *ssk = mptcp_subflow_tcp_sock(subflow);
> >>>> +
> >>>> +             sock_rps_record_flow(ssk);
> >>>
> >>> Should we eventually open code this RPS function and directly call
> >>> sock_rps_record_flow_hash()? If RPS is not needed, no need to go through
> >>> the subflow list. So we could add: #ifdef CONFIG_RPS + &rfs_needed
> >>> before the for_each, then check the state for each subflow, no?
> >>>
> >>> Or maybe such optimisation is not worth it?
> >>
> >> IMO it's not worth it. Because it would also mean that we pull in more
> >> of non-MPTCP things over to MPTCP files.
> >
> > Good point for the maintenance!
> >
> > But then, what about adding a small helper in include/net/rps.h? Here we
> > could have this at the top of mptcp_rps_record_subflows():
> >
> >   if (!sock_rps_record_needed())   /* with likely()? */
> >       return;
> >
> > I'm suggesting this because rfs is disabled by default, so it sounds
> > better to check that before iterating over different subflows for
> > nothing, no?

I see! Yes, that may be a good addition. I am doing that.

> >>> While at it, I see that sock_rps_record_flow() is called in
> >>> __inet_accept(), inet(6)_sendmsg and inet(6)_recvmsg, but with the msk.
> >>> I guess that's OK, right?
> >>
> >> AFAICS, that should be alright. sk_rxhash should not be set on an msk.
> >> Even on the passive opener side, we clone msk from the ssk, but at
> >> that point sk_rxhash is not set yet. (I may be wrong of course, I'm
> >> still not super familiar with the source-code...)
> >
> > Thank you for having looked! I guess you would have seen issues when
> > testing if there was a problem with that, so no need to investigate more.
> >
> >>> Also, is sock_rps_record_flow() called when a subflow is "accept()ed",
> >>> or is there something else to do there for this subflow to copy TCP's
> >>> behaviour?
> >>
> >> For new subflows, I don't think we can easily do that. Because
> >> sock_rps_record_flow() records the CPU on which it is running. From
> >> what I see, when new subflows get accepted, that is not happening in
> >> the application's context necessarily. So, we have to wait for a
> >> recvmsg/sendmsg.
> >
> > Ah yes, of course, we should not do it there for additional subflows!
> > And for the first subflow, I think that it will be done because when an
> > accept() is triggered from the userspace, mptcp_stream_accept(msk, arg)
> > will call inet_csk_accept(ssk, arg) that will end up doing that I think.
>
> My bad, I think __inet_accept will not be called from inet_csk_accept().
> For TCP, I think the path is:
>
>   inet_accept()
>     --> inet_csk_accept()
>     --> __inet_accept()
>
> But with MPTCP, mptcp_stream_accept is called instead of inet_accept(),
> and I don't think __inet_accept() will be called on a subflow. (I didn't
> validate that.) Then, maybe we should call sock_rps_record_flow() on the
> newsk in mptcp_stream_accept, no?

Indeed! I will iterate over the subflows there as well and take care of them.


Christoph

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2025-08-26  4:10 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-08-07 18:14 [PATCH mptcp-next] mptcp: record subflows in RPS table Christoph Paasch
2025-08-07 18:14 ` Christoph Paasch via B4 Relay
2025-08-07 19:46 ` MPTCP CI
2025-08-13 14:53 ` Matthieu Baerts
2025-08-14  3:31   ` Christoph Paasch
2025-08-14 11:20     ` Matthieu Baerts
2025-08-14 11:47       ` Matthieu Baerts
2025-08-26  4:10         ` Christoph Paasch

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.