* [PATCH net-next 1/9] mptcp: add a new sysctl for make after break timeout
2023-10-23 20:44 [PATCH net-next 0/9] mptcp: Features and fixes for v6.7 Mat Martineau
@ 2023-10-23 20:44 ` Mat Martineau
2023-10-25 0:54 ` Jakub Kicinski
2023-10-23 20:44 ` [PATCH net-next 2/9] mptcp: properly account fastopen data Mat Martineau
` (8 subsequent siblings)
9 siblings, 1 reply; 17+ messages in thread
From: Mat Martineau @ 2023-10-23 20:44 UTC (permalink / raw)
To: Matthieu Baerts, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: netdev, mptcp, Mat Martineau
From: Paolo Abeni <pabeni@redhat.com>
The MPTCP protocol allows sockets with no alive subflows to stay
in ESTABLISHED status for a user-defined timeout, to allow for
later subflow creation.
Currently such timeout is constant - TCP_TIMEWAIT_LEN. Let
user-space configure it via a newly added sysctl, to better cope
with busy servers and to simplify (and speed up) the relevant
packetdrill tests.
Note that the new knob does not apply to orphaned MPTCP sockets
waiting for the data_fin handshake completion: they always wait
TCP_TIMEWAIT_LEN.
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <martineau@kernel.org>
---
Documentation/networking/mptcp-sysctl.rst | 11 +++++++++++
net/mptcp/ctrl.c | 16 ++++++++++++++++
net/mptcp/protocol.c | 6 +++---
net/mptcp/protocol.h | 1 +
4 files changed, 31 insertions(+), 3 deletions(-)
diff --git a/Documentation/networking/mptcp-sysctl.rst b/Documentation/networking/mptcp-sysctl.rst
index 15f1919d640c..69975ce25a02 100644
--- a/Documentation/networking/mptcp-sysctl.rst
+++ b/Documentation/networking/mptcp-sysctl.rst
@@ -25,6 +25,17 @@ add_addr_timeout - INTEGER (seconds)
Default: 120
+close_timeout - INTEGER (seconds)
+ Set the make-after-break timeout: in the absence of any close or
+ shutdown syscall, MPTCP sockets will maintain their status
+ unchanged for this long after the last subflow removal, before
+ moving to TCP_CLOSE.
+
+ The default value matches TCP_TIMEWAIT_LEN. This is a per-namespace
+ sysctl.
+
+ Default: 60
+
checksum_enabled - BOOLEAN
Control whether DSS checksum can be enabled.
diff --git a/net/mptcp/ctrl.c b/net/mptcp/ctrl.c
index e72b518c5d02..13fe0748dde8 100644
--- a/net/mptcp/ctrl.c
+++ b/net/mptcp/ctrl.c
@@ -27,6 +27,7 @@ struct mptcp_pernet {
#endif
unsigned int add_addr_timeout;
+ unsigned int close_timeout;
unsigned int stale_loss_cnt;
u8 mptcp_enabled;
u8 checksum_enabled;
@@ -65,6 +66,13 @@ unsigned int mptcp_stale_loss_cnt(const struct net *net)
return mptcp_get_pernet(net)->stale_loss_cnt;
}
+unsigned int mptcp_close_timeout(const struct sock *sk)
+{
+ if (sock_flag(sk, SOCK_DEAD))
+ return TCP_TIMEWAIT_LEN;
+ return mptcp_get_pernet(sock_net(sk))->close_timeout;
+}
+
int mptcp_get_pm_type(const struct net *net)
{
return mptcp_get_pernet(net)->pm_type;
@@ -79,6 +87,7 @@ static void mptcp_pernet_set_defaults(struct mptcp_pernet *pernet)
{
pernet->mptcp_enabled = 1;
pernet->add_addr_timeout = TCP_RTO_MAX;
+ pernet->close_timeout = TCP_TIMEWAIT_LEN;
pernet->checksum_enabled = 0;
pernet->allow_join_initial_addr_port = 1;
pernet->stale_loss_cnt = 4;
@@ -141,6 +150,12 @@ static struct ctl_table mptcp_sysctl_table[] = {
.mode = 0644,
.proc_handler = proc_dostring,
},
+ {
+ .procname = "close_timeout",
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_jiffies,
+ },
{}
};
@@ -163,6 +178,7 @@ static int mptcp_pernet_new_table(struct net *net, struct mptcp_pernet *pernet)
table[4].data = &pernet->stale_loss_cnt;
table[5].data = &pernet->pm_type;
table[6].data = &pernet->scheduler;
+ table[7].data = &pernet->close_timeout;
hdr = register_net_sysctl_sz(net, MPTCP_SYSCTL_PATH, table,
ARRAY_SIZE(mptcp_sysctl_table));
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 886ab689a8ae..a21f8ed26343 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -2391,8 +2391,8 @@ static void __mptcp_close_ssk(struct sock *sk, struct sock *ssk,
if (msk->in_accept_queue && msk->first == ssk &&
(sock_flag(sk, SOCK_DEAD) || sock_flag(ssk, SOCK_DEAD))) {
/* ensure later check in mptcp_worker() will dispose the msk */
- mptcp_set_close_tout(sk, tcp_jiffies32 - (TCP_TIMEWAIT_LEN + 1));
sock_set_flag(sk, SOCK_DEAD);
+ mptcp_set_close_tout(sk, tcp_jiffies32 - (mptcp_close_timeout(sk) + 1));
lock_sock_nested(ssk, SINGLE_DEPTH_NESTING);
mptcp_subflow_drop_ctx(ssk);
goto out_release;
@@ -2516,7 +2516,7 @@ static bool mptcp_close_tout_expired(const struct sock *sk)
return false;
return time_after32(tcp_jiffies32,
- inet_csk(sk)->icsk_mtup.probe_timestamp + TCP_TIMEWAIT_LEN);
+ inet_csk(sk)->icsk_mtup.probe_timestamp + mptcp_close_timeout(sk));
}
static void mptcp_check_fastclose(struct mptcp_sock *msk)
@@ -2659,7 +2659,7 @@ void mptcp_reset_tout_timer(struct mptcp_sock *msk, unsigned long fail_tout)
return;
close_timeout = inet_csk(sk)->icsk_mtup.probe_timestamp - tcp_jiffies32 + jiffies +
- TCP_TIMEWAIT_LEN;
+ mptcp_close_timeout(sk);
/* the close timeout takes precedence on the fail one, and here at least one of
* them is active
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index 3612545fa62e..02556921bc6c 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -613,6 +613,7 @@ unsigned int mptcp_get_add_addr_timeout(const struct net *net);
int mptcp_is_checksum_enabled(const struct net *net);
int mptcp_allow_join_id0(const struct net *net);
unsigned int mptcp_stale_loss_cnt(const struct net *net);
+unsigned int mptcp_close_timeout(const struct sock *sk);
int mptcp_get_pm_type(const struct net *net);
const char *mptcp_get_scheduler(const struct net *net);
void mptcp_subflow_fully_established(struct mptcp_subflow_context *subflow,
--
2.41.0
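For readers who want to poke at the new knob, a minimal userspace sketch
(the 30 second value is arbitrary, and the path assumes a kernel carrying
this patch):

#include <stdio.h>

int main(void)
{
	const char *path = "/proc/sys/net/mptcp/close_timeout";
	unsigned int val = 0;
	FILE *f;

	/* write the new timeout, in seconds */
	f = fopen(path, "w");
	if (!f || fprintf(f, "30") < 0)
		return perror(path), 1;
	fclose(f);

	/* read it back */
	f = fopen(path, "r");
	if (!f || fscanf(f, "%u", &val) != 1)
		return perror(path), 1;
	fclose(f);

	printf("close_timeout is now %u seconds\n", val);
	return 0;
}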
* Re: [PATCH net-next 1/9] mptcp: add a new sysctl for make after break timeout
2023-10-23 20:44 ` [PATCH net-next 1/9] mptcp: add a new sysctl for make after break timeout Mat Martineau
@ 2023-10-25 0:54 ` Jakub Kicinski
2023-10-25 13:48 ` Paolo Abeni
0 siblings, 1 reply; 17+ messages in thread
From: Jakub Kicinski @ 2023-10-25 0:54 UTC (permalink / raw)
To: Mat Martineau
Cc: Matthieu Baerts, David S. Miller, Eric Dumazet, Paolo Abeni,
netdev, mptcp
On Mon, 23 Oct 2023 13:44:34 -0700 Mat Martineau wrote:
> + .procname = "close_timeout",
> + .maxlen = sizeof(unsigned int),
> + .mode = 0644,
> + .proc_handler = proc_dointvec_jiffies,
Silly question - proc_dointvec_jiffies() works fine for unsigned types?
* Re: [PATCH net-next 1/9] mptcp: add a new sysctl for make after break timeout
2023-10-25 0:54 ` Jakub Kicinski
@ 2023-10-25 13:48 ` Paolo Abeni
0 siblings, 0 replies; 17+ messages in thread
From: Paolo Abeni @ 2023-10-25 13:48 UTC (permalink / raw)
To: Jakub Kicinski, Mat Martineau
Cc: Matthieu Baerts, David S. Miller, Eric Dumazet, netdev, mptcp
On Tue, 2023-10-24 at 17:54 -0700, Jakub Kicinski wrote:
> On Mon, 23 Oct 2023 13:44:34 -0700 Mat Martineau wrote:
> > + .procname = "close_timeout",
> > + .maxlen = sizeof(unsigned int),
> > + .mode = 0644,
> > + .proc_handler = proc_dointvec_jiffies,
>
> Silly question - proc_dointvec_jiffies() works fine for unsigned types?
AFAICS, yes: proc_dointvec_jiffies() internally uses
__do_proc_dointvec(), which parses the buffer to a 'long' local
variable, and then do_proc_dointvec_jiffies_conv() converts it to
an integer, taking care of possible overflows.
Cheers,
Paolo
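In sketch form, the overflow guard described above (a userspace
illustration with an assumed HZ and made-up helper names, not the
kernel's exact code):

#include <stdio.h>
#include <limits.h>

#define HZ 250	/* assumed tick rate, for illustration only */

/* reject second counts whose jiffies equivalent would overflow the
 * unsigned int sysctl storage, mirroring the conversion's intent
 */
static int seconds_to_jiffies_checked(long sec, unsigned int *jiffies)
{
	if (sec < 0 || (unsigned long)sec > UINT_MAX / HZ)
		return -1;
	*jiffies = (unsigned int)sec * HZ;
	return 0;
}

int main(void)
{
	unsigned int j;

	if (!seconds_to_jiffies_checked(60, &j))
		printf("60s -> %u jiffies\n", j);
	if (seconds_to_jiffies_checked(LONG_MAX, &j))
		printf("LONG_MAX rejected: would overflow\n");
	return 0;
}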
* [PATCH net-next 2/9] mptcp: properly account fastopen data
2023-10-23 20:44 [PATCH net-next 0/9] mptcp: Features and fixes for v6.7 Mat Martineau
2023-10-23 20:44 ` [PATCH net-next 1/9] mptcp: add a new sysctl for make after break timeout Mat Martineau
@ 2023-10-23 20:44 ` Mat Martineau
2023-10-23 20:44 ` [PATCH net-next 3/9] mptcp: use plain bool instead of custom binary enum Mat Martineau
` (7 subsequent siblings)
9 siblings, 0 replies; 17+ messages in thread
From: Mat Martineau @ 2023-10-23 20:44 UTC (permalink / raw)
To: Matthieu Baerts, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: netdev, mptcp, Mat Martineau
From: Paolo Abeni <pabeni@redhat.com>
Currently the socket-level counter aggregating the received data
does not take into account the data received via fastopen.
Address the issue by updating the counter as required.
Fixes: 38967f424b5b ("mptcp: track some aggregate data counters")
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <martineau@kernel.org>
---
net/mptcp/fastopen.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/net/mptcp/fastopen.c b/net/mptcp/fastopen.c
index bceaab8dd8e4..74698582a285 100644
--- a/net/mptcp/fastopen.c
+++ b/net/mptcp/fastopen.c
@@ -52,6 +52,7 @@ void mptcp_fastopen_subflow_synack_set_params(struct mptcp_subflow_context *subf
mptcp_set_owner_r(skb, sk);
__skb_queue_tail(&sk->sk_receive_queue, skb);
+ mptcp_sk(sk)->bytes_received += skb->len;
sk->sk_data_ready(sk);
--
2.41.0
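For context, the counter fixed here is the one user-space observes via
getsockopt(MPTCP_INFO); a hedged sketch (the fallback defines cover older
headers, and mptcpi_bytes_received assumes a uapi recent enough to carry
38967f424b5b):

#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <linux/mptcp.h>

#ifndef IPPROTO_MPTCP
#define IPPROTO_MPTCP 262
#endif
#ifndef SOL_MPTCP
#define SOL_MPTCP 284
#endif

/* print the aggregate rx counter of an MPTCP socket */
static void print_rx_bytes(int fd)
{
	struct mptcp_info info = { 0 };
	socklen_t len = sizeof(info);

	if (getsockopt(fd, SOL_MPTCP, MPTCP_INFO, &info, &len) == 0)
		printf("bytes_received: %llu\n",
		       (unsigned long long)info.mptcpi_bytes_received);
	else
		perror("getsockopt(MPTCP_INFO)");
}

int main(void)
{
	int fd = socket(AF_INET, SOCK_STREAM, IPPROTO_MPTCP);

	if (fd < 0)
		return perror("socket"), 1;
	/* freshly created, so this prints 0; transfer data to see it grow */
	print_rx_bytes(fd);
	return 0;
}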
* [PATCH net-next 3/9] mptcp: use plain bool instead of custom binary enum
2023-10-23 20:44 [PATCH net-next 0/9] mptcp: Features and fixes for v6.7 Mat Martineau
2023-10-23 20:44 ` [PATCH net-next 1/9] mptcp: add a new sysctl for make after break timeout Mat Martineau
2023-10-23 20:44 ` [PATCH net-next 2/9] mptcp: properly account fastopen data Mat Martineau
@ 2023-10-23 20:44 ` Mat Martineau
2023-10-23 20:44 ` [PATCH net-next 4/9] tcp: define initial scaling factor value as a macro Mat Martineau
` (6 subsequent siblings)
9 siblings, 0 replies; 17+ messages in thread
From: Mat Martineau @ 2023-10-23 20:44 UTC (permalink / raw)
To: Matthieu Baerts, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: netdev, mptcp, Mat Martineau
From: Paolo Abeni <pabeni@redhat.com>
The 'data_avail' subflow field is already used as a plain boolean;
drop the custom binary enum type and switch to bool.
No functional change intended.
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <martineau@kernel.org>
---
net/mptcp/protocol.h | 7 +------
net/mptcp/subflow.c | 12 ++++++------
2 files changed, 7 insertions(+), 12 deletions(-)
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index 02556921bc6c..4c9e7ade160d 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -432,11 +432,6 @@ mptcp_subflow_rsk(const struct request_sock *rsk)
return (struct mptcp_subflow_request_sock *)rsk;
}
-enum mptcp_data_avail {
- MPTCP_SUBFLOW_NODATA,
- MPTCP_SUBFLOW_DATA_AVAIL,
-};
-
struct mptcp_delegated_action {
struct napi_struct napi;
struct list_head head;
@@ -492,7 +487,7 @@ struct mptcp_subflow_context {
valid_csum_seen : 1, /* at least one csum validated */
is_mptfo : 1, /* subflow is doing TFO */
__unused : 9;
- enum mptcp_data_avail data_avail;
+ bool data_avail;
bool scheduled;
u32 remote_nonce;
u64 thmac;
diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index 9c1f8d1d63d2..dbc7a52b322f 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -1237,7 +1237,7 @@ static bool subflow_check_data_avail(struct sock *ssk)
struct sk_buff *skb;
if (!skb_peek(&ssk->sk_receive_queue))
- WRITE_ONCE(subflow->data_avail, MPTCP_SUBFLOW_NODATA);
+ WRITE_ONCE(subflow->data_avail, false);
if (subflow->data_avail)
return true;
@@ -1271,7 +1271,7 @@ static bool subflow_check_data_avail(struct sock *ssk)
continue;
}
- WRITE_ONCE(subflow->data_avail, MPTCP_SUBFLOW_DATA_AVAIL);
+ WRITE_ONCE(subflow->data_avail, true);
break;
}
return true;
@@ -1293,7 +1293,7 @@ static bool subflow_check_data_avail(struct sock *ssk)
goto reset;
}
mptcp_subflow_fail(msk, ssk);
- WRITE_ONCE(subflow->data_avail, MPTCP_SUBFLOW_DATA_AVAIL);
+ WRITE_ONCE(subflow->data_avail, true);
return true;
}
@@ -1310,7 +1310,7 @@ static bool subflow_check_data_avail(struct sock *ssk)
while ((skb = skb_peek(&ssk->sk_receive_queue)))
sk_eat_skb(ssk, skb);
tcp_send_active_reset(ssk, GFP_ATOMIC);
- WRITE_ONCE(subflow->data_avail, MPTCP_SUBFLOW_NODATA);
+ WRITE_ONCE(subflow->data_avail, false);
return false;
}
@@ -1322,7 +1322,7 @@ static bool subflow_check_data_avail(struct sock *ssk)
subflow->map_seq = READ_ONCE(msk->ack_seq);
subflow->map_data_len = skb->len;
subflow->map_subflow_seq = tcp_sk(ssk)->copied_seq - subflow->ssn_offset;
- WRITE_ONCE(subflow->data_avail, MPTCP_SUBFLOW_DATA_AVAIL);
+ WRITE_ONCE(subflow->data_avail, true);
return true;
}
@@ -1334,7 +1334,7 @@ bool mptcp_subflow_data_available(struct sock *sk)
if (subflow->map_valid &&
mptcp_subflow_get_map_offset(subflow) >= subflow->map_data_len) {
subflow->map_valid = 0;
- WRITE_ONCE(subflow->data_avail, MPTCP_SUBFLOW_NODATA);
+ WRITE_ONCE(subflow->data_avail, false);
pr_debug("Done with mapping: seq=%u data_len=%u",
subflow->map_subflow_seq,
--
2.41.0
* [PATCH net-next 4/9] tcp: define initial scaling factor value as a macro
2023-10-23 20:44 [PATCH net-next 0/9] mptcp: Features and fixes for v6.7 Mat Martineau
` (2 preceding siblings ...)
2023-10-23 20:44 ` [PATCH net-next 3/9] mptcp: use plain bool instead of custom binary enum Mat Martineau
@ 2023-10-23 20:44 ` Mat Martineau
2023-10-23 20:44 ` [PATCH net-next 5/9] mptcp: give rcvlowat some love Mat Martineau
` (5 subsequent siblings)
9 siblings, 0 replies; 17+ messages in thread
From: Mat Martineau @ 2023-10-23 20:44 UTC (permalink / raw)
To: Matthieu Baerts, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: netdev, mptcp, Mat Martineau
From: Paolo Abeni <pabeni@redhat.com>
So that other users can access it. Notably, MPTCP will use
it in the next patch.
No functional change intended.
Acked-by: Matthieu Baerts <matttbe@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <martineau@kernel.org>
---
include/net/tcp.h | 12 +++++++-----
1 file changed, 7 insertions(+), 5 deletions(-)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 39b731c900dd..993b7fcd4e46 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1489,13 +1489,15 @@ static inline int tcp_space_from_win(const struct sock *sk, int win)
return __tcp_space_from_win(tcp_sk(sk)->scaling_ratio, win);
}
+/* Assume a conservative default of 1200 bytes of payload per 4K page.
+ * This may be adjusted later in tcp_measure_rcv_mss().
+ */
+#define TCP_DEFAULT_SCALING_RATIO ((1200 << TCP_RMEM_TO_WIN_SCALE) / \
+ SKB_TRUESIZE(4096))
+
static inline void tcp_scaling_ratio_init(struct sock *sk)
{
- /* Assume a conservative default of 1200 bytes of payload per 4K page.
- * This may be adjusted later in tcp_measure_rcv_mss().
- */
- tcp_sk(sk)->scaling_ratio = (1200 << TCP_RMEM_TO_WIN_SCALE) /
- SKB_TRUESIZE(4096);
+ tcp_sk(sk)->scaling_ratio = TCP_DEFAULT_SCALING_RATIO;
}
/* Note: caller must be prepared to deal with negative returns */
--
2.41.0
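As a sanity check on the constant, a toy computation (the scale of 8
matches TCP_RMEM_TO_WIN_SCALE in current kernels; the 4608-byte truesize
is only an assumed SKB_TRUESIZE(4096), the real value depends on struct
sizes):

#include <stdio.h>

int main(void)
{
	const int scale = 8;		/* assumed TCP_RMEM_TO_WIN_SCALE */
	const int truesize = 4608;	/* assumed SKB_TRUESIZE(4096) */
	int ratio = (1200 << scale) / truesize;

	/* ~66, i.e. about 26% of each skb's truesize counted as window */
	printf("default scaling_ratio = %d\n", ratio);
	printf("window bytes per 4K skb = %d\n", (truesize * ratio) >> scale);
	return 0;
}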
* [PATCH net-next 5/9] mptcp: give rcvlowat some love
2023-10-23 20:44 [PATCH net-next 0/9] mptcp: Features and fixes for v6.7 Mat Martineau
` (3 preceding siblings ...)
2023-10-23 20:44 ` [PATCH net-next 4/9] tcp: define initial scaling factor value as a macro Mat Martineau
@ 2023-10-23 20:44 ` Mat Martineau
2023-10-23 20:44 ` [PATCH net-next 6/9] mptcp: use copy_from_iter helpers on transmit Mat Martineau
` (4 subsequent siblings)
9 siblings, 0 replies; 17+ messages in thread
From: Mat Martineau @ 2023-10-23 20:44 UTC (permalink / raw)
To: Matthieu Baerts, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: netdev, mptcp, Mat Martineau
From: Paolo Abeni <pabeni@redhat.com>
The MPTCP protocol allows setting sk_rcvlowat, but the value there
is currently ignored.
Additionally, the default subflow sk_rcvlowat basically disables
per-subflow delayed acks: the MPTCP protocol moves the incoming data
from the subflows into the msk socket as soon as the TCP stack invokes
the subflow data_ready callback. Later, when __tcp_ack_snd_check()
takes action, the subflow-level copied_seq matches rcv_nxt, and that
mandates an immediate ack.
Let the mptcp receive path be aware of such a threshold, explicitly
tracking the amount of data available to be read and checking it
against sk_rcvlowat in mptcp_poll() and before waking up readers.
Additionally, implement the set_rcvlowat() callback, to properly
handle the rcvbuf auto-tuning on sk_rcvlowat changes.
Finally, to properly handle delayed acks, force the subflow-level
threshold to 0 and instead explicitly ask for an immediate ack when
the msk-level threshold is not reached.
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <martineau@kernel.org>
---
net/mptcp/protocol.c | 24 +++++++++++-------------
net/mptcp/protocol.h | 20 ++++++++++++++++++++
net/mptcp/sockopt.c | 42 ++++++++++++++++++++++++++++++++++++++++++
net/mptcp/subflow.c | 12 ++++++++++--
4 files changed, 83 insertions(+), 15 deletions(-)
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index a21f8ed26343..7036e30c449f 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -863,9 +863,8 @@ void mptcp_data_ready(struct sock *sk, struct sock *ssk)
/* Wake-up the reader only for in-sequence data */
mptcp_data_lock(sk);
- if (move_skbs_to_msk(msk, ssk))
+ if (move_skbs_to_msk(msk, ssk) && mptcp_epollin_ready(sk))
sk->sk_data_ready(sk);
-
mptcp_data_unlock(sk);
}
@@ -1922,6 +1921,7 @@ static int __mptcp_recvmsg_mskq(struct mptcp_sock *msk,
if (!(flags & MSG_PEEK)) {
MPTCP_SKB_CB(skb)->offset += count;
MPTCP_SKB_CB(skb)->map_seq += count;
+ msk->bytes_consumed += count;
}
break;
}
@@ -1932,6 +1932,7 @@ static int __mptcp_recvmsg_mskq(struct mptcp_sock *msk,
WRITE_ONCE(msk->rmem_released, msk->rmem_released + skb->truesize);
__skb_unlink(skb, &msk->receive_queue);
__kfree_skb(skb);
+ msk->bytes_consumed += count;
}
if (copied >= len)
@@ -2755,6 +2756,7 @@ static void __mptcp_init_sock(struct sock *sk)
msk->rmem_fwd_alloc = 0;
WRITE_ONCE(msk->rmem_released, 0);
msk->timer_ival = TCP_RTO_MIN;
+ msk->scaling_ratio = TCP_DEFAULT_SCALING_RATIO;
WRITE_ONCE(msk->first, NULL);
inet_csk(sk)->icsk_sync_mss = mptcp_sync_mss;
@@ -2964,16 +2966,9 @@ void __mptcp_unaccepted_force_close(struct sock *sk)
__mptcp_destroy_sock(sk);
}
-static __poll_t mptcp_check_readable(struct mptcp_sock *msk)
+static __poll_t mptcp_check_readable(struct sock *sk)
{
- /* Concurrent splices from sk_receive_queue into receive_queue will
- * always show at least one non-empty queue when checked in this order.
- */
- if (skb_queue_empty_lockless(&((struct sock *)msk)->sk_receive_queue) &&
- skb_queue_empty_lockless(&msk->receive_queue))
- return 0;
-
- return EPOLLIN | EPOLLRDNORM;
+ return mptcp_epollin_ready(sk) ? EPOLLIN | EPOLLRDNORM : 0;
}
static void mptcp_check_listen_stop(struct sock *sk)
@@ -3011,7 +3006,7 @@ bool __mptcp_close(struct sock *sk, long timeout)
goto cleanup;
}
- if (mptcp_check_readable(msk) || timeout < 0) {
+ if (mptcp_data_avail(msk) || timeout < 0) {
/* If the msk has read data, or the caller explicitly ask it,
* do the MPTCP equivalent of TCP reset, aka MPTCP fastclose
*/
@@ -3138,6 +3133,7 @@ static int mptcp_disconnect(struct sock *sk, int flags)
msk->snd_data_fin_enable = false;
msk->rcv_fastclose = false;
msk->use_64bit_ack = false;
+ msk->bytes_consumed = 0;
WRITE_ONCE(msk->csum_enabled, mptcp_is_checksum_enabled(sock_net(sk)));
mptcp_pm_data_reset(msk);
mptcp_ca_reset(sk);
@@ -3909,7 +3905,7 @@ static __poll_t mptcp_poll(struct file *file, struct socket *sock,
mask |= EPOLLIN | EPOLLRDNORM | EPOLLRDHUP;
if (state != TCP_SYN_SENT && state != TCP_SYN_RECV) {
- mask |= mptcp_check_readable(msk);
+ mask |= mptcp_check_readable(sk);
if (shutdown & SEND_SHUTDOWN)
mask |= EPOLLOUT | EPOLLWRNORM;
else
@@ -3947,6 +3943,7 @@ static const struct proto_ops mptcp_stream_ops = {
.sendmsg = inet_sendmsg,
.recvmsg = inet_recvmsg,
.mmap = sock_no_mmap,
+ .set_rcvlowat = mptcp_set_rcvlowat,
};
static struct inet_protosw mptcp_protosw = {
@@ -4048,6 +4045,7 @@ static const struct proto_ops mptcp_v6_stream_ops = {
#ifdef CONFIG_COMPAT
.compat_ioctl = inet6_compat_ioctl,
#endif
+ .set_rcvlowat = mptcp_set_rcvlowat,
};
static struct proto mptcp_v6_prot;
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index 4c9e7ade160d..620a82cd4c10 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -267,6 +267,7 @@ struct mptcp_sock {
atomic64_t rcv_wnd_sent;
u64 rcv_data_fin_seq;
u64 bytes_retrans;
+ u64 bytes_consumed;
int rmem_fwd_alloc;
int snd_burst;
int old_wspace;
@@ -657,6 +658,24 @@ struct sock *mptcp_subflow_get_retrans(struct mptcp_sock *msk);
int mptcp_sched_get_send(struct mptcp_sock *msk);
int mptcp_sched_get_retrans(struct mptcp_sock *msk);
+static inline u64 mptcp_data_avail(const struct mptcp_sock *msk)
+{
+ return READ_ONCE(msk->bytes_received) - READ_ONCE(msk->bytes_consumed);
+}
+
+static inline bool mptcp_epollin_ready(const struct sock *sk)
+{
+ /* mptcp doesn't have to deal with small skbs in the receive queue,
+ * as it can always coalesce them
+ */
+ return (mptcp_data_avail(mptcp_sk(sk)) >= sk->sk_rcvlowat) ||
+ (mem_cgroup_sockets_enabled && sk->sk_memcg &&
+ mem_cgroup_under_socket_pressure(sk->sk_memcg)) ||
+ READ_ONCE(tcp_memory_pressure);
+}
+
+int mptcp_set_rcvlowat(struct sock *sk, int val);
+
static inline bool __tcp_can_send(const struct sock *ssk)
{
/* only send if our side has not closed yet */
@@ -731,6 +750,7 @@ static inline bool mptcp_is_fully_established(struct sock *sk)
return inet_sk_state_load(sk) == TCP_ESTABLISHED &&
READ_ONCE(mptcp_sk(sk)->fully_established);
}
+
void mptcp_rcv_space_init(struct mptcp_sock *msk, const struct sock *ssk);
void mptcp_data_ready(struct sock *sk, struct sock *ssk);
bool mptcp_finish_join(struct sock *sk);
diff --git a/net/mptcp/sockopt.c b/net/mptcp/sockopt.c
index 59bd5e114392..d15891e23f45 100644
--- a/net/mptcp/sockopt.c
+++ b/net/mptcp/sockopt.c
@@ -1472,9 +1472,51 @@ void mptcp_sockopt_sync_locked(struct mptcp_sock *msk, struct sock *ssk)
msk_owned_by_me(msk);
+ ssk->sk_rcvlowat = 0;
+
if (READ_ONCE(subflow->setsockopt_seq) != msk->setsockopt_seq) {
sync_socket_options(msk, ssk);
subflow->setsockopt_seq = msk->setsockopt_seq;
}
}
+
+/* unfortunately this is different enough from the tcp version
+ * that we can't factor it out
+ */
+int mptcp_set_rcvlowat(struct sock *sk, int val)
+{
+ struct mptcp_subflow_context *subflow;
+ int space, cap;
+
+ if (sk->sk_userlocks & SOCK_RCVBUF_LOCK)
+ cap = sk->sk_rcvbuf >> 1;
+ else
+ cap = READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_rmem[2]) >> 1;
+ val = min(val, cap);
+ WRITE_ONCE(sk->sk_rcvlowat, val ? : 1);
+
+ /* Check if we need to signal EPOLLIN right now */
+ if (mptcp_epollin_ready(sk))
+ sk->sk_data_ready(sk);
+
+ if (sk->sk_userlocks & SOCK_RCVBUF_LOCK)
+ return 0;
+
+ space = __tcp_space_from_win(mptcp_sk(sk)->scaling_ratio, val);
+ if (space <= sk->sk_rcvbuf)
+ return 0;
+
+ /* propagate the rcvbuf changes to all the subflows */
+ WRITE_ONCE(sk->sk_rcvbuf, space);
+ mptcp_for_each_subflow(mptcp_sk(sk), subflow) {
+ struct sock *ssk = mptcp_subflow_tcp_sock(subflow);
+ bool slow;
+
+ slow = lock_sock_fast(ssk);
+ WRITE_ONCE(ssk->sk_rcvbuf, space);
+ tcp_sk(ssk)->window_clamp = val;
+ unlock_sock_fast(ssk, slow);
+ }
+ return 0;
+}
diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index dbc7a52b322f..080b16426222 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -1405,10 +1405,18 @@ static void subflow_data_ready(struct sock *sk)
WARN_ON_ONCE(!__mptcp_check_fallback(msk) && !subflow->mp_capable &&
!subflow->mp_join && !(state & TCPF_CLOSE));
- if (mptcp_subflow_data_available(sk))
+ if (mptcp_subflow_data_available(sk)) {
mptcp_data_ready(parent, sk);
- else if (unlikely(sk->sk_err))
+
+ /* subflow-level lowat tests are not relevant;
+ * respect the msk-level threshold, which may mandate an immediate ack
+ */
+ if (mptcp_data_avail(msk) < parent->sk_rcvlowat &&
+ (tcp_sk(sk)->rcv_nxt - tcp_sk(sk)->rcv_wup) > inet_csk(sk)->icsk_ack.rcv_mss)
+ inet_csk(sk)->icsk_ack.pending |= ICSK_ACK_NOW;
+ } else if (unlikely(sk->sk_err)) {
subflow_error_report(sk);
+ }
}
static void subflow_write_space(struct sock *ssk)
--
2.41.0
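A usage sketch of what the new set_rcvlowat() callback enables (the 64KB
threshold is arbitrary):

#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>

#ifndef IPPROTO_MPTCP
#define IPPROTO_MPTCP 262
#endif

int main(void)
{
	int fd = socket(AF_INET, SOCK_STREAM, IPPROTO_MPTCP);
	int lowat = 64 * 1024;	/* don't wake readers below 64KB queued */

	if (fd < 0)
		return perror("socket"), 1;
	if (setsockopt(fd, SOL_SOCKET, SO_RCVLOWAT, &lowat, sizeof(lowat)))
		return perror("SO_RCVLOWAT"), 1;

	/* with this patch, poll() and blocking reads on the msk honor the
	 * threshold, and sk_rcvbuf is auto-tuned to accommodate it
	 */
	printf("rcvlowat set to %d\n", lowat);
	return 0;
}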
* [PATCH net-next 6/9] mptcp: use copy_from_iter helpers on transmit
2023-10-23 20:44 [PATCH net-next 0/9] mptcp: Features and fixes for v6.7 Mat Martineau
` (4 preceding siblings ...)
2023-10-23 20:44 ` [PATCH net-next 5/9] mptcp: give rcvlowat some love Mat Martineau
@ 2023-10-23 20:44 ` Mat Martineau
2023-10-23 20:44 ` [PATCH net-next 7/9] mptcp: consolidate sockopt synchronization Mat Martineau
` (3 subsequent siblings)
9 siblings, 0 replies; 17+ messages in thread
From: Mat Martineau @ 2023-10-23 20:44 UTC (permalink / raw)
To: Matthieu Baerts, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: netdev, mptcp, Mat Martineau
From: Paolo Abeni <pabeni@redhat.com>
The perf traces show a high cost for the MPTCP transmit path memcpy.
It turns out that the helper currently in use carries quite a bit
of unneeded overhead, e.g. to map/unmap the memory pages.
Moving to the 'copy_from_iter' variant removes such overhead and
additionally gains no-cache support.
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <martineau@kernel.org>
---
net/mptcp/protocol.c | 19 +++++++++++++++----
1 file changed, 15 insertions(+), 4 deletions(-)
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 7036e30c449f..5489f024dd7e 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -1760,6 +1760,18 @@ static int mptcp_sendmsg_fastopen(struct sock *sk, struct msghdr *msg,
return ret;
}
+static int do_copy_data_nocache(struct sock *sk, int copy,
+ struct iov_iter *from, char *to)
+{
+ if (sk->sk_route_caps & NETIF_F_NOCACHE_COPY) {
+ if (!copy_from_iter_full_nocache(to, copy, from))
+ return -EFAULT;
+ } else if (!copy_from_iter_full(to, copy, from)) {
+ return -EFAULT;
+ }
+ return 0;
+}
+
static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
{
struct mptcp_sock *msk = mptcp_sk(sk);
@@ -1833,11 +1845,10 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
if (!sk_wmem_schedule(sk, total_ts))
goto wait_for_memory;
- if (copy_page_from_iter(dfrag->page, offset, psize,
- &msg->msg_iter) != psize) {
- ret = -EFAULT;
+ ret = do_copy_data_nocache(sk, psize, &msg->msg_iter,
+ page_address(dfrag->page) + offset);
+ if (ret)
goto do_error;
- }
/* data successfully copied into the write queue */
sk_forward_alloc_add(sk, -total_ts);
--
2.41.0
* [PATCH net-next 7/9] mptcp: consolidate sockopt synchronization
2023-10-23 20:44 [PATCH net-next 0/9] mptcp: Features and fixes for v6.7 Mat Martineau
` (5 preceding siblings ...)
2023-10-23 20:44 ` [PATCH net-next 6/9] mptcp: use copy_from_iter helpers on transmit Mat Martineau
@ 2023-10-23 20:44 ` Mat Martineau
2023-10-23 20:44 ` [PATCH net-next 8/9] mptcp: ignore notsent_lowat setting at the subflow level Mat Martineau
` (2 subsequent siblings)
9 siblings, 0 replies; 17+ messages in thread
From: Mat Martineau @ 2023-10-23 20:44 UTC (permalink / raw)
To: Matthieu Baerts, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: netdev, mptcp, Mat Martineau
From: Paolo Abeni <pabeni@redhat.com>
Move the socket option synchronization for active subflows to
subflow creation time. This allows removing the now-unused unlocked
variant of such helper.
While at it, clean up the mptcp_subflow_create_socket() error path
a bit.
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <martineau@kernel.org>
---
net/mptcp/protocol.c | 2 --
net/mptcp/sockopt.c | 22 ----------------------
net/mptcp/subflow.c | 18 +++++++++---------
3 files changed, 9 insertions(+), 33 deletions(-)
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 5489f024dd7e..e44a3da12b96 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -121,8 +121,6 @@ struct sock *__mptcp_nmpc_sk(struct mptcp_sock *msk)
ret = __mptcp_socket_create(msk);
if (ret)
return ERR_PTR(ret);
-
- mptcp_sockopt_sync(msk, msk->first);
}
return msk->first;
diff --git a/net/mptcp/sockopt.c b/net/mptcp/sockopt.c
index d15891e23f45..abf0645cb65d 100644
--- a/net/mptcp/sockopt.c
+++ b/net/mptcp/sockopt.c
@@ -1444,28 +1444,6 @@ static void sync_socket_options(struct mptcp_sock *msk, struct sock *ssk)
inet_assign_bit(FREEBIND, ssk, inet_test_bit(FREEBIND, sk));
}
-static void __mptcp_sockopt_sync(struct mptcp_sock *msk, struct sock *ssk)
-{
- bool slow = lock_sock_fast(ssk);
-
- sync_socket_options(msk, ssk);
-
- unlock_sock_fast(ssk, slow);
-}
-
-void mptcp_sockopt_sync(struct mptcp_sock *msk, struct sock *ssk)
-{
- struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(ssk);
-
- msk_owned_by_me(msk);
-
- if (READ_ONCE(subflow->setsockopt_seq) != msk->setsockopt_seq) {
- __mptcp_sockopt_sync(msk, ssk);
-
- subflow->setsockopt_seq = msk->setsockopt_seq;
- }
-}
-
void mptcp_sockopt_sync_locked(struct mptcp_sock *msk, struct sock *ssk)
{
struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(ssk);
diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index 080b16426222..df208666fd19 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -1533,8 +1533,6 @@ int __mptcp_subflow_connect(struct sock *sk, const struct mptcp_addr_info *loc,
if (addr.ss_family == AF_INET6)
addrlen = sizeof(struct sockaddr_in6);
#endif
- mptcp_sockopt_sync(msk, ssk);
-
ssk->sk_bound_dev_if = ifindex;
err = kernel_bind(sf, (struct sockaddr *)&addr, addrlen);
if (err)
@@ -1645,7 +1643,7 @@ int mptcp_subflow_create_socket(struct sock *sk, unsigned short family,
err = security_mptcp_add_subflow(sk, sf->sk);
if (err)
- goto release_ssk;
+ goto err_free;
/* the newly created socket has to be in the same cgroup as its parent */
mptcp_attach_cgroup(sk, sf->sk);
@@ -1659,15 +1657,12 @@ int mptcp_subflow_create_socket(struct sock *sk, unsigned short family,
get_net_track(net, &sf->sk->ns_tracker, GFP_KERNEL);
sock_inuse_add(net, 1);
err = tcp_set_ulp(sf->sk, "mptcp");
+ if (err)
+ goto err_free;
-release_ssk:
+ mptcp_sockopt_sync_locked(mptcp_sk(sk), sf->sk);
release_sock(sf->sk);
- if (err) {
- sock_release(sf);
- return err;
- }
-
/* the newly created socket really belongs to the owning MPTCP master
* socket, even if for additional subflows the allocation is performed
* by a kernel workqueue. Adjust inode references, so that the
@@ -1687,6 +1682,11 @@ int mptcp_subflow_create_socket(struct sock *sk, unsigned short family,
mptcp_subflow_ops_override(sf->sk);
return 0;
+
+err_free:
+ release_sock(sf->sk);
+ sock_release(sf);
+ return err;
}
static struct mptcp_subflow_context *subflow_create_ctx(struct sock *sk,
--
2.41.0
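The reworked error path follows the common single-exit unwind shape; a
generic, self-contained sketch of that shape (stub names are
illustrative, not the mptcp code):

#include <stdio.h>

/* stubs standing in for real setup steps */
static int step_a(void) { return 0; }
static void undo_a(void) { puts("unwound step a"); }
static int step_b(void) { return -1; /* simulated failure */ }

/* later failures jump to labels that undo earlier steps in reverse
 * order, keeping a single unwind sequence at the end of the function
 */
static int setup(void)
{
	int err;

	err = step_a();
	if (err)
		return err;

	err = step_b();
	if (err)
		goto err_undo_a;

	return 0;

err_undo_a:
	undo_a();
	return err;
}

int main(void)
{
	printf("setup: %d\n", setup());
	return 0;
}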
* [PATCH net-next 8/9] mptcp: ignore notsent_lowat setting at the subflow level
2023-10-23 20:44 [PATCH net-next 0/9] mptcp: Features and fixes for v6.7 Mat Martineau
` (6 preceding siblings ...)
2023-10-23 20:44 ` [PATCH net-next 7/9] mptcp: consolidate sockopt synchronization Mat Martineau
@ 2023-10-23 20:44 ` Mat Martineau
2023-10-23 20:44 ` [PATCH net-next 9/9] mptcp: refactor sndbuf auto-tuning Mat Martineau
2023-10-25 19:50 ` [PATCH net-next 0/9] mptcp: Features and fixes for v6.7 patchwork-bot+netdevbpf
9 siblings, 0 replies; 17+ messages in thread
From: Mat Martineau @ 2023-10-23 20:44 UTC (permalink / raw)
To: Matthieu Baerts, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: netdev, mptcp, Mat Martineau
From: Paolo Abeni <pabeni@redhat.com>
Any latency-related tuning taking action at the subflow level does
not really affect user-space, as only the main MPTCP socket is
relevant.
Worse, any limiting setting may foul the MPTCP scheduler, leaving it
unable to fully use the subflow-level cwnd and leading to very poor
b/w usage.
Enforce notsent_lowat to be a no-op on every subflow.
Note that TCP_NOTSENT_LOWAT is currently not supported, and properly
dealing with that will require more invasive changes.
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <martineau@kernel.org>
---
net/mptcp/sockopt.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/net/mptcp/sockopt.c b/net/mptcp/sockopt.c
index abf0645cb65d..72858d7d8974 100644
--- a/net/mptcp/sockopt.c
+++ b/net/mptcp/sockopt.c
@@ -1452,6 +1452,12 @@ void mptcp_sockopt_sync_locked(struct mptcp_sock *msk, struct sock *ssk)
ssk->sk_rcvlowat = 0;
+ /* subflows must ignore any latency-related settings: they will not
+ * affect user-space - only the msk is relevant - but they will foul
+ * the mptcp scheduler
+ */
+ tcp_sk(ssk)->notsent_lowat = UINT_MAX;
+
if (READ_ONCE(subflow->setsockopt_seq) != msk->setsockopt_seq) {
sync_socket_options(msk, ssk);
--
2.41.0
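To see why UINT_MAX neutralizes the check on subflows, a simplified
model of TCP's unsent-data gate (illustrative, not the kernel's exact
code; the unsigned subtraction copes with sequence wrap):

#include <stdio.h>
#include <limits.h>

/* simplified model of the notsent_lowat gate on the write path */
static int stream_memory_free(unsigned int write_seq, unsigned int snd_nxt,
			      unsigned int notsent_lowat)
{
	unsigned int notsent = write_seq - snd_nxt; /* queued but unsent */

	return notsent < notsent_lowat;
}

int main(void)
{
	/* 1MB queued but unsent: a 128KB lowat would throttle the sender */
	printf("lowat 128KB: %d\n",
	       stream_memory_free(1 << 20, 0, 128 << 10));
	/* ...while UINT_MAX, as set on subflows here, never does */
	printf("lowat UINT_MAX: %d\n",
	       stream_memory_free(1 << 20, 0, UINT_MAX));
	return 0;
}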
* [PATCH net-next 9/9] mptcp: refactor sndbuf auto-tuning
2023-10-23 20:44 [PATCH net-next 0/9] mptcp: Features and fixes for v6.7 Mat Martineau
` (7 preceding siblings ...)
2023-10-23 20:44 ` [PATCH net-next 8/9] mptcp: ignore notsent_lowat setting at the subflow level Mat Martineau
@ 2023-10-23 20:44 ` Mat Martineau
2023-11-02 17:19 ` Eric Dumazet
2023-10-25 19:50 ` [PATCH net-next 0/9] mptcp: Features and fixes for v6.7 patchwork-bot+netdevbpf
9 siblings, 1 reply; 17+ messages in thread
From: Mat Martineau @ 2023-10-23 20:44 UTC (permalink / raw)
To: Matthieu Baerts, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: netdev, mptcp, Mat Martineau
From: Paolo Abeni <pabeni@redhat.com>
The MPTCP protocol accounts for the data enqueued on all the
subflows to the main socket send buffer, while the send buffer
auto-tuning algorithm sets the main socket send buffer size as the
max size among the subflows.
That causes bad performance when at least one subflow is sndbuf
limited, e.g. due to very high latency, as the MPTCP scheduler can't
even fill such a buffer.
Change the send-buffer auto-tuning algorithm to compute the main
socket send buffer size as the sum of all the subflows' buffer
sizes.
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <martineau@kernel.org>
---
net/mptcp/protocol.c | 18 ++++++++++++++++--
net/mptcp/protocol.h | 54 +++++++++++++++++++++++++++++++++++++++++++++++-----
net/mptcp/sockopt.c | 5 ++++-
net/mptcp/subflow.c | 3 +--
4 files changed, 70 insertions(+), 10 deletions(-)
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index e44a3da12b96..1dacc072dcca 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -890,6 +890,7 @@ static bool __mptcp_finish_join(struct mptcp_sock *msk, struct sock *ssk)
mptcp_sockopt_sync_locked(msk, ssk);
mptcp_subflow_joined(msk, ssk);
mptcp_stop_tout_timer(sk);
+ __mptcp_propagate_sndbuf(sk, ssk);
return true;
}
@@ -1076,15 +1077,16 @@ static void mptcp_enter_memory_pressure(struct sock *sk)
struct mptcp_sock *msk = mptcp_sk(sk);
bool first = true;
- sk_stream_moderate_sndbuf(sk);
mptcp_for_each_subflow(msk, subflow) {
struct sock *ssk = mptcp_subflow_tcp_sock(subflow);
if (first)
tcp_enter_memory_pressure(ssk);
sk_stream_moderate_sndbuf(ssk);
+
first = false;
}
+ __mptcp_sync_sndbuf(sk);
}
/* ensure we get enough memory for the frag hdr, beyond some minimal amount of
@@ -2458,6 +2460,7 @@ static void __mptcp_close_ssk(struct sock *sk, struct sock *ssk,
WRITE_ONCE(msk->first, NULL);
out:
+ __mptcp_sync_sndbuf(sk);
if (need_push)
__mptcp_push_pending(sk, 0);
@@ -3224,7 +3227,7 @@ struct sock *mptcp_sk_clone_init(const struct sock *sk,
* uses the correct data
*/
mptcp_copy_inaddrs(nsk, ssk);
- mptcp_propagate_sndbuf(nsk, ssk);
+ __mptcp_propagate_sndbuf(nsk, ssk);
mptcp_rcv_space_init(msk, ssk);
bh_unlock_sock(nsk);
@@ -3402,6 +3405,8 @@ static void mptcp_release_cb(struct sock *sk)
__mptcp_set_connected(sk);
if (__test_and_clear_bit(MPTCP_ERROR_REPORT, &msk->cb_flags))
__mptcp_error_report(sk);
+ if (__test_and_clear_bit(MPTCP_SYNC_SNDBUF, &msk->cb_flags))
+ __mptcp_sync_sndbuf(sk);
}
__mptcp_update_rmem(sk);
@@ -3446,6 +3451,14 @@ void mptcp_subflow_process_delegated(struct sock *ssk, long status)
__set_bit(MPTCP_PUSH_PENDING, &mptcp_sk(sk)->cb_flags);
mptcp_data_unlock(sk);
}
+ if (status & BIT(MPTCP_DELEGATE_SNDBUF)) {
+ mptcp_data_lock(sk);
+ if (!sock_owned_by_user(sk))
+ __mptcp_sync_sndbuf(sk);
+ else
+ __set_bit(MPTCP_SYNC_SNDBUF, &mptcp_sk(sk)->cb_flags);
+ mptcp_data_unlock(sk);
+ }
if (status & BIT(MPTCP_DELEGATE_ACK))
schedule_3rdack_retransmission(ssk);
}
@@ -3530,6 +3543,7 @@ bool mptcp_finish_join(struct sock *ssk)
/* active subflow, already present inside the conn_list */
if (!list_empty(&subflow->node)) {
mptcp_subflow_joined(msk, ssk);
+ mptcp_propagate_sndbuf(parent, ssk);
return true;
}
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index 620a82cd4c10..296d01965943 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -123,6 +123,7 @@
#define MPTCP_RETRANSMIT 4
#define MPTCP_FLUSH_JOIN_LIST 5
#define MPTCP_CONNECTED 6
+#define MPTCP_SYNC_SNDBUF 7
struct mptcp_skb_cb {
u64 map_seq;
@@ -443,6 +444,7 @@ DECLARE_PER_CPU(struct mptcp_delegated_action, mptcp_delegated_actions);
#define MPTCP_DELEGATE_SCHEDULED 0
#define MPTCP_DELEGATE_SEND 1
#define MPTCP_DELEGATE_ACK 2
+#define MPTCP_DELEGATE_SNDBUF 3
#define MPTCP_DELEGATE_ACTIONS_MASK (~BIT(MPTCP_DELEGATE_SCHEDULED))
/* MPTCP subflow context */
@@ -516,6 +518,9 @@ struct mptcp_subflow_context {
u32 setsockopt_seq;
u32 stale_rcv_tstamp;
+ int cached_sndbuf; /* sndbuf size when last synced with the msk sndbuf,
+ * protected by the msk socket lock
+ */
struct sock *tcp_sock; /* tcp sk backpointer */
struct sock *conn; /* parent mptcp_sock */
@@ -778,13 +783,52 @@ static inline bool mptcp_data_fin_enabled(const struct mptcp_sock *msk)
READ_ONCE(msk->write_seq) == READ_ONCE(msk->snd_nxt);
}
-static inline bool mptcp_propagate_sndbuf(struct sock *sk, struct sock *ssk)
+static inline void __mptcp_sync_sndbuf(struct sock *sk)
{
- if ((sk->sk_userlocks & SOCK_SNDBUF_LOCK) || ssk->sk_sndbuf <= READ_ONCE(sk->sk_sndbuf))
- return false;
+ struct mptcp_subflow_context *subflow;
+ int ssk_sndbuf, new_sndbuf;
+
+ if (sk->sk_userlocks & SOCK_SNDBUF_LOCK)
+ return;
+
+ new_sndbuf = sock_net(sk)->ipv4.sysctl_tcp_wmem[0];
+ mptcp_for_each_subflow(mptcp_sk(sk), subflow) {
+ ssk_sndbuf = READ_ONCE(mptcp_subflow_tcp_sock(subflow)->sk_sndbuf);
+
+ subflow->cached_sndbuf = ssk_sndbuf;
+ new_sndbuf += ssk_sndbuf;
+ }
+
+ /* the msk max wmem limit is <nr_subflows> * tcp wmem[2] */
+ WRITE_ONCE(sk->sk_sndbuf, new_sndbuf);
+}
+
+/* The caller holds both the msk socket and the subflow socket locks,
+ * possibly under BH
+ */
+static inline void __mptcp_propagate_sndbuf(struct sock *sk, struct sock *ssk)
+{
+ struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(ssk);
+
+ if (READ_ONCE(ssk->sk_sndbuf) != subflow->cached_sndbuf)
+ __mptcp_sync_sndbuf(sk);
+}
+
+/* the caller holds only the subflow socket lock, either in process or
+ * BH context. Additionally this can be called under the msk data lock,
+ * so we can't acquire such lock here: let the delegated action acquire
+ * the needed locks in a suitable order.
+ */
+static inline void mptcp_propagate_sndbuf(struct sock *sk, struct sock *ssk)
+{
+ struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(ssk);
+
+ if (likely(READ_ONCE(ssk->sk_sndbuf) == subflow->cached_sndbuf))
+ return;
- WRITE_ONCE(sk->sk_sndbuf, ssk->sk_sndbuf);
- return true;
+ local_bh_disable();
+ mptcp_subflow_delegate(subflow, MPTCP_DELEGATE_SNDBUF);
+ local_bh_enable();
}
static inline void mptcp_write_space(struct sock *sk)
diff --git a/net/mptcp/sockopt.c b/net/mptcp/sockopt.c
index 72858d7d8974..574e221bb765 100644
--- a/net/mptcp/sockopt.c
+++ b/net/mptcp/sockopt.c
@@ -95,6 +95,7 @@ static void mptcp_sol_socket_sync_intval(struct mptcp_sock *msk, int optname, in
case SO_SNDBUFFORCE:
ssk->sk_userlocks |= SOCK_SNDBUF_LOCK;
WRITE_ONCE(ssk->sk_sndbuf, sk->sk_sndbuf);
+ mptcp_subflow_ctx(ssk)->cached_sndbuf = sk->sk_sndbuf;
break;
case SO_RCVBUF:
case SO_RCVBUFFORCE:
@@ -1415,8 +1416,10 @@ static void sync_socket_options(struct mptcp_sock *msk, struct sock *ssk)
if (sk->sk_userlocks & tx_rx_locks) {
ssk->sk_userlocks |= sk->sk_userlocks & tx_rx_locks;
- if (sk->sk_userlocks & SOCK_SNDBUF_LOCK)
+ if (sk->sk_userlocks & SOCK_SNDBUF_LOCK) {
WRITE_ONCE(ssk->sk_sndbuf, sk->sk_sndbuf);
+ mptcp_subflow_ctx(ssk)->cached_sndbuf = sk->sk_sndbuf;
+ }
if (sk->sk_userlocks & SOCK_RCVBUF_LOCK)
WRITE_ONCE(ssk->sk_rcvbuf, sk->sk_rcvbuf);
}
diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index df208666fd19..2b43577f952e 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -421,6 +421,7 @@ static bool subflow_use_different_dport(struct mptcp_sock *msk, const struct soc
void __mptcp_set_connected(struct sock *sk)
{
+ __mptcp_propagate_sndbuf(sk, mptcp_sk(sk)->first);
if (sk->sk_state == TCP_SYN_SENT) {
inet_sk_state_store(sk, TCP_ESTABLISHED);
sk->sk_state_change(sk);
@@ -472,7 +473,6 @@ static void subflow_finish_connect(struct sock *sk, const struct sk_buff *skb)
return;
msk = mptcp_sk(parent);
- mptcp_propagate_sndbuf(parent, sk);
subflow->rel_write_seq = 1;
subflow->conn_finished = 1;
subflow->ssn_offset = TCP_SKB_CB(skb)->seq;
@@ -1736,7 +1736,6 @@ static void subflow_state_change(struct sock *sk)
msk = mptcp_sk(parent);
if (subflow_simultaneous_connect(sk)) {
- mptcp_propagate_sndbuf(parent, sk);
mptcp_do_fallback(sk);
mptcp_rcv_space_init(msk, sk);
pr_fallback(msk);
--
2.41.0
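A toy illustration of the sizing change described above (the subflow
buffer sizes are made up; the 4096 floor stands in for tcp_wmem[0]):

#include <stdio.h>

int main(void)
{
	/* one fast subflow, one high-latency sndbuf-limited subflow */
	int ssk_sndbuf[] = { 4 << 20, 128 << 10 };
	int max = 0, sum = 4096;
	int i;

	for (i = 0; i < 2; i++) {
		if (ssk_sndbuf[i] > max)
			max = ssk_sndbuf[i];
		sum += ssk_sndbuf[i];
	}
	/* old tuning: msk sndbuf = max (4MB), too small to keep both
	 * subflows busy, as data on both is charged to the same buffer;
	 * new tuning: the sum covers both
	 */
	printf("max=%d sum=%d\n", max, sum);
	return 0;
}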
* Re: [PATCH net-next 9/9] mptcp: refactor sndbuf auto-tuning
2023-10-23 20:44 ` [PATCH net-next 9/9] mptcp: refactor sndbuf auto-tuning Mat Martineau
@ 2023-11-02 17:19 ` Eric Dumazet
2023-11-02 18:38 ` Mat Martineau
2023-11-06 7:21 ` Paolo Abeni
0 siblings, 2 replies; 17+ messages in thread
From: Eric Dumazet @ 2023-11-02 17:19 UTC (permalink / raw)
To: Mat Martineau
Cc: Matthieu Baerts, David S. Miller, Jakub Kicinski, Paolo Abeni,
netdev, mptcp
On Mon, Oct 23, 2023 at 10:45 PM Mat Martineau <martineau@kernel.org> wrote:
>
> From: Paolo Abeni <pabeni@redhat.com>
>
> The MPTCP protocol account for the data enqueued on all the subflows
> to the main socket send buffer, while the send buffer auto-tuning
> algorithm set the main socket send buffer size as the max size among
> the subflows.
>
> That causes bad performances when at least one subflow is sndbuf
> limited, e.g. due to very high latency, as the MPTCP scheduler can't
> even fill such buffer.
>
> Change the send-buffer auto-tuning algorithm to compute the main socket
> send buffer size as the sum of all the subflows buffer size.
>
> Reviewed-by: Mat Martineau <martineau@kernel.org>
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> Signed-off-by: Mat Martineau <martineau@kernel.org
...
> diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
> index df208666fd19..2b43577f952e 100644
> --- a/net/mptcp/subflow.c
> +++ b/net/mptcp/subflow.c
> @@ -421,6 +421,7 @@ static bool subflow_use_different_dport(struct mptcp_sock *msk, const struct soc
>
> void __mptcp_set_connected(struct sock *sk)
> {
> + __mptcp_propagate_sndbuf(sk, mptcp_sk(sk)->first);
->first can be NULL here, according to syzbot.
> if (sk->sk_state == TCP_SYN_SENT) {
> inet_sk_state_store(sk, TCP_ESTABLISHED);
> sk->sk_state_change(sk);
* Re: [PATCH net-next 9/9] mptcp: refactor sndbuf auto-tuning
2023-11-02 17:19 ` Eric Dumazet
@ 2023-11-02 18:38 ` Mat Martineau
2023-11-06 7:21 ` Paolo Abeni
1 sibling, 0 replies; 17+ messages in thread
From: Mat Martineau @ 2023-11-02 18:38 UTC (permalink / raw)
To: Eric Dumazet, Paolo Abeni, Matthieu Baerts
Cc: David S. Miller, Jakub Kicinski, netdev, mptcp
On Thu, 2 Nov 2023, Eric Dumazet wrote:
> On Mon, Oct 23, 2023 at 10:45 PM Mat Martineau <martineau@kernel.org> wrote:
>>
>> From: Paolo Abeni <pabeni@redhat.com>
>>
>> The MPTCP protocol account for the data enqueued on all the subflows
>> to the main socket send buffer, while the send buffer auto-tuning
>> algorithm set the main socket send buffer size as the max size among
>> the subflows.
>>
>> That causes bad performances when at least one subflow is sndbuf
>> limited, e.g. due to very high latency, as the MPTCP scheduler can't
>> even fill such buffer.
>>
>> Change the send-buffer auto-tuning algorithm to compute the main socket
>> send buffer size as the sum of all the subflows buffer size.
>>
>> Reviewed-by: Mat Martineau <martineau@kernel.org>
>> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
>> Signed-off-by: Mat Martineau <martineau@kernel.org
>
> ...
>
>> diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
>> index df208666fd19..2b43577f952e 100644
>> --- a/net/mptcp/subflow.c
>> +++ b/net/mptcp/subflow.c
>> @@ -421,6 +421,7 @@ static bool subflow_use_different_dport(struct mptcp_sock *msk, const struct soc
>>
>> void __mptcp_set_connected(struct sock *sk)
>> {
>> + __mptcp_propagate_sndbuf(sk, mptcp_sk(sk)->first);
>
> ->first can be NULL here, according to syzbot.
>
Thanks for reporting this, Eric. We will get a fix to the net tree.
Paolo & Matthieu, I created a github issue to track this:
https://github.com/multipath-tcp/mptcp_net-next/issues/454
- Mat
* Re: [PATCH net-next 9/9] mptcp: refactor sndbuf auto-tuning
2023-11-02 17:19 ` Eric Dumazet
2023-11-02 18:38 ` Mat Martineau
@ 2023-11-06 7:21 ` Paolo Abeni
2023-11-06 8:18 ` Eric Dumazet
1 sibling, 1 reply; 17+ messages in thread
From: Paolo Abeni @ 2023-11-06 7:21 UTC (permalink / raw)
To: Eric Dumazet, Mat Martineau
Cc: Matthieu Baerts, David S. Miller, Jakub Kicinski, netdev, mptcp
Hi,
On Thu, 2023-11-02 at 18:19 +0100, Eric Dumazet wrote:
> On Mon, Oct 23, 2023 at 10:45 PM Mat Martineau <martineau@kernel.org> wrote:
> >
> > From: Paolo Abeni <pabeni@redhat.com>
> >
> > The MPTCP protocol account for the data enqueued on all the subflows
> > to the main socket send buffer, while the send buffer auto-tuning
> > algorithm set the main socket send buffer size as the max size among
> > the subflows.
> >
> > That causes bad performances when at least one subflow is sndbuf
> > limited, e.g. due to very high latency, as the MPTCP scheduler can't
> > even fill such buffer.
> >
> > Change the send-buffer auto-tuning algorithm to compute the main socket
> > send buffer size as the sum of all the subflows buffer size.
> >
> > Reviewed-by: Mat Martineau <martineau@kernel.org>
> > Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> > Signed-off-by: Mat Martineau <martineau@kernel.org
>
> ...
>
> > diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
> > index df208666fd19..2b43577f952e 100644
> > --- a/net/mptcp/subflow.c
> > +++ b/net/mptcp/subflow.c
> > @@ -421,6 +421,7 @@ static bool subflow_use_different_dport(struct mptcp_sock *msk, const struct soc
> >
> > void __mptcp_set_connected(struct sock *sk)
> > {
> > + __mptcp_propagate_sndbuf(sk, mptcp_sk(sk)->first);
>
> ->first can be NULL here, according to syzbot.
I'm sorry for the latency on my side; I had a different kind of crash
to handle here.
Do you have a syzkaller report available? Or the call trace landing
here?
Thanks!
Paolo
* Re: [PATCH net-next 9/9] mptcp: refactor sndbuf auto-tuning
2023-11-06 7:21 ` Paolo Abeni
@ 2023-11-06 8:18 ` Eric Dumazet
0 siblings, 0 replies; 17+ messages in thread
From: Eric Dumazet @ 2023-11-06 8:18 UTC (permalink / raw)
To: Paolo Abeni
Cc: Mat Martineau, Matthieu Baerts, David S. Miller, Jakub Kicinski,
netdev, mptcp
On Mon, Nov 6, 2023 at 8:22 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> Hi,
>
> On Thu, 2023-11-02 at 18:19 +0100, Eric Dumazet wrote:
> > On Mon, Oct 23, 2023 at 10:45 PM Mat Martineau <martineau@kernel.org> wrote:
> > >
> > > From: Paolo Abeni <pabeni@redhat.com>
> > >
> > > The MPTCP protocol account for the data enqueued on all the subflows
> > > to the main socket send buffer, while the send buffer auto-tuning
> > > algorithm set the main socket send buffer size as the max size among
> > > the subflows.
> > >
> > > That causes bad performances when at least one subflow is sndbuf
> > > limited, e.g. due to very high latency, as the MPTCP scheduler can't
> > > even fill such buffer.
> > >
> > > Change the send-buffer auto-tuning algorithm to compute the main socket
> > > send buffer size as the sum of all the subflows buffer size.
> > >
> > > Reviewed-by: Mat Martineau <martineau@kernel.org>
> > > Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> > > Signed-off-by: Mat Martineau <martineau@kernel.org
> >
> > ...
> >
> > > diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
> > > index df208666fd19..2b43577f952e 100644
> > > --- a/net/mptcp/subflow.c
> > > +++ b/net/mptcp/subflow.c
> > > @@ -421,6 +421,7 @@ static bool subflow_use_different_dport(struct mptcp_sock *msk, const struct soc
> > >
> > > void __mptcp_set_connected(struct sock *sk)
> > > {
> > > + __mptcp_propagate_sndbuf(sk, mptcp_sk(sk)->first);
> >
> > ->first can be NULL here, according to syzbot.
>
> I'm sorry for the latency on my side, I had a different kind of crash
> to handle here.
>
> Do you have a syzkaller report available? Or the call trace landing
> here?
Sure, let me release the report.
* Re: [PATCH net-next 0/9] mptcp: Features and fixes for v6.7
2023-10-23 20:44 [PATCH net-next 0/9] mptcp: Features and fixes for v6.7 Mat Martineau
` (8 preceding siblings ...)
2023-10-23 20:44 ` [PATCH net-next 9/9] mptcp: refactor sndbuf auto-tuning Mat Martineau
@ 2023-10-25 19:50 ` patchwork-bot+netdevbpf
9 siblings, 0 replies; 17+ messages in thread
From: patchwork-bot+netdevbpf @ 2023-10-25 19:50 UTC (permalink / raw)
To: Mat Martineau; +Cc: matttbe, davem, edumazet, kuba, pabeni, netdev, mptcp
Hello:
This series was applied to netdev/net-next.git (main)
by Jakub Kicinski <kuba@kernel.org>:
On Mon, 23 Oct 2023 13:44:33 -0700 you wrote:
> Patch 1 adds a configurable timeout for the MPTCP connection when all
> subflows are closed, to support break-before-make use cases.
>
> Patch 2 is a fix for a 1-byte error in rx data counters with MPTCP
> fastopen connections.
>
> Patch 3 is a minor code cleanup.
>
> [...]
Here is the summary with links:
- [net-next,1/9] mptcp: add a new sysctl for make after break timeout
https://git.kernel.org/netdev/net-next/c/d866ae9aaa43
- [net-next,2/9] mptcp: properly account fastopen data
https://git.kernel.org/netdev/net-next/c/bf0e96108fb6
- [net-next,3/9] mptcp: use plain bool instead of custom binary enum
https://git.kernel.org/netdev/net-next/c/f1f26512a9bf
- [net-next,4/9] tcp: define initial scaling factor value as a macro
https://git.kernel.org/netdev/net-next/c/849ee75a38b2
- [net-next,5/9] mptcp: give rcvlowat some love
https://git.kernel.org/netdev/net-next/c/5684ab1a0eff
- [net-next,6/9] mptcp: use copy_from_iter helpers on transmit
https://git.kernel.org/netdev/net-next/c/0ffe8e749040
- [net-next,7/9] mptcp: consolidate sockopt synchronization
https://git.kernel.org/netdev/net-next/c/a1ab24e5fc4a
- [net-next,8/9] mptcp: ignore notsent_lowat setting at the subflow level
https://git.kernel.org/netdev/net-next/c/9fdc779331bd
- [net-next,9/9] mptcp: refactor sndbuf auto-tuning
https://git.kernel.org/netdev/net-next/c/8005184fd1ca
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html