* [PATCH net-next 1/7] mptcp: consolidate subflow cleanup
2025-02-18 18:36 [PATCH net-next 0/7] mptcp: rx path refactor Matthieu Baerts (NGI0)
@ 2025-02-18 18:36 ` Matthieu Baerts (NGI0)
2025-02-18 18:36 ` [PATCH net-next 2/7] mptcp: drop __mptcp_fastopen_gen_msk_ackseq() Matthieu Baerts (NGI0)
` (6 subsequent siblings)
7 siblings, 0 replies; 9+ messages in thread
From: Matthieu Baerts (NGI0) @ 2025-02-18 18:36 UTC
To: mptcp, Mat Martineau, Geliang Tang, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman
Cc: Kuniyuki Iwashima, Willem de Bruijn, David Ahern,
Jamal Hadi Salim, Cong Wang, Jiri Pirko, netdev, linux-kernel,
Matthieu Baerts (NGI0)
From: Paolo Abeni <pabeni@redhat.com>
Consolidate all the cleanup actions requiring the worker in a single
helper, and ensure that the dummy data fin for a fallback socket is
created only when the TCP rx queue is empty.
There are no functional changes intended, but this will simplify the
next patch, in which spooling of the TCP rx queue may be delayed until
release_cb time.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
---
net/mptcp/subflow.c | 33 ++++++++++++++++++---------------
1 file changed, 18 insertions(+), 15 deletions(-)
diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index fd021cf8286eff9234b950a4d4c083ea7756eba3..2926bdf88e42c5f2db6875b00b4eca2dbf49dba2 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -1271,7 +1271,12 @@ static void mptcp_subflow_discard_data(struct sock *ssk, struct sk_buff *skb,
subflow->map_valid = 0;
}
-/* sched mptcp worker to remove the subflow if no more data is pending */
+static bool subflow_is_done(const struct sock *sk)
+{
+ return sk->sk_shutdown & RCV_SHUTDOWN || sk->sk_state == TCP_CLOSE;
+}
+
+/* sched mptcp worker for subflow cleanup if no more data is pending */
static void subflow_sched_work_if_closed(struct mptcp_sock *msk, struct sock *ssk)
{
struct sock *sk = (struct sock *)msk;
@@ -1281,8 +1286,18 @@ static void subflow_sched_work_if_closed(struct mptcp_sock *msk, struct sock *ss
inet_sk_state_load(sk) != TCP_ESTABLISHED)))
return;
- if (skb_queue_empty(&ssk->sk_receive_queue) &&
- !test_and_set_bit(MPTCP_WORK_CLOSE_SUBFLOW, &msk->flags))
+ if (!skb_queue_empty(&ssk->sk_receive_queue))
+ return;
+
+ if (!test_and_set_bit(MPTCP_WORK_CLOSE_SUBFLOW, &msk->flags))
+ mptcp_schedule_work(sk);
+
+ /* when the fallback subflow closes the rx side, trigger a 'dummy'
+ * ingress data fin, so that the msk state will follow along
+ */
+ if (__mptcp_check_fallback(msk) && subflow_is_done(ssk) &&
+ msk->first == ssk &&
+ mptcp_update_rcv_data_fin(msk, READ_ONCE(msk->ack_seq), true))
mptcp_schedule_work(sk);
}
@@ -1842,11 +1857,6 @@ static void __subflow_state_change(struct sock *sk)
rcu_read_unlock();
}
-static bool subflow_is_done(const struct sock *sk)
-{
- return sk->sk_shutdown & RCV_SHUTDOWN || sk->sk_state == TCP_CLOSE;
-}
-
static void subflow_state_change(struct sock *sk)
{
struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(sk);
@@ -1873,13 +1883,6 @@ static void subflow_state_change(struct sock *sk)
subflow_error_report(sk);
subflow_sched_work_if_closed(mptcp_sk(parent), sk);
-
- /* when the fallback subflow closes the rx side, trigger a 'dummy'
- * ingress data fin, so that the msk state will follow along
- */
- if (__mptcp_check_fallback(msk) && subflow_is_done(sk) && msk->first == sk &&
- mptcp_update_rcv_data_fin(msk, READ_ONCE(msk->ack_seq), true))
- mptcp_schedule_work(parent);
}
void mptcp_subflow_queue_clean(struct sock *listener_sk, struct sock *listener_ssk)
--
2.47.1
* [PATCH net-next 2/7] mptcp: drop __mptcp_fastopen_gen_msk_ackseq()
2025-02-18 18:36 [PATCH net-next 0/7] mptcp: rx path refactor Matthieu Baerts (NGI0)
2025-02-18 18:36 ` [PATCH net-next 1/7] mptcp: consolidate subflow cleanup Matthieu Baerts (NGI0)
@ 2025-02-18 18:36 ` Matthieu Baerts (NGI0)
2025-02-18 18:36 ` [PATCH net-next 3/7] mptcp: move the whole rx path under msk socket lock protection Matthieu Baerts (NGI0)
` (5 subsequent siblings)
7 siblings, 0 replies; 9+ messages in thread
From: Matthieu Baerts (NGI0) @ 2025-02-18 18:36 UTC
To: mptcp, Mat Martineau, Geliang Tang, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman
Cc: Kuniyuki Iwashima, Willem de Bruijn, David Ahern,
Jamal Hadi Salim, Cong Wang, Jiri Pirko, netdev, linux-kernel,
Matthieu Baerts (NGI0)
From: Paolo Abeni <pabeni@redhat.com>
When we move the whole RX path under the msk socket lock, updating
the already queued skb for a passive fastopen socket at 3rd ack time will
be extremely painful and race prone.
The map_seq for already enqueued skbs is used only to allow correct
coalescing with later data; by preventing coalescing into the first skb
of a fastopen connection, we can completely remove the
__mptcp_fastopen_gen_msk_ackseq() helper.
Before dropping this helper, a new item had to be added to the
mptcp_skb_cb structure. Because this item will be frequently tested in
the fast path -- almost on every packet -- and because there is free
space there, a single byte is used instead of a bitfield. This micro
optimisation slightly reduces the number of CPU operations to do the
associated check.
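The layout argument above can be checked with a small user-space
sketch. The struct names cb_bitfield and cb_bytes and the helper
cb_cant_coalesce() are hypothetical stand-ins for the kernel's
mptcp_skb_cb, not code from this patch; SKB_CB_SIZE mirrors the 48-byte
cb[] scratch area of struct sk_buff. On common ABIs the extra byte is
expected to land in tail padding, so both layouts should have the same
size.

```c
#include <assert.h>
#include <stdint.h>

/* Size of the cb[] scratch area in struct sk_buff. */
#define SKB_CB_SIZE 48

/* Hypothetical mirror of the layout before the patch: 1-bit flag. */
struct cb_bitfield {
	uint64_t map_seq;
	uint64_t end_seq;
	uint32_t offset;
	uint8_t  has_rxtstamp:1;
};

/* Hypothetical mirror of the layout after the patch: one byte per flag. */
struct cb_bytes {
	uint64_t map_seq;
	uint64_t end_seq;
	uint32_t offset;
	uint8_t  has_rxtstamp;
	uint8_t  cant_coalesce;
};

/* The control block must still fit in the skb scratch area. */
_Static_assert(sizeof(struct cb_bytes) <= SKB_CB_SIZE,
	       "control block must fit in skb->cb");

/* A whole-byte flag is tested with a plain byte load and compare;
 * a bitfield generally needs an additional mask (or shift) step.
 */
static int cb_cant_coalesce(const struct cb_bytes *cb)
{
	return cb->cant_coalesce != 0;
}
```

Comparing sizeof(struct cb_bitfield) with sizeof(struct cb_bytes) on the
target ABI shows whether the byte-sized flags cost any extra space.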
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
---
net/mptcp/fastopen.c | 24 ++----------------------
net/mptcp/protocol.c | 4 +++-
net/mptcp/protocol.h | 5 ++---
net/mptcp/subflow.c | 3 ---
4 files changed, 7 insertions(+), 29 deletions(-)
diff --git a/net/mptcp/fastopen.c b/net/mptcp/fastopen.c
index a29ff901df7588dec24e330ddd77a4aeb1462b68..7777f5a2d14379853fcd13c4b57c5569be05a2e4 100644
--- a/net/mptcp/fastopen.c
+++ b/net/mptcp/fastopen.c
@@ -40,13 +40,12 @@ void mptcp_fastopen_subflow_synack_set_params(struct mptcp_subflow_context *subf
tp->copied_seq += skb->len;
subflow->ssn_offset += skb->len;
- /* initialize a dummy sequence number, we will update it at MPC
- * completion, if needed
- */
+ /* Only the sequence delta is relevant */
MPTCP_SKB_CB(skb)->map_seq = -skb->len;
MPTCP_SKB_CB(skb)->end_seq = 0;
MPTCP_SKB_CB(skb)->offset = 0;
MPTCP_SKB_CB(skb)->has_rxtstamp = TCP_SKB_CB(skb)->has_rxtstamp;
+ MPTCP_SKB_CB(skb)->cant_coalesce = 1;
mptcp_data_lock(sk);
@@ -58,22 +57,3 @@ void mptcp_fastopen_subflow_synack_set_params(struct mptcp_subflow_context *subf
mptcp_data_unlock(sk);
}
-
-void __mptcp_fastopen_gen_msk_ackseq(struct mptcp_sock *msk, struct mptcp_subflow_context *subflow,
- const struct mptcp_options_received *mp_opt)
-{
- struct sock *sk = (struct sock *)msk;
- struct sk_buff *skb;
-
- skb = skb_peek_tail(&sk->sk_receive_queue);
- if (skb) {
- WARN_ON_ONCE(MPTCP_SKB_CB(skb)->end_seq);
- pr_debug("msk %p moving seq %llx -> %llx end_seq %llx -> %llx\n", sk,
- MPTCP_SKB_CB(skb)->map_seq, MPTCP_SKB_CB(skb)->map_seq + msk->ack_seq,
- MPTCP_SKB_CB(skb)->end_seq, MPTCP_SKB_CB(skb)->end_seq + msk->ack_seq);
- MPTCP_SKB_CB(skb)->map_seq += msk->ack_seq;
- MPTCP_SKB_CB(skb)->end_seq += msk->ack_seq;
- }
-
- pr_debug("msk=%p ack_seq=%llx\n", msk, msk->ack_seq);
-}
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 6bd81904747066d8f2c1043dd81b372925f18cbb..55f9698f3c22f1dc423a7605c7b00bfda162b54c 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -135,7 +135,8 @@ static bool mptcp_try_coalesce(struct sock *sk, struct sk_buff *to,
bool fragstolen;
int delta;
- if (MPTCP_SKB_CB(from)->offset ||
+ if (unlikely(MPTCP_SKB_CB(to)->cant_coalesce) ||
+ MPTCP_SKB_CB(from)->offset ||
((to->len + from->len) > (sk->sk_rcvbuf >> 3)) ||
!skb_try_coalesce(to, from, &fragstolen, &delta))
return false;
@@ -366,6 +367,7 @@ static bool __mptcp_move_skb(struct mptcp_sock *msk, struct sock *ssk,
MPTCP_SKB_CB(skb)->end_seq = MPTCP_SKB_CB(skb)->map_seq + copy_len;
MPTCP_SKB_CB(skb)->offset = offset;
MPTCP_SKB_CB(skb)->has_rxtstamp = has_rxtstamp;
+ MPTCP_SKB_CB(skb)->cant_coalesce = 0;
if (MPTCP_SKB_CB(skb)->map_seq == msk->ack_seq) {
/* in sequence */
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index 37226cdd9e3717c4f8cf0d4c879a0feaaa91d459..3c3e9b185ae35d92b5a2daae994a4a9e76f9cc84 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -129,7 +129,8 @@ struct mptcp_skb_cb {
u64 map_seq;
u64 end_seq;
u32 offset;
- u8 has_rxtstamp:1;
+ u8 has_rxtstamp;
+ u8 cant_coalesce;
};
#define MPTCP_SKB_CB(__skb) ((struct mptcp_skb_cb *)&((__skb)->cb[0]))
@@ -1059,8 +1060,6 @@ void mptcp_event_pm_listener(const struct sock *ssk,
enum mptcp_event_type event);
bool mptcp_userspace_pm_active(const struct mptcp_sock *msk);
-void __mptcp_fastopen_gen_msk_ackseq(struct mptcp_sock *msk, struct mptcp_subflow_context *subflow,
- const struct mptcp_options_received *mp_opt);
void mptcp_fastopen_subflow_synack_set_params(struct mptcp_subflow_context *subflow,
struct request_sock *req);
int mptcp_nl_fill_addr(struct sk_buff *skb,
diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index 2926bdf88e42c5f2db6875b00b4eca2dbf49dba2..d2caffa56bdd98f5fd9ef07fdcb3610ea186b848 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -802,9 +802,6 @@ void __mptcp_subflow_fully_established(struct mptcp_sock *msk,
subflow_set_remote_key(msk, subflow, mp_opt);
WRITE_ONCE(subflow->fully_established, true);
WRITE_ONCE(msk->fully_established, true);
-
- if (subflow->is_mptfo)
- __mptcp_fastopen_gen_msk_ackseq(msk, subflow, mp_opt);
}
static struct sock *subflow_syn_recv_sock(const struct sock *sk,
--
2.47.1
* [PATCH net-next 3/7] mptcp: move the whole rx path under msk socket lock protection
2025-02-18 18:36 [PATCH net-next 0/7] mptcp: rx path refactor Matthieu Baerts (NGI0)
2025-02-18 18:36 ` [PATCH net-next 1/7] mptcp: consolidate subflow cleanup Matthieu Baerts (NGI0)
2025-02-18 18:36 ` [PATCH net-next 2/7] mptcp: drop __mptcp_fastopen_gen_msk_ackseq() Matthieu Baerts (NGI0)
@ 2025-02-18 18:36 ` Matthieu Baerts (NGI0)
2025-02-18 18:36 ` [PATCH net-next 4/7] mptcp: cleanup mem accounting Matthieu Baerts (NGI0)
` (4 subsequent siblings)
7 siblings, 0 replies; 9+ messages in thread
From: Matthieu Baerts (NGI0) @ 2025-02-18 18:36 UTC
To: mptcp, Mat Martineau, Geliang Tang, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman
Cc: Kuniyuki Iwashima, Willem de Bruijn, David Ahern,
Jamal Hadi Salim, Cong Wang, Jiri Pirko, netdev, linux-kernel,
Matthieu Baerts (NGI0)
From: Paolo Abeni <pabeni@redhat.com>
After commit c2e6048fa1cf ("mptcp: fix race in release_cb") we can
move the whole MPTCP rx path under the socket lock leveraging the
release_cb.
We can drop a bunch of spin_lock pairs in the receive functions, use
a single receive queue and invoke __mptcp_move_skbs only when subflows
ask for it.
This will allow more cleanup in the next patch.
Some changes are worth specific mention:
The msk rcvbuf update now always happens under both the msk and the
subflow socket locks: we can drop a bunch of ONCE annotations and
consolidate the checks.
When the skb move is deferred to msk release callback time, the msk
rcvbuf update is deferred as well; __mptcp_move_skbs() therefore also
takes care of that update.
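The deferral scheme described above can be modelled in user space. This
is a minimal sketch under simplifying assumptions, not the kernel
implementation: all toy_* names and TOY_DEQUEUE are hypothetical, there
is no real locking, no rmem limit check, and a single subflow. It only
shows the control flow: move data right away when nobody owns the msk
socket, otherwise set a flag and let the release path do the move,
including the rcvbuf update.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_DEQUEUE 0x1u	/* stands in for BIT(MPTCP_DEQUEUE) */

struct toy_msk {
	bool owned_by_user;	/* models sock_owned_by_user(sk) */
	unsigned int cb_flags;	/* deferred actions for the release path */
	int rcvbuf;
	int queued;		/* skbs in the msk receive queue */
};

struct toy_ssk {
	int rcvbuf;
	int pending;		/* skbs still sitting in the subflow */
};

/* Grow the msk receive buffer up to the subflow one. */
static void toy_rcvbuf_update(struct toy_msk *msk, const struct toy_ssk *ssk)
{
	if (ssk->rcvbuf > msk->rcvbuf)
		msk->rcvbuf = ssk->rcvbuf;
}

/* Move everything pending on the subflow; returns true if data moved. */
static bool toy_move_skbs(struct toy_msk *msk, struct toy_ssk *ssk)
{
	int moved = ssk->pending;

	toy_rcvbuf_update(msk, ssk);
	msk->queued += moved;
	ssk->pending = 0;
	return moved > 0;
}

/* Subflow side: act now if the msk is free, else defer. */
static void toy_data_ready(struct toy_msk *msk, struct toy_ssk *ssk)
{
	if (!msk->owned_by_user)
		toy_move_skbs(msk, ssk);
	else
		msk->cb_flags |= TOY_DEQUEUE;
}

/* Owner side: run the deferred work, then drop ownership.
 * The toy folds the release callback and the unlock into one step.
 */
static void toy_release(struct toy_msk *msk, struct toy_ssk *ssk)
{
	unsigned int flags = msk->cb_flags;

	assert(msk->owned_by_user);
	msk->cb_flags = 0;
	if (flags & TOY_DEQUEUE)
		toy_move_skbs(msk, ssk);
	msk->owned_by_user = false;
}
```

Calling toy_data_ready() while owned_by_user is set leaves the data on
the subflow and only records TOY_DEQUEUE; the following toy_release()
performs both the move and the receive buffer update.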
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
---
net/mptcp/fastopen.c | 1 +
net/mptcp/protocol.c | 123 ++++++++++++++++++++++++---------------------------
net/mptcp/protocol.h | 2 +-
3 files changed, 60 insertions(+), 66 deletions(-)
diff --git a/net/mptcp/fastopen.c b/net/mptcp/fastopen.c
index 7777f5a2d14379853fcd13c4b57c5569be05a2e4..f85ad19f3dd6c4bcbf31228054ccfd30755db5bc 100644
--- a/net/mptcp/fastopen.c
+++ b/net/mptcp/fastopen.c
@@ -48,6 +48,7 @@ void mptcp_fastopen_subflow_synack_set_params(struct mptcp_subflow_context *subf
MPTCP_SKB_CB(skb)->cant_coalesce = 1;
mptcp_data_lock(sk);
+ DEBUG_NET_WARN_ON_ONCE(sock_owned_by_user_nocheck(sk));
mptcp_set_owner_r(skb, sk);
__skb_queue_tail(&sk->sk_receive_queue, skb);
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 55f9698f3c22f1dc423a7605c7b00bfda162b54c..8bdc7a7a58f31ac74d6a2156b2297af9cd90c635 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -645,18 +645,6 @@ static bool __mptcp_move_skbs_from_subflow(struct mptcp_sock *msk,
bool more_data_avail;
struct tcp_sock *tp;
bool done = false;
- int sk_rbuf;
-
- sk_rbuf = READ_ONCE(sk->sk_rcvbuf);
-
- if (!(sk->sk_userlocks & SOCK_RCVBUF_LOCK)) {
- int ssk_rbuf = READ_ONCE(ssk->sk_rcvbuf);
-
- if (unlikely(ssk_rbuf > sk_rbuf)) {
- WRITE_ONCE(sk->sk_rcvbuf, ssk_rbuf);
- sk_rbuf = ssk_rbuf;
- }
- }
pr_debug("msk=%p ssk=%p\n", msk, ssk);
tp = tcp_sk(ssk);
@@ -724,7 +712,7 @@ static bool __mptcp_move_skbs_from_subflow(struct mptcp_sock *msk,
WRITE_ONCE(tp->copied_seq, seq);
more_data_avail = mptcp_subflow_data_available(ssk);
- if (atomic_read(&sk->sk_rmem_alloc) > sk_rbuf) {
+ if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf) {
done = true;
break;
}
@@ -848,11 +836,30 @@ static bool move_skbs_to_msk(struct mptcp_sock *msk, struct sock *ssk)
return moved > 0;
}
+static void __mptcp_rcvbuf_update(struct sock *sk, struct sock *ssk)
+{
+ if (unlikely(ssk->sk_rcvbuf > sk->sk_rcvbuf))
+ WRITE_ONCE(sk->sk_rcvbuf, ssk->sk_rcvbuf);
+}
+
+static void __mptcp_data_ready(struct sock *sk, struct sock *ssk)
+{
+ struct mptcp_sock *msk = mptcp_sk(sk);
+
+ __mptcp_rcvbuf_update(sk, ssk);
+
+ /* over limit? can't append more skbs to msk, Also, no need to wake-up*/
+ if (__mptcp_rmem(sk) > sk->sk_rcvbuf)
+ return;
+
+ /* Wake-up the reader only for in-sequence data */
+ if (move_skbs_to_msk(msk, ssk) && mptcp_epollin_ready(sk))
+ sk->sk_data_ready(sk);
+}
+
void mptcp_data_ready(struct sock *sk, struct sock *ssk)
{
struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(ssk);
- struct mptcp_sock *msk = mptcp_sk(sk);
- int sk_rbuf, ssk_rbuf;
/* The peer can send data while we are shutting down this
* subflow at msk destruction time, but we must avoid enqueuing
@@ -861,19 +868,11 @@ void mptcp_data_ready(struct sock *sk, struct sock *ssk)
if (unlikely(subflow->disposable))
return;
- ssk_rbuf = READ_ONCE(ssk->sk_rcvbuf);
- sk_rbuf = READ_ONCE(sk->sk_rcvbuf);
- if (unlikely(ssk_rbuf > sk_rbuf))
- sk_rbuf = ssk_rbuf;
-
- /* over limit? can't append more skbs to msk, Also, no need to wake-up*/
- if (__mptcp_rmem(sk) > sk_rbuf)
- return;
-
- /* Wake-up the reader only for in-sequence data */
mptcp_data_lock(sk);
- if (move_skbs_to_msk(msk, ssk) && mptcp_epollin_ready(sk))
- sk->sk_data_ready(sk);
+ if (!sock_owned_by_user(sk))
+ __mptcp_data_ready(sk, ssk);
+ else
+ __set_bit(MPTCP_DEQUEUE, &mptcp_sk(sk)->cb_flags);
mptcp_data_unlock(sk);
}
@@ -1946,16 +1945,17 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
static void mptcp_rcv_space_adjust(struct mptcp_sock *msk, int copied);
-static int __mptcp_recvmsg_mskq(struct mptcp_sock *msk,
+static int __mptcp_recvmsg_mskq(struct sock *sk,
struct msghdr *msg,
size_t len, int flags,
struct scm_timestamping_internal *tss,
int *cmsg_flags)
{
+ struct mptcp_sock *msk = mptcp_sk(sk);
struct sk_buff *skb, *tmp;
int copied = 0;
- skb_queue_walk_safe(&msk->receive_queue, skb, tmp) {
+ skb_queue_walk_safe(&sk->sk_receive_queue, skb, tmp) {
u32 offset = MPTCP_SKB_CB(skb)->offset;
u32 data_len = skb->len - offset;
u32 count = min_t(size_t, len - copied, data_len);
@@ -1990,7 +1990,7 @@ static int __mptcp_recvmsg_mskq(struct mptcp_sock *msk,
/* we will bulk release the skb memory later */
skb->destructor = NULL;
WRITE_ONCE(msk->rmem_released, msk->rmem_released + skb->truesize);
- __skb_unlink(skb, &msk->receive_queue);
+ __skb_unlink(skb, &sk->sk_receive_queue);
__kfree_skb(skb);
msk->bytes_consumed += count;
}
@@ -2115,54 +2115,46 @@ static void __mptcp_update_rmem(struct sock *sk)
WRITE_ONCE(msk->rmem_released, 0);
}
-static void __mptcp_splice_receive_queue(struct sock *sk)
+static bool __mptcp_move_skbs(struct sock *sk)
{
+ struct mptcp_subflow_context *subflow;
struct mptcp_sock *msk = mptcp_sk(sk);
-
- skb_queue_splice_tail_init(&sk->sk_receive_queue, &msk->receive_queue);
-}
-
-static bool __mptcp_move_skbs(struct mptcp_sock *msk)
-{
- struct sock *sk = (struct sock *)msk;
unsigned int moved = 0;
bool ret, done;
+ /* verify we can move any data from the subflow, eventually updating */
+ if (!(sk->sk_userlocks & SOCK_RCVBUF_LOCK))
+ mptcp_for_each_subflow(msk, subflow)
+ __mptcp_rcvbuf_update(sk, subflow->tcp_sock);
+
+ if (__mptcp_rmem(sk) > sk->sk_rcvbuf)
+ return false;
+
do {
struct sock *ssk = mptcp_subflow_recv_lookup(msk);
bool slowpath;
- /* we can have data pending in the subflows only if the msk
- * receive buffer was full at subflow_data_ready() time,
- * that is an unlikely slow path.
- */
- if (likely(!ssk))
+ if (unlikely(!ssk))
break;
slowpath = lock_sock_fast(ssk);
- mptcp_data_lock(sk);
__mptcp_update_rmem(sk);
done = __mptcp_move_skbs_from_subflow(msk, ssk, &moved);
- mptcp_data_unlock(sk);
if (unlikely(ssk->sk_err))
__mptcp_error_report(sk);
unlock_sock_fast(ssk, slowpath);
} while (!done);
- /* acquire the data lock only if some input data is pending */
ret = moved > 0;
if (!RB_EMPTY_ROOT(&msk->out_of_order_queue) ||
- !skb_queue_empty_lockless(&sk->sk_receive_queue)) {
- mptcp_data_lock(sk);
+ !skb_queue_empty(&sk->sk_receive_queue)) {
__mptcp_update_rmem(sk);
ret |= __mptcp_ofo_queue(msk);
- __mptcp_splice_receive_queue(sk);
- mptcp_data_unlock(sk);
}
if (ret)
mptcp_check_data_fin((struct sock *)msk);
- return !skb_queue_empty(&msk->receive_queue);
+ return ret;
}
static unsigned int mptcp_inq_hint(const struct sock *sk)
@@ -2170,7 +2162,7 @@ static unsigned int mptcp_inq_hint(const struct sock *sk)
const struct mptcp_sock *msk = mptcp_sk(sk);
const struct sk_buff *skb;
- skb = skb_peek(&msk->receive_queue);
+ skb = skb_peek(&sk->sk_receive_queue);
if (skb) {
u64 hint_val = READ_ONCE(msk->ack_seq) - MPTCP_SKB_CB(skb)->map_seq;
@@ -2216,7 +2208,7 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
while (copied < len) {
int err, bytes_read;
- bytes_read = __mptcp_recvmsg_mskq(msk, msg, len - copied, flags, &tss, &cmsg_flags);
+ bytes_read = __mptcp_recvmsg_mskq(sk, msg, len - copied, flags, &tss, &cmsg_flags);
if (unlikely(bytes_read < 0)) {
if (!copied)
copied = bytes_read;
@@ -2225,7 +2217,7 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
copied += bytes_read;
- if (skb_queue_empty(&msk->receive_queue) && __mptcp_move_skbs(msk))
+ if (skb_queue_empty(&sk->sk_receive_queue) && __mptcp_move_skbs(sk))
continue;
/* only the MPTCP socket status is relevant here. The exit
@@ -2251,7 +2243,7 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
/* race breaker: the shutdown could be after the
* previous receive queue check
*/
- if (__mptcp_move_skbs(msk))
+ if (__mptcp_move_skbs(sk))
continue;
break;
}
@@ -2295,9 +2287,8 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
}
}
- pr_debug("msk=%p rx queue empty=%d:%d copied=%d\n",
- msk, skb_queue_empty_lockless(&sk->sk_receive_queue),
- skb_queue_empty(&msk->receive_queue), copied);
+ pr_debug("msk=%p rx queue empty=%d copied=%d\n",
+ msk, skb_queue_empty(&sk->sk_receive_queue), copied);
release_sock(sk);
return copied;
@@ -2824,7 +2815,6 @@ static void __mptcp_init_sock(struct sock *sk)
INIT_LIST_HEAD(&msk->join_list);
INIT_LIST_HEAD(&msk->rtx_queue);
INIT_WORK(&msk->work, mptcp_worker);
- __skb_queue_head_init(&msk->receive_queue);
msk->out_of_order_queue = RB_ROOT;
msk->first_pending = NULL;
WRITE_ONCE(msk->rmem_fwd_alloc, 0);
@@ -3407,12 +3397,8 @@ void mptcp_destroy_common(struct mptcp_sock *msk, unsigned int flags)
mptcp_for_each_subflow_safe(msk, subflow, tmp)
__mptcp_close_ssk(sk, mptcp_subflow_tcp_sock(subflow), subflow, flags);
- /* move to sk_receive_queue, sk_stream_kill_queues will purge it */
- mptcp_data_lock(sk);
- skb_queue_splice_tail_init(&msk->receive_queue, &sk->sk_receive_queue);
__skb_queue_purge(&sk->sk_receive_queue);
skb_rbtree_purge(&msk->out_of_order_queue);
- mptcp_data_unlock(sk);
/* move all the rx fwd alloc into the sk_mem_reclaim_final in
* inet_sock_destruct() will dispose it
@@ -3455,7 +3441,8 @@ void __mptcp_check_push(struct sock *sk, struct sock *ssk)
#define MPTCP_FLAGS_PROCESS_CTX_NEED (BIT(MPTCP_PUSH_PENDING) | \
BIT(MPTCP_RETRANSMIT) | \
- BIT(MPTCP_FLUSH_JOIN_LIST))
+ BIT(MPTCP_FLUSH_JOIN_LIST) | \
+ BIT(MPTCP_DEQUEUE))
/* processes deferred events and flush wmem */
static void mptcp_release_cb(struct sock *sk)
@@ -3489,6 +3476,11 @@ static void mptcp_release_cb(struct sock *sk)
__mptcp_push_pending(sk, 0);
if (flags & BIT(MPTCP_RETRANSMIT))
__mptcp_retrans(sk);
+ if ((flags & BIT(MPTCP_DEQUEUE)) && __mptcp_move_skbs(sk)) {
+ /* notify ack seq update */
+ mptcp_cleanup_rbuf(msk, 0);
+ sk->sk_data_ready(sk);
+ }
cond_resched();
spin_lock_bh(&sk->sk_lock.slock);
@@ -3726,7 +3718,8 @@ static int mptcp_ioctl(struct sock *sk, int cmd, int *karg)
return -EINVAL;
lock_sock(sk);
- __mptcp_move_skbs(msk);
+ if (__mptcp_move_skbs(sk))
+ mptcp_cleanup_rbuf(msk, 0);
*karg = mptcp_inq_hint(sk);
release_sock(sk);
break;
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index 3c3e9b185ae35d92b5a2daae994a4a9e76f9cc84..753456b73f90879126a36964924d2b6e08e2a1cc 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -124,6 +124,7 @@
#define MPTCP_FLUSH_JOIN_LIST 5
#define MPTCP_SYNC_STATE 6
#define MPTCP_SYNC_SNDBUF 7
+#define MPTCP_DEQUEUE 8
struct mptcp_skb_cb {
u64 map_seq;
@@ -325,7 +326,6 @@ struct mptcp_sock {
struct work_struct work;
struct sk_buff *ooo_last_skb;
struct rb_root out_of_order_queue;
- struct sk_buff_head receive_queue;
struct list_head conn_list;
struct list_head rtx_queue;
struct mptcp_data_frag *first_pending;
--
2.47.1
* [PATCH net-next 4/7] mptcp: cleanup mem accounting
2025-02-18 18:36 [PATCH net-next 0/7] mptcp: rx path refactor Matthieu Baerts (NGI0)
` (2 preceding siblings ...)
2025-02-18 18:36 ` [PATCH net-next 3/7] mptcp: move the whole rx path under msk socket lock protection Matthieu Baerts (NGI0)
@ 2025-02-18 18:36 ` Matthieu Baerts (NGI0)
2025-02-18 18:36 ` [PATCH net-next 5/7] net: dismiss sk_forward_alloc_get() Matthieu Baerts (NGI0)
` (3 subsequent siblings)
7 siblings, 0 replies; 9+ messages in thread
From: Matthieu Baerts (NGI0) @ 2025-02-18 18:36 UTC
To: mptcp, Mat Martineau, Geliang Tang, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman
Cc: Kuniyuki Iwashima, Willem de Bruijn, David Ahern,
Jamal Hadi Salim, Cong Wang, Jiri Pirko, netdev, linux-kernel,
Matthieu Baerts (NGI0)
From: Paolo Abeni <pabeni@redhat.com>
After the previous patch, updating sk_forward_alloc is cheap and
we can drop a lot of complexity from the MPTCP memory accounting,
removing the custom fwd mem allocations for rmem.
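As a rough illustration of the accounting this patch switches to, the
sketch below models the two counters involved in user space. It is a
toy under stated assumptions: struct toy_sock and the toy_* helpers are
hypothetical names, and the page-granular scheduling and reclaim done
by the real sk_rmem_schedule() and sk_mem_uncharge() are left out. It
only shows that charging at enqueue time and uncharging at dequeue time
keep both counters balanced without any MPTCP-private bookkeeping.

```c
#include <assert.h>

struct toy_sock {
	int rmem_alloc;		/* models sk->sk_rmem_alloc */
	int forward_alloc;	/* models sk->sk_forward_alloc */
};

/* Consume forward-allocated memory, as sk_mem_charge() does. */
static void toy_mem_charge(struct toy_sock *sk, int size)
{
	sk->forward_alloc -= size;
}

/* Give the memory back (reclaim of whole pages is not modelled). */
static void toy_mem_uncharge(struct toy_sock *sk, int size)
{
	sk->forward_alloc += size;
}

/* Enqueue: the accounting part of skb_set_owner_r(). */
static void toy_set_owner_r(struct toy_sock *sk, int truesize)
{
	sk->rmem_alloc += truesize;
	toy_mem_charge(sk, truesize);
}

/* Dequeue: the inline release done in the recvmsg path. */
static void toy_release_skb(struct toy_sock *sk, int truesize)
{
	assert(sk->rmem_alloc >= truesize);
	sk->rmem_alloc -= truesize;
	toy_mem_uncharge(sk, truesize);
}
```

Every toy_set_owner_r() is undone by a matching toy_release_skb(), so
both counters return to their starting values once the queue drains.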
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
---
net/mptcp/fastopen.c | 2 +-
net/mptcp/protocol.c | 115 ++++-----------------------------------------------
net/mptcp/protocol.h | 4 +-
3 files changed, 10 insertions(+), 111 deletions(-)
diff --git a/net/mptcp/fastopen.c b/net/mptcp/fastopen.c
index f85ad19f3dd6c4bcbf31228054ccfd30755db5bc..b9e4511979028c10d232efbcaca68400fc4f2e7a 100644
--- a/net/mptcp/fastopen.c
+++ b/net/mptcp/fastopen.c
@@ -50,7 +50,7 @@ void mptcp_fastopen_subflow_synack_set_params(struct mptcp_subflow_context *subf
mptcp_data_lock(sk);
DEBUG_NET_WARN_ON_ONCE(sock_owned_by_user_nocheck(sk));
- mptcp_set_owner_r(skb, sk);
+ skb_set_owner_r(skb, sk);
__skb_queue_tail(&sk->sk_receive_queue, skb);
mptcp_sk(sk)->bytes_received += skb->len;
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 8bdc7a7a58f31ac74d6a2156b2297af9cd90c635..080877f8daf7e3ff36531f3e11079d2163676f2d 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -118,17 +118,6 @@ static void mptcp_drop(struct sock *sk, struct sk_buff *skb)
__kfree_skb(skb);
}
-static void mptcp_rmem_fwd_alloc_add(struct sock *sk, int size)
-{
- WRITE_ONCE(mptcp_sk(sk)->rmem_fwd_alloc,
- mptcp_sk(sk)->rmem_fwd_alloc + size);
-}
-
-static void mptcp_rmem_charge(struct sock *sk, int size)
-{
- mptcp_rmem_fwd_alloc_add(sk, -size);
-}
-
static bool mptcp_try_coalesce(struct sock *sk, struct sk_buff *to,
struct sk_buff *from)
{
@@ -151,7 +140,7 @@ static bool mptcp_try_coalesce(struct sock *sk, struct sk_buff *to,
* negative one
*/
atomic_add(delta, &sk->sk_rmem_alloc);
- mptcp_rmem_charge(sk, delta);
+ sk_mem_charge(sk, delta);
kfree_skb_partial(from, fragstolen);
return true;
@@ -166,44 +155,6 @@ static bool mptcp_ooo_try_coalesce(struct mptcp_sock *msk, struct sk_buff *to,
return mptcp_try_coalesce((struct sock *)msk, to, from);
}
-static void __mptcp_rmem_reclaim(struct sock *sk, int amount)
-{
- amount >>= PAGE_SHIFT;
- mptcp_rmem_charge(sk, amount << PAGE_SHIFT);
- __sk_mem_reduce_allocated(sk, amount);
-}
-
-static void mptcp_rmem_uncharge(struct sock *sk, int size)
-{
- struct mptcp_sock *msk = mptcp_sk(sk);
- int reclaimable;
-
- mptcp_rmem_fwd_alloc_add(sk, size);
- reclaimable = msk->rmem_fwd_alloc - sk_unused_reserved_mem(sk);
-
- /* see sk_mem_uncharge() for the rationale behind the following schema */
- if (unlikely(reclaimable >= PAGE_SIZE))
- __mptcp_rmem_reclaim(sk, reclaimable);
-}
-
-static void mptcp_rfree(struct sk_buff *skb)
-{
- unsigned int len = skb->truesize;
- struct sock *sk = skb->sk;
-
- atomic_sub(len, &sk->sk_rmem_alloc);
- mptcp_rmem_uncharge(sk, len);
-}
-
-void mptcp_set_owner_r(struct sk_buff *skb, struct sock *sk)
-{
- skb_orphan(skb);
- skb->sk = sk;
- skb->destructor = mptcp_rfree;
- atomic_add(skb->truesize, &sk->sk_rmem_alloc);
- mptcp_rmem_charge(sk, skb->truesize);
-}
-
/* "inspired" by tcp_data_queue_ofo(), main differences:
* - use mptcp seqs
* - don't cope with sacks
@@ -316,25 +267,7 @@ static void mptcp_data_queue_ofo(struct mptcp_sock *msk, struct sk_buff *skb)
end:
skb_condense(skb);
- mptcp_set_owner_r(skb, sk);
-}
-
-static bool mptcp_rmem_schedule(struct sock *sk, struct sock *ssk, int size)
-{
- struct mptcp_sock *msk = mptcp_sk(sk);
- int amt, amount;
-
- if (size <= msk->rmem_fwd_alloc)
- return true;
-
- size -= msk->rmem_fwd_alloc;
- amt = sk_mem_pages(size);
- amount = amt << PAGE_SHIFT;
- if (!__sk_mem_raise_allocated(sk, size, amt, SK_MEM_RECV))
- return false;
-
- mptcp_rmem_fwd_alloc_add(sk, amount);
- return true;
+ skb_set_owner_r(skb, sk);
}
static bool __mptcp_move_skb(struct mptcp_sock *msk, struct sock *ssk,
@@ -352,7 +285,7 @@ static bool __mptcp_move_skb(struct mptcp_sock *msk, struct sock *ssk,
skb_orphan(skb);
/* try to fetch required memory from subflow */
- if (!mptcp_rmem_schedule(sk, ssk, skb->truesize)) {
+ if (!sk_rmem_schedule(sk, skb, skb->truesize)) {
MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_RCVPRUNED);
goto drop;
}
@@ -377,7 +310,7 @@ static bool __mptcp_move_skb(struct mptcp_sock *msk, struct sock *ssk,
if (tail && mptcp_try_coalesce(sk, tail, skb))
return true;
- mptcp_set_owner_r(skb, sk);
+ skb_set_owner_r(skb, sk);
__skb_queue_tail(&sk->sk_receive_queue, skb);
return true;
} else if (after64(MPTCP_SKB_CB(skb)->map_seq, msk->ack_seq)) {
@@ -1987,9 +1920,10 @@ static int __mptcp_recvmsg_mskq(struct sock *sk,
}
if (!(flags & MSG_PEEK)) {
- /* we will bulk release the skb memory later */
+ /* avoid the indirect call, we know the destructor is sock_rfree */
skb->destructor = NULL;
- WRITE_ONCE(msk->rmem_released, msk->rmem_released + skb->truesize);
+ atomic_sub(skb->truesize, &sk->sk_rmem_alloc);
+ sk_mem_uncharge(sk, skb->truesize);
__skb_unlink(skb, &sk->sk_receive_queue);
__kfree_skb(skb);
msk->bytes_consumed += count;
@@ -2103,18 +2037,6 @@ static void mptcp_rcv_space_adjust(struct mptcp_sock *msk, int copied)
msk->rcvq_space.time = mstamp;
}
-static void __mptcp_update_rmem(struct sock *sk)
-{
- struct mptcp_sock *msk = mptcp_sk(sk);
-
- if (!msk->rmem_released)
- return;
-
- atomic_sub(msk->rmem_released, &sk->sk_rmem_alloc);
- mptcp_rmem_uncharge(sk, msk->rmem_released);
- WRITE_ONCE(msk->rmem_released, 0);
-}
-
static bool __mptcp_move_skbs(struct sock *sk)
{
struct mptcp_subflow_context *subflow;
@@ -2138,7 +2060,6 @@ static bool __mptcp_move_skbs(struct sock *sk)
break;
slowpath = lock_sock_fast(ssk);
- __mptcp_update_rmem(sk);
done = __mptcp_move_skbs_from_subflow(msk, ssk, &moved);
if (unlikely(ssk->sk_err))
@@ -2146,12 +2067,7 @@ static bool __mptcp_move_skbs(struct sock *sk)
unlock_sock_fast(ssk, slowpath);
} while (!done);
- ret = moved > 0;
- if (!RB_EMPTY_ROOT(&msk->out_of_order_queue) ||
- !skb_queue_empty(&sk->sk_receive_queue)) {
- __mptcp_update_rmem(sk);
- ret |= __mptcp_ofo_queue(msk);
- }
+ ret = moved > 0 || __mptcp_ofo_queue(msk);
if (ret)
mptcp_check_data_fin((struct sock *)msk);
return ret;
@@ -2817,8 +2733,6 @@ static void __mptcp_init_sock(struct sock *sk)
INIT_WORK(&msk->work, mptcp_worker);
msk->out_of_order_queue = RB_ROOT;
msk->first_pending = NULL;
- WRITE_ONCE(msk->rmem_fwd_alloc, 0);
- WRITE_ONCE(msk->rmem_released, 0);
msk->timer_ival = TCP_RTO_MIN;
msk->scaling_ratio = TCP_DEFAULT_SCALING_RATIO;
@@ -3044,8 +2958,6 @@ static void __mptcp_destroy_sock(struct sock *sk)
sk->sk_prot->destroy(sk);
- WARN_ON_ONCE(READ_ONCE(msk->rmem_fwd_alloc));
- WARN_ON_ONCE(msk->rmem_released);
sk_stream_kill_queues(sk);
xfrm_sk_free_policy(sk);
@@ -3403,8 +3315,6 @@ void mptcp_destroy_common(struct mptcp_sock *msk, unsigned int flags)
/* move all the rx fwd alloc into the sk_mem_reclaim_final in
* inet_sock_destruct() will dispose it
*/
- sk_forward_alloc_add(sk, msk->rmem_fwd_alloc);
- WRITE_ONCE(msk->rmem_fwd_alloc, 0);
mptcp_token_destroy(msk);
mptcp_pm_free_anno_list(msk);
mptcp_free_local_addr_list(msk);
@@ -3500,8 +3410,6 @@ static void mptcp_release_cb(struct sock *sk)
if (__test_and_clear_bit(MPTCP_SYNC_SNDBUF, &msk->cb_flags))
__mptcp_sync_sndbuf(sk);
}
-
- __mptcp_update_rmem(sk);
}
/* MP_JOIN client subflow must wait for 4th ack before sending any data:
@@ -3672,12 +3580,6 @@ static void mptcp_shutdown(struct sock *sk, int how)
__mptcp_wr_shutdown(sk);
}
-static int mptcp_forward_alloc_get(const struct sock *sk)
-{
- return READ_ONCE(sk->sk_forward_alloc) +
- READ_ONCE(mptcp_sk(sk)->rmem_fwd_alloc);
-}
-
static int mptcp_ioctl_outq(const struct mptcp_sock *msk, u64 v)
{
const struct sock *sk = (void *)msk;
@@ -3836,7 +3738,6 @@ static struct proto mptcp_prot = {
.hash = mptcp_hash,
.unhash = mptcp_unhash,
.get_port = mptcp_get_port,
- .forward_alloc_get = mptcp_forward_alloc_get,
.stream_memory_free = mptcp_stream_memory_free,
.sockets_allocated = &mptcp_sockets_allocated,
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index 753456b73f90879126a36964924d2b6e08e2a1cc..613d556ed938a99a2800b4384ee4c6cda9483381 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -281,7 +281,6 @@ struct mptcp_sock {
u64 rcv_data_fin_seq;
u64 bytes_retrans;
u64 bytes_consumed;
- int rmem_fwd_alloc;
int snd_burst;
int old_wspace;
u64 recovery_snd_nxt; /* in recovery mode accept up to this seq;
@@ -296,7 +295,6 @@ struct mptcp_sock {
u32 last_ack_recv;
unsigned long timer_ival;
u32 token;
- int rmem_released;
unsigned long flags;
unsigned long cb_flags;
bool recovery; /* closing subflow write queue reinjected */
@@ -387,7 +385,7 @@ static inline void msk_owned_by_me(const struct mptcp_sock *msk)
*/
static inline int __mptcp_rmem(const struct sock *sk)
{
- return atomic_read(&sk->sk_rmem_alloc) - READ_ONCE(mptcp_sk(sk)->rmem_released);
+ return atomic_read(&sk->sk_rmem_alloc);
}
static inline int mptcp_win_from_space(const struct sock *sk, int space)
--
2.47.1
* [PATCH net-next 5/7] net: dismiss sk_forward_alloc_get()
2025-02-18 18:36 [PATCH net-next 0/7] mptcp: rx path refactor Matthieu Baerts (NGI0)
` (3 preceding siblings ...)
2025-02-18 18:36 ` [PATCH net-next 4/7] mptcp: cleanup mem accounting Matthieu Baerts (NGI0)
@ 2025-02-18 18:36 ` Matthieu Baerts (NGI0)
2025-02-18 18:36 ` [PATCH net-next 6/7] mptcp: dismiss __mptcp_rmem() Matthieu Baerts (NGI0)
` (2 subsequent siblings)
7 siblings, 0 replies; 9+ messages in thread
From: Matthieu Baerts (NGI0) @ 2025-02-18 18:36 UTC (permalink / raw)
To: mptcp, Mat Martineau, Geliang Tang, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman
Cc: Kuniyuki Iwashima, Willem de Bruijn, David Ahern,
Jamal Hadi Salim, Cong Wang, Jiri Pirko, netdev, linux-kernel,
Matthieu Baerts (NGI0)
From: Paolo Abeni <pabeni@redhat.com>
After the previous patch we can remove the forward_alloc_get
proto callback, basically reverting commit 292e6077b040 ("net: introduce
sk_forward_alloc_get()") and commit 66d58f046c9d ("net: use
sk_forward_alloc_get() in sk_get_meminfo()").
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
---
include/net/sock.h | 13 -------------
net/core/sock.c | 2 +-
net/ipv4/af_inet.c | 2 +-
net/ipv4/inet_diag.c | 2 +-
net/sched/em_meta.c | 2 +-
5 files changed, 4 insertions(+), 17 deletions(-)
diff --git a/include/net/sock.h b/include/net/sock.h
index 60ebf3c7b229e257b164e0de1f56543ea69f38f3..ac7fb5bd8ef9af10135a6e703408f2b24bd3d713 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1285,10 +1285,6 @@ struct proto {
unsigned int inuse_idx;
#endif
-#if IS_ENABLED(CONFIG_MPTCP)
- int (*forward_alloc_get)(const struct sock *sk);
-#endif
-
bool (*stream_memory_free)(const struct sock *sk, int wake);
bool (*sock_is_readable)(struct sock *sk);
/* Memory pressure */
@@ -1349,15 +1345,6 @@ int sock_load_diag_module(int family, int protocol);
INDIRECT_CALLABLE_DECLARE(bool tcp_stream_memory_free(const struct sock *sk, int wake));
-static inline int sk_forward_alloc_get(const struct sock *sk)
-{
-#if IS_ENABLED(CONFIG_MPTCP)
- if (sk->sk_prot->forward_alloc_get)
- return sk->sk_prot->forward_alloc_get(sk);
-#endif
- return READ_ONCE(sk->sk_forward_alloc);
-}
-
static inline bool __sk_stream_memory_free(const struct sock *sk, int wake)
{
if (READ_ONCE(sk->sk_wmem_queued) >= READ_ONCE(sk->sk_sndbuf))
diff --git a/net/core/sock.c b/net/core/sock.c
index 53c7af0038c4fca630e1ac2ebecf55558cb16eef..0d385bf27b38d97458e6a695a559f4f1600773c4 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -3882,7 +3882,7 @@ void sk_get_meminfo(const struct sock *sk, u32 *mem)
mem[SK_MEMINFO_RCVBUF] = READ_ONCE(sk->sk_rcvbuf);
mem[SK_MEMINFO_WMEM_ALLOC] = sk_wmem_alloc_get(sk);
mem[SK_MEMINFO_SNDBUF] = READ_ONCE(sk->sk_sndbuf);
- mem[SK_MEMINFO_FWD_ALLOC] = sk_forward_alloc_get(sk);
+ mem[SK_MEMINFO_FWD_ALLOC] = READ_ONCE(sk->sk_forward_alloc);
mem[SK_MEMINFO_WMEM_QUEUED] = READ_ONCE(sk->sk_wmem_queued);
mem[SK_MEMINFO_OPTMEM] = atomic_read(&sk->sk_omem_alloc);
mem[SK_MEMINFO_BACKLOG] = READ_ONCE(sk->sk_backlog.len);
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 21f46ee7b6e95329a2f7f0e0429eebf1648e7f9d..5df1f1325259d9b9dbe3be19a81066f85cf306e5 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -153,7 +153,7 @@ void inet_sock_destruct(struct sock *sk)
WARN_ON_ONCE(atomic_read(&sk->sk_rmem_alloc));
WARN_ON_ONCE(refcount_read(&sk->sk_wmem_alloc));
WARN_ON_ONCE(sk->sk_wmem_queued);
- WARN_ON_ONCE(sk_forward_alloc_get(sk));
+ WARN_ON_ONCE(sk->sk_forward_alloc);
kfree(rcu_dereference_protected(inet->inet_opt, 1));
dst_release(rcu_dereference_protected(sk->sk_dst_cache, 1));
diff --git a/net/ipv4/inet_diag.c b/net/ipv4/inet_diag.c
index 321acc8abf17e8c7d6a4e3326615123fff19deab..efe2a085cf68e90cd1e79b5556e667a0fd044bfd 100644
--- a/net/ipv4/inet_diag.c
+++ b/net/ipv4/inet_diag.c
@@ -282,7 +282,7 @@ int inet_sk_diag_fill(struct sock *sk, struct inet_connection_sock *icsk,
struct inet_diag_meminfo minfo = {
.idiag_rmem = sk_rmem_alloc_get(sk),
.idiag_wmem = READ_ONCE(sk->sk_wmem_queued),
- .idiag_fmem = sk_forward_alloc_get(sk),
+ .idiag_fmem = READ_ONCE(sk->sk_forward_alloc),
.idiag_tmem = sk_wmem_alloc_get(sk),
};
diff --git a/net/sched/em_meta.c b/net/sched/em_meta.c
index 8996c73c9779b5fa804e6f913834cf1fe4d071e6..3f2e707a11d18922d7d9dd93e8315c1ab26eebc7 100644
--- a/net/sched/em_meta.c
+++ b/net/sched/em_meta.c
@@ -460,7 +460,7 @@ META_COLLECTOR(int_sk_fwd_alloc)
*err = -1;
return;
}
- dst->value = sk_forward_alloc_get(sk);
+ dst->value = READ_ONCE(sk->sk_forward_alloc);
}
META_COLLECTOR(int_sk_sndbuf)
--
2.47.1
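The pattern removed by this patch can be sketched in plain user-space C. This is only an illustration under stated assumptions: the `struct proto` and `struct sock` below are hypothetical, trimmed-down stand-ins, `extra_fwd_alloc` stands in for the removed `rmem_fwd_alloc` field, and `READ_ONCE()` is omitted. It shows a generic helper dispatching through an optional per-protocol hook, next to the direct field read that replaces it once no protocol needs a second counter.

```c
#include <assert.h>
#include <stddef.h>

struct sock;

/* Hypothetical, trimmed-down protocol descriptor. */
struct proto {
	/* Optional hook; NULL for protocols with a single counter. */
	int (*forward_alloc_get)(const struct sock *sk);
};

/* Hypothetical, trimmed-down socket. */
struct sock {
	const struct proto *prot;
	int forward_alloc;
	int extra_fwd_alloc;	/* stand-in for the removed rmem_fwd_alloc */
};

/* Before: the generic helper uses the hook when the protocol sets one. */
static int fwd_alloc_get_old(const struct sock *sk)
{
	assert(sk && sk->prot);
	if (sk->prot->forward_alloc_get)
		return sk->prot->forward_alloc_get(sk);
	return sk->forward_alloc;
}

/* What an MPTCP-like hook had to compute: the sum of two counters. */
static int mptcp_like_fwd_alloc(const struct sock *sk)
{
	return sk->forward_alloc + sk->extra_fwd_alloc;
}

/* After: one counter per socket, so callers read the field directly. */
static int fwd_alloc_get_new(const struct sock *sk)
{
	return sk->forward_alloc;
}
```

With the second counter gone, both helpers return the same value for every socket, which is why the callers in sock.c, af_inet.c, inet_diag.c and em_meta.c can switch to the plain read.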
* [PATCH net-next 6/7] mptcp: dismiss __mptcp_rmem()
2025-02-18 18:36 [PATCH net-next 0/7] mptcp: rx path refactor Matthieu Baerts (NGI0)
` (4 preceding siblings ...)
2025-02-18 18:36 ` [PATCH net-next 5/7] net: dismiss sk_forward_alloc_get() Matthieu Baerts (NGI0)
@ 2025-02-18 18:36 ` Matthieu Baerts (NGI0)
2025-02-18 18:36 ` [PATCH net-next 7/7] mptcp: micro-optimize __mptcp_move_skb() Matthieu Baerts (NGI0)
2025-02-20 3:10 ` [PATCH net-next 0/7] mptcp: rx path refactor patchwork-bot+netdevbpf
7 siblings, 0 replies; 9+ messages in thread
From: Matthieu Baerts (NGI0) @ 2025-02-18 18:36 UTC (permalink / raw)
To: mptcp, Mat Martineau, Geliang Tang, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman
Cc: Kuniyuki Iwashima, Willem de Bruijn, David Ahern,
Jamal Hadi Salim, Cong Wang, Jiri Pirko, netdev, linux-kernel,
Matthieu Baerts (NGI0)
From: Paolo Abeni <pabeni@redhat.com>
After the RX path refactor, __mptcp_rmem() became a plain wrapper for
sk_rmem_alloc access, with a slightly misleading name. Just drop it.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
---
net/mptcp/protocol.c | 8 ++++----
net/mptcp/protocol.h | 11 ++---------
2 files changed, 6 insertions(+), 13 deletions(-)
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 080877f8daf7e3ff36531f3e11079d2163676f2d..c709f654cd5a4944390cf1e160f59cd3b509b66d 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -496,7 +496,7 @@ static void mptcp_cleanup_rbuf(struct mptcp_sock *msk, int copied)
bool cleanup, rx_empty;
cleanup = (space > 0) && (space >= (old_space << 1)) && copied;
- rx_empty = !__mptcp_rmem(sk) && copied;
+ rx_empty = !sk_rmem_alloc_get(sk) && copied;
mptcp_for_each_subflow(msk, subflow) {
struct sock *ssk = mptcp_subflow_tcp_sock(subflow);
@@ -645,7 +645,7 @@ static bool __mptcp_move_skbs_from_subflow(struct mptcp_sock *msk,
WRITE_ONCE(tp->copied_seq, seq);
more_data_avail = mptcp_subflow_data_available(ssk);
- if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf) {
+ if (sk_rmem_alloc_get(sk) > sk->sk_rcvbuf) {
done = true;
break;
}
@@ -782,7 +782,7 @@ static void __mptcp_data_ready(struct sock *sk, struct sock *ssk)
__mptcp_rcvbuf_update(sk, ssk);
/* over limit? can't append more skbs to msk, Also, no need to wake-up*/
- if (__mptcp_rmem(sk) > sk->sk_rcvbuf)
+ if (sk_rmem_alloc_get(sk) > sk->sk_rcvbuf)
return;
/* Wake-up the reader only for in-sequence data */
@@ -2049,7 +2049,7 @@ static bool __mptcp_move_skbs(struct sock *sk)
mptcp_for_each_subflow(msk, subflow)
__mptcp_rcvbuf_update(sk, subflow->tcp_sock);
- if (__mptcp_rmem(sk) > sk->sk_rcvbuf)
+ if (sk_rmem_alloc_get(sk) > sk->sk_rcvbuf)
return false;
do {
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index 613d556ed938a99a2800b4384ee4c6cda9483381..a1a077bae7b6ec4fab5b266e2613acb145eb343f 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -380,14 +380,6 @@ static inline void msk_owned_by_me(const struct mptcp_sock *msk)
#define mptcp_sk(ptr) container_of_const(ptr, struct mptcp_sock, sk.icsk_inet.sk)
#endif
-/* the msk socket don't use the backlog, also account for the bulk
- * free memory
- */
-static inline int __mptcp_rmem(const struct sock *sk)
-{
- return atomic_read(&sk->sk_rmem_alloc);
-}
-
static inline int mptcp_win_from_space(const struct sock *sk, int space)
{
return __tcp_win_from_space(mptcp_sk(sk)->scaling_ratio, space);
@@ -400,7 +392,8 @@ static inline int mptcp_space_from_win(const struct sock *sk, int win)
static inline int __mptcp_space(const struct sock *sk)
{
- return mptcp_win_from_space(sk, READ_ONCE(sk->sk_rcvbuf) - __mptcp_rmem(sk));
+ return mptcp_win_from_space(sk, READ_ONCE(sk->sk_rcvbuf) -
+ sk_rmem_alloc_get(sk));
}
static inline struct mptcp_data_frag *mptcp_send_head(const struct sock *sk)
--
2.47.1
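As a rough illustration of what `__mptcp_space()` computes after this change, the sketch below takes the free receive memory (`rcvbuf - rmem_alloc`) and scales it into an advertised window. The scaling step is an assumption: it is modelled as a multiplication by a ratio expressed in 1/256 units followed by a shift by 8, and the helper names `win_from_space()` and `space_sketch()` are hypothetical. The kernel helper `__tcp_win_from_space()` should be checked for the authoritative formula.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: the scaling ratio is a fraction of 256 (2^8). */
#define RMEM_TO_WIN_SCALE 8

/* Scale a byte count of free buffer space into window bytes. */
static int win_from_space(uint8_t scaling_ratio, int space)
{
	int64_t scaled = (int64_t)space * scaling_ratio;

	return (int)(scaled >> RMEM_TO_WIN_SCALE);
}

/* Free receive memory, converted to a window, with no extra counter. */
static int space_sketch(int rcvbuf, int rmem_alloc, uint8_t scaling_ratio)
{
	assert(rcvbuf >= 0 && rmem_alloc >= 0);
	return win_from_space(scaling_ratio, rcvbuf - rmem_alloc);
}
```

For instance, with a 4096-byte receive buffer, 1024 bytes already charged and a ratio of 128 (one half), the sketch yields a 1536-byte window.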
* [PATCH net-next 7/7] mptcp: micro-optimize __mptcp_move_skb()
2025-02-18 18:36 [PATCH net-next 0/7] mptcp: rx path refactor Matthieu Baerts (NGI0)
` (5 preceding siblings ...)
2025-02-18 18:36 ` [PATCH net-next 6/7] mptcp: dismiss __mptcp_rmem() Matthieu Baerts (NGI0)
@ 2025-02-18 18:36 ` Matthieu Baerts (NGI0)
2025-02-20 3:10 ` [PATCH net-next 0/7] mptcp: rx path refactor patchwork-bot+netdevbpf
7 siblings, 0 replies; 9+ messages in thread
From: Matthieu Baerts (NGI0) @ 2025-02-18 18:36 UTC (permalink / raw)
To: mptcp, Mat Martineau, Geliang Tang, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman
Cc: Kuniyuki Iwashima, Willem de Bruijn, David Ahern,
Jamal Hadi Salim, Cong Wang, Jiri Pirko, netdev, linux-kernel,
Matthieu Baerts (NGI0)
From: Paolo Abeni <pabeni@redhat.com>
After the RX path refactor, the mentioned function is expected to run
frequently, so let's optimize it a bit.
Scan for a ready subflow starting from the last processed one, and stop
after traversing the list once or reaching the msk memory limit, instead
of looking for dubious per-subflow conditions.
Also re-order the memory limit checks to avoid duplicate tests.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
---
net/mptcp/protocol.c | 111 +++++++++++++++++++++++----------------------------
net/mptcp/protocol.h | 2 +
2 files changed, 52 insertions(+), 61 deletions(-)
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index c709f654cd5a4944390cf1e160f59cd3b509b66d..6b61b7dee33be10294ae1101f9206144878a3192 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -569,15 +569,13 @@ static void mptcp_dss_corruption(struct mptcp_sock *msk, struct sock *ssk)
}
static bool __mptcp_move_skbs_from_subflow(struct mptcp_sock *msk,
- struct sock *ssk,
- unsigned int *bytes)
+ struct sock *ssk)
{
struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(ssk);
struct sock *sk = (struct sock *)msk;
- unsigned int moved = 0;
bool more_data_avail;
struct tcp_sock *tp;
- bool done = false;
+ bool ret = false;
pr_debug("msk=%p ssk=%p\n", msk, ssk);
tp = tcp_sk(ssk);
@@ -587,20 +585,16 @@ static bool __mptcp_move_skbs_from_subflow(struct mptcp_sock *msk,
struct sk_buff *skb;
bool fin;
+ if (sk_rmem_alloc_get(sk) > sk->sk_rcvbuf)
+ break;
+
/* try to move as much data as available */
map_remaining = subflow->map_data_len -
mptcp_subflow_get_map_offset(subflow);
skb = skb_peek(&ssk->sk_receive_queue);
- if (!skb) {
- /* With racing move_skbs_to_msk() and __mptcp_move_skbs(),
- * a different CPU can have already processed the pending
- * data, stop here or we can enter an infinite loop
- */
- if (!moved)
- done = true;
+ if (unlikely(!skb))
break;
- }
if (__mptcp_check_fallback(msk)) {
/* Under fallback skbs have no MPTCP extension and TCP could
@@ -613,19 +607,13 @@ static bool __mptcp_move_skbs_from_subflow(struct mptcp_sock *msk,
offset = seq - TCP_SKB_CB(skb)->seq;
fin = TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN;
- if (fin) {
- done = true;
+ if (fin)
seq++;
- }
if (offset < skb->len) {
size_t len = skb->len - offset;
- if (tp->urg_data)
- done = true;
-
- if (__mptcp_move_skb(msk, ssk, skb, offset, len))
- moved += len;
+ ret = __mptcp_move_skb(msk, ssk, skb, offset, len) || ret;
seq += len;
if (unlikely(map_remaining < len)) {
@@ -639,22 +627,16 @@ static bool __mptcp_move_skbs_from_subflow(struct mptcp_sock *msk,
}
sk_eat_skb(ssk, skb);
- done = true;
}
WRITE_ONCE(tp->copied_seq, seq);
more_data_avail = mptcp_subflow_data_available(ssk);
- if (sk_rmem_alloc_get(sk) > sk->sk_rcvbuf) {
- done = true;
- break;
- }
} while (more_data_avail);
- if (moved > 0)
+ if (ret)
msk->last_data_recv = tcp_jiffies32;
- *bytes += moved;
- return done;
+ return ret;
}
static bool __mptcp_ofo_queue(struct mptcp_sock *msk)
@@ -748,9 +730,9 @@ void __mptcp_error_report(struct sock *sk)
static bool move_skbs_to_msk(struct mptcp_sock *msk, struct sock *ssk)
{
struct sock *sk = (struct sock *)msk;
- unsigned int moved = 0;
+ bool moved;
- __mptcp_move_skbs_from_subflow(msk, ssk, &moved);
+ moved = __mptcp_move_skbs_from_subflow(msk, ssk);
__mptcp_ofo_queue(msk);
if (unlikely(ssk->sk_err)) {
if (!sock_owned_by_user(sk))
@@ -766,7 +748,7 @@ static bool move_skbs_to_msk(struct mptcp_sock *msk, struct sock *ssk)
*/
if (mptcp_pending_data_fin(sk, NULL))
mptcp_schedule_work(sk);
- return moved > 0;
+ return moved;
}
static void __mptcp_rcvbuf_update(struct sock *sk, struct sock *ssk)
@@ -781,10 +763,6 @@ static void __mptcp_data_ready(struct sock *sk, struct sock *ssk)
__mptcp_rcvbuf_update(sk, ssk);
- /* over limit? can't append more skbs to msk, Also, no need to wake-up*/
- if (sk_rmem_alloc_get(sk) > sk->sk_rcvbuf)
- return;
-
/* Wake-up the reader only for in-sequence data */
if (move_skbs_to_msk(msk, ssk) && mptcp_epollin_ready(sk))
sk->sk_data_ready(sk);
@@ -884,20 +862,6 @@ bool mptcp_schedule_work(struct sock *sk)
return false;
}
-static struct sock *mptcp_subflow_recv_lookup(const struct mptcp_sock *msk)
-{
- struct mptcp_subflow_context *subflow;
-
- msk_owned_by_me(msk);
-
- mptcp_for_each_subflow(msk, subflow) {
- if (READ_ONCE(subflow->data_avail))
- return mptcp_subflow_tcp_sock(subflow);
- }
-
- return NULL;
-}
-
static bool mptcp_skb_can_collapse_to(u64 write_seq,
const struct sk_buff *skb,
const struct mptcp_ext *mpext)
@@ -2037,37 +2001,62 @@ static void mptcp_rcv_space_adjust(struct mptcp_sock *msk, int copied)
msk->rcvq_space.time = mstamp;
}
+static struct mptcp_subflow_context *
+__mptcp_first_ready_from(struct mptcp_sock *msk,
+ struct mptcp_subflow_context *subflow)
+{
+ struct mptcp_subflow_context *start_subflow = subflow;
+
+ while (!READ_ONCE(subflow->data_avail)) {
+ subflow = mptcp_next_subflow(msk, subflow);
+ if (subflow == start_subflow)
+ return NULL;
+ }
+ return subflow;
+}
+
static bool __mptcp_move_skbs(struct sock *sk)
{
struct mptcp_subflow_context *subflow;
struct mptcp_sock *msk = mptcp_sk(sk);
- unsigned int moved = 0;
- bool ret, done;
+ bool ret = false;
+
+ if (list_empty(&msk->conn_list))
+ return false;
/* verify we can move any data from the subflow, eventually updating */
if (!(sk->sk_userlocks & SOCK_RCVBUF_LOCK))
mptcp_for_each_subflow(msk, subflow)
__mptcp_rcvbuf_update(sk, subflow->tcp_sock);
- if (sk_rmem_alloc_get(sk) > sk->sk_rcvbuf)
- return false;
-
- do {
- struct sock *ssk = mptcp_subflow_recv_lookup(msk);
+ subflow = list_first_entry(&msk->conn_list,
+ struct mptcp_subflow_context, node);
+ for (;;) {
+ struct sock *ssk;
bool slowpath;
- if (unlikely(!ssk))
+ /*
+ * As an optimization avoid traversing the subflows list
+ * and ev. acquiring the subflow socket lock before baling out
+ */
+ if (sk_rmem_alloc_get(sk) > sk->sk_rcvbuf)
break;
- slowpath = lock_sock_fast(ssk);
- done = __mptcp_move_skbs_from_subflow(msk, ssk, &moved);
+ subflow = __mptcp_first_ready_from(msk, subflow);
+ if (!subflow)
+ break;
+ ssk = mptcp_subflow_tcp_sock(subflow);
+ slowpath = lock_sock_fast(ssk);
+ ret = __mptcp_move_skbs_from_subflow(msk, ssk) || ret;
if (unlikely(ssk->sk_err))
__mptcp_error_report(sk);
unlock_sock_fast(ssk, slowpath);
- } while (!done);
- ret = moved > 0 || __mptcp_ofo_queue(msk);
+ subflow = mptcp_next_subflow(msk, subflow);
+ }
+
+ __mptcp_ofo_queue(msk);
if (ret)
mptcp_check_data_fin((struct sock *)msk);
return ret;
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index a1a077bae7b6ec4fab5b266e2613acb145eb343f..ca65f8bff632ff806fe761f86e9aa065b0657d1e 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -354,6 +354,8 @@ struct mptcp_sock {
list_for_each_entry(__subflow, &((__msk)->conn_list), node)
#define mptcp_for_each_subflow_safe(__msk, __subflow, __tmp) \
list_for_each_entry_safe(__subflow, __tmp, &((__msk)->conn_list), node)
+#define mptcp_next_subflow(__msk, __subflow) \
+ list_next_entry_circular(__subflow, &((__msk)->conn_list), node)
extern struct genl_family mptcp_genl_family;
--
2.47.1
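The scan described in this patch can be sketched outside the kernel as follows. This is a simplified model, not the kernel code: `struct subflow` is a hypothetical stand-in for `struct mptcp_subflow_context`, the ring is a hand-rolled circular list rather than `list_next_entry_circular()`, locking is omitted, and each ready subflow is drained in one step instead of skb by skb. It shows the two stop conditions: a full lap with no ready subflow, or the receive-memory budget being exceeded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a subflow: a node in a circular ring. */
struct subflow {
	struct subflow *next;	/* wraps around to the first subflow */
	bool data_avail;	/* true when data is pending */
	int pending;		/* pending bytes (illustrative only) */
};

/* Return the first ready subflow at or after 'start', or NULL once a
 * full lap found nothing ready.
 */
static struct subflow *first_ready_from(struct subflow *start)
{
	struct subflow *cur = start;

	assert(start != NULL);
	while (!cur->data_avail) {
		cur = cur->next;
		if (cur == start)
			return NULL;
	}
	return cur;
}

/* Drain ready subflows round-robin, resuming after the last processed
 * one. Stops when nothing is ready or when the memory already charged
 * ('*rmem') exceeds 'rcvbuf'. Returns true if any data was moved.
 */
static bool move_all(struct subflow *first, int *rmem, int rcvbuf)
{
	struct subflow *cur = first;
	bool moved = false;

	if (!first)
		return false;

	for (;;) {
		/* Check the budget first, before walking the ring. */
		if (*rmem > rcvbuf)
			break;

		cur = first_ready_from(cur);
		if (!cur)
			break;

		*rmem += cur->pending;
		moved = moved || cur->pending > 0;
		cur->pending = 0;
		cur->data_avail = false;

		/* Resume the next scan after the subflow just processed. */
		cur = cur->next;
	}
	return moved;
}
```

Each pass either breaks out or clears one `data_avail` flag, so the loop terminates after at most one pass per ready subflow in this model.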
* Re: [PATCH net-next 0/7] mptcp: rx path refactor
2025-02-18 18:36 [PATCH net-next 0/7] mptcp: rx path refactor Matthieu Baerts (NGI0)
` (6 preceding siblings ...)
2025-02-18 18:36 ` [PATCH net-next 7/7] mptcp: micro-optimize __mptcp_move_skb() Matthieu Baerts (NGI0)
@ 2025-02-20 3:10 ` patchwork-bot+netdevbpf
7 siblings, 0 replies; 9+ messages in thread
From: patchwork-bot+netdevbpf @ 2025-02-20 3:10 UTC (permalink / raw)
To: Matthieu Baerts
Cc: mptcp, martineau, geliang, davem, edumazet, kuba, pabeni, horms,
kuniyu, willemb, dsahern, jhs, xiyou.wangcong, jiri, netdev,
linux-kernel
Hello:
This series was applied to netdev/net-next.git (main)
by Jakub Kicinski <kuba@kernel.org>:
On Tue, 18 Feb 2025 19:36:11 +0100 you wrote:
> Paolo worked on this RX path refactor for these two main reasons:
>
> - Currently, the MPTCP RX path introduces quite a bit of 'exceptional'
> accounting/locking processing WRT to plain TCP, adding up to the
> implementation complexity in a miserable way.
>
> - The performance gap WRT plain TCP for single subflow connections is
> quite measurable.
>
> [...]
Here is the summary with links:
- [net-next,1/7] mptcp: consolidate subflow cleanup
https://git.kernel.org/netdev/net-next/c/c3349a22c200
- [net-next,2/7] mptcp: drop __mptcp_fastopen_gen_msk_ackseq()
https://git.kernel.org/netdev/net-next/c/f03afb3aeb9d
- [net-next,3/7] mptcp: move the whole rx path under msk socket lock protection
https://git.kernel.org/netdev/net-next/c/bc68b0efa1bf
- [net-next,4/7] mptcp: cleanup mem accounting
https://git.kernel.org/netdev/net-next/c/6639498ed85f
- [net-next,5/7] net: dismiss sk_forward_alloc_get()
https://git.kernel.org/netdev/net-next/c/c8802ded4658
- [net-next,6/7] mptcp: dismiss __mptcp_rmem()
https://git.kernel.org/netdev/net-next/c/51fe9cb9213e
- [net-next,7/7] mptcp: micro-optimize __mptcp_move_skb()
https://git.kernel.org/netdev/net-next/c/e0ca4057e0ec
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html