public inbox for mptcp@lists.linux.dev
From: gang.yan@linux.dev
To: "Paolo Abeni" <pabeni@redhat.com>, mptcp@lists.linux.dev
Subject: Re: [PATCH mptcp-net v5 1/5] mptcp: replace backlog_list with backlog_queue
Date: Wed, 15 Apr 2026 08:21:04 +0000	[thread overview]
Message-ID: <ef0ce8465f0b852d5e4dc5e54031a156f606b636@linux.dev> (raw)
In-Reply-To: <d34641c0-01af-4273-83d7-5e7ed92e8da7@redhat.com>

April 15, 2026 at 3:17 PM, "Paolo Abeni" <pabeni@redhat.com> wrote:


> 
> On 4/1/26 10:12 PM, Paolo Abeni wrote:
> 
> > 
> > On 4/1/26 10:54 AM, Gang Yan wrote:
> > 
> > > 
> > > @@ -2210,12 +2260,12 @@ static bool mptcp_can_spool_backlog(struct sock *sk, struct list_head *skbs)
> > >  mem_cgroup_from_sk(sk));
> > >  
> > >  /* Don't spool the backlog if the rcvbuf is full. */
> > >  - if (list_empty(&msk->backlog_list) ||
> > >  + if (RB_EMPTY_ROOT(&msk->backlog_queue) ||
> > >  sk_rmem_alloc_get(sk) > sk->sk_rcvbuf)
> > >  return false;
> > >  
> > >  INIT_LIST_HEAD(skbs);
> > >  - list_splice_init(&msk->backlog_list, skbs);
> > >  + mptcp_backlog_queue_to_list(msk, skbs);
> > > 
> >  
> >  This is under a spinlock with BH disabled, and N is potentially quite
> >  high. This code is problematic as it could cause stalls and delay BH
> >  for a significant amount of time.
> >  
> >  Note that there are more chunks later and in later patches with similar
> >  problem.
> >  
> >  I don't have a good solution off the top of my head; I'll try to go
> >  back to the design table.
> > 
> I had some time to take a better look at the whole problem and how it
> manifests itself in the included self-tests, including the initial one
> using sendfile().
> 
> AFAICS the stall in the self-tests in patch 5/5 is caused by the sysctl
> setting taking effect on the server side _after_ the 3whs has
> negotiated the initial window; the rcvbuf suddenly shrinks from ~128K to
> 4K and almost every incoming packet is dropped.
> 
> The test itself is really an extreme condition; we should accept any
> implementation able to complete the transfer - even at very low speed.
> 
> The initial test-case, the one using sendfile(), operates in a
> significantly different way: it generates 1-byte len DSS preventing
> coalescing (I've not yet understood why coalescing does not happen),
> which causes an extremely bad skb->truesize/skb->len ratio, which in
> turn causes the initial window to be way too "optimistic", extreme
> rcvbuf squeeze at runtime, and a behavior similar to the previous one.
> 
> In both cases simply dropping incoming packets early/in
> mptcp_incoming_options() when the rcvbuf is full does not solve the
> issue: if the rcvbuf is used (mostly) by the OoO queue, retransmissions
> always hit the same rcvbuf condition and are also dropped.
> 
> The root cause of both scenarios is that some very unlikely conditions
> call for retracting the announced rcv wnd, but MPTCP can't do that.
> 
> Currently I'm starting to think that we need a strategy similar to
> plain TCP to deal with such scenarios: when the rcvbuf is full we need
> to condense and eventually prune the OoO queue (see tcp_prune_queue(),
> tcp_collapse_ofo_queue(), tcp_collapse()).
> 
> The above has some serious downsides, i.e. it could lead to a large
> slice of almost-duplicate complex code, as it is difficult to abstract
> the MPTCP vs TCP differences (CB, seq numbers, drop reasons). Still
> under investigation.
> 
> /P
>
Hi Paolo,

Thanks a lot for your detailed and insightful analysis of this problem!

I fully agree with your points: MPTCP should allow the transfer to complete
even under extremely slow or harsh conditions, just as you mentioned.

Regarding the TCP-style mechanisms like 'tcp_prune_queue' for handling full
rcvbuf conditions: I have actually attempted similar implementations before.
As you pointed out, this approach is indeed highly complex for MPTCP. There
are far too many aspects that require careful modification and consideration,
making it extremely challenging to implement correctly.

I've also made some progress on this issue recently. I implemented a solution
based on the backlog list structure: when a transfer stall is detected, I
traverse the backlog_list and move only the skb whose 'map_seq == ack_seq'
into the rcv_queue to unblock the transfer.

This approach effectively prevents unbounded growth of rmem_alloc, and has nearly
no impact on performance. The core logic is roughly as follows:

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 17b9a8c13ebf..c733dd2aa85f 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -2189,6 +2189,18 @@ static void mptcp_rcv_space_adjust(struct mptcp_sock *msk, int copied)
        msk->rcvq_space.time = mstamp;
 }

+static bool mptcp_stop_spool_backlog(struct sock *sk)
+{
+       return sk_rmem_alloc_get(sk) > sk->sk_rcvbuf &&
+              !skb_queue_empty(&sk->sk_receive_queue);
+}
+
+static bool mptcp_recv_stall(struct sock *sk)
+{
+       return sk_rmem_alloc_get(sk) > sk->sk_rcvbuf &&
+              skb_queue_empty(&sk->sk_receive_queue);
+}
+
 static bool __mptcp_move_skbs(struct sock *sk, struct list_head *skbs, u32 *delta)
 {
        struct sk_buff *skb = list_first_entry(skbs, struct sk_buff, list);
@@ -2198,7 +2210,7 @@ static bool __mptcp_move_skbs(struct sock *sk, struct list_head *skbs, u32 *delt
        *delta = 0;
        while (1) {
                /* If the msk recvbuf is full stop, don't drop */
-               if (sk_rmem_alloc_get(sk) > sk->sk_rcvbuf)
+               if (mptcp_stop_spool_backlog(sk))
                        break;

                prefetch(skb->next);
@@ -2228,12 +2240,32 @@ static bool mptcp_can_spool_backlog(struct sock *sk, struct list_head *skbs)
        DEBUG_NET_WARN_ON_ONCE(msk->backlog_unaccounted && sk->sk_socket &&
                               mem_cgroup_from_sk(sk));

-       /* Don't spool the backlog if the rcvbuf is full. */
+       /* Don't spool the backlog if the rcvbuf is full and the
+        * rcv_queue is not empty.
+        */
        if (list_empty(&msk->backlog_list) ||
-           sk_rmem_alloc_get(sk) > sk->sk_rcvbuf)
+           mptcp_stop_spool_backlog(sk))
                return false;

        INIT_LIST_HEAD(skbs);
+
+       /* If the rcvbuf is full and the rcv_queue is empty, the transmission
+        * will stall. To prevent this, move the ack_seq skb from the
+        * backlog_list to the receive queue.
+        */
+       if (mptcp_recv_stall(sk)) {
+               struct sk_buff *iter, *tmp;
+
+               list_for_each_entry_safe(iter, tmp, &msk->backlog_list, list) {
+                       if (MPTCP_SKB_CB(iter)->map_seq == msk->ack_seq) {
+                               list_del(&iter->list);
+                               list_add(&iter->list, skbs);
+                               break;
+                       }
+               }
+               return !list_empty(skbs);
+       }
+
        list_splice_init(&msk->backlog_list, skbs);
        return true;
 }

I would really appreciate it if you could share your thoughts on this approach when
you have time.

Cheers,
Gang


Thread overview: 18+ messages
2026-04-01  8:54 [PATCH mptcp-net v5 0/5] mptcp: fix stall because of data_ready Gang Yan
2026-04-01  8:54 ` [PATCH mptcp-net v5 1/5] mptcp: replace backlog_list with backlog_queue Gang Yan
2026-04-01 20:12   ` Paolo Abeni
2026-04-15  7:17     ` Paolo Abeni
2026-04-15  8:21       ` gang.yan [this message]
2026-04-20  9:34         ` Paolo Abeni
2026-04-20  9:41           ` gang.yan
2026-04-20 10:33             ` Paolo Abeni
2026-04-21  1:26               ` gang.yan
2026-04-01  8:54 ` [PATCH mptcp-net v5 2/5] mptcp: fix the stall problems using backlog_queue Gang Yan
2026-04-01  8:54 ` [PATCH mptcp-net v5 3/5] mptcp: fix the stall problems with data_ready Gang Yan
2026-04-01  8:54 ` [PATCH mptcp-net v5 4/5] mptcp: fix the dead_lock in mptcp_data_ready Gang Yan
2026-04-01 20:02   ` Paolo Abeni
     [not found]     ` <e1f58d941ad141f9f96a24d1a1d9d6ca74cc2f5e@linux.dev>
2026-04-03  6:45       ` Fwd: " gang.yan
2026-04-13 15:49         ` Paolo Abeni
2026-04-01  8:54 ` [PATCH mptcp-net v5 5/5] selftests: mptcp: test transmission with small rcvbuf Gang Yan
2026-04-03  1:17   ` Geliang Tang
2026-04-03  8:52     ` Matthieu Baerts
