Netdev List
 help / color / mirror / Atom feed
From: Stefano Brivio <sbrivio@redhat.com>
To: Andrei Vagin <avagin@google.com>
Cc: Eric Dumazet <edumazet@google.com>,
	Kuniyuki Iwashima <kuniyu@google.com>,
	"David S. Miller" <davem@davemloft.net>,
	Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
	Pavel Emelyanov <xemul@scylladb.com>,
	Laurent Vivier <lvivier@redhat.com>,
	Jon Maloy <jmaloy@redhat.com>, Dmitry Safonov <dima@arista.com>,
	netdev@vger.kernel.org, linux-kselftest@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	Neal Cardwell <ncardwell@google.com>,
	Simon Horman <horms@kernel.org>, Shuah Khan <shuah@kernel.org>,
	David Gibson <david@gibson.dropbear.id.au>
Subject: Re: [PATCH net 1/2] tcp: Don't accept data when socket is in repair mode
Date: Tue, 19 May 2026 18:11:58 +0200 (CEST)	[thread overview]
Message-ID: <20260519181157.1b0605f8@elisabeth> (raw)
In-Reply-To: <CAEWA0a5AG6JmeybHcxU+EL_Ohi0uVnSaVCiz7Bp_oDVuEdtO=g@mail.gmail.com>

On Tue, 19 May 2026 08:36:27 -0700
Andrei Vagin <avagin@google.com> wrote:

> On Mon, May 18, 2026 at 4:34 AM Stefano Brivio <sbrivio@redhat.com> wrote:
> >
> > On Mon, 18 May 2026 00:57:16 -0700
> > Eric Dumazet <edumazet@google.com> wrote:
> >  
> > > On Sun, May 17, 2026 at 12:53 PM Stefano Brivio <sbrivio@redhat.com> wrote:  
> > > >
> > > > On Sun, 17 May 2026 12:05:45 -0700
> > > > Kuniyuki Iwashima <kuniyu@google.com> wrote:
> > > >  
> > > > > On Sun, May 17, 2026 at 11:41 AM Stefano Brivio <sbrivio@redhat.com> wrote:  
> > > > > >
> > > > > > Once a socket enters repair mode (TCP_REPAIR socket option with
> > > > > > TCP_REPAIR_ON value), it's possible to dump the receive sequence
> > > > > > number (TCP_QUEUE_SEQ) and the contents of the receive queue itself
> > > > > > (using TCP_REPAIR_QUEUE to select it).
> > > > > >
> > > > > > If we receive data after the application fetched the sequence number
> > > > > > or saved the contents of the queue, though, the application will now
> > > > > > have outdated information, which defeats the whole functionality,
> > > > > > because this leads to gaps in sequence and data once they're restored
> > > > > > by the target instance of the application, resulting in a hanging or
> > > > > > otherwise non-functional TCP connection.
> > > > > >
> > > > > > This type of race condition was discovered in the KubeVirt integration
> > > > > > of passt(1), using a remote iperf3 client connected to an iperf3
> > > > > > server running in the guest which is being migrated. The setup allows
> > > > > > traffic to reach the origin node hosting the guest during the
> > > > > > migration.
> > > > > >
> > > > > > If passt dumps sequence number and contents of the queue *before*
> > > > > > further data is received and acknowledged to the peer by the kernel,
> > > > > > once the TCP data connection is migrated to the target node, the
> > > > > > remote client becomes unable to continue sending, because a portion
> > > > > > of the data it sent *and received an acknowledgement for* is now lost.
> > > > > >
> > > > > > Schematically:
> > > > > >
> > > > > > 1. client --seq 1:100--> origin host --> passt --> guest --> server
> > > > > >
> > > > > > 2. client <--ACK: 100-- origin host
> > > > > >
> > > > > > 3. migration starts,  
> > > > >
> > > > > Here, a netfilter rule or bpf prog must be installed to
> > > > > drop packets temporarily until migration completes.  
> > > >
> > > > Thanks for the review.
> > > >
> > > > I have to say it's rather unexpected to have to work around obvious
> > > > kernel issues in userspace: TCP_REPAIR implies that queues are frozen,
> > > > and this is handled correctly on the sending side (see
> > > > tcp_write_xmit()), but was clearly forgotten on the receiving side.  
> > >
> > > Have you contacted TCP repair authors for best practices?  
> >
> > I Cc'ed them here, Pavel is the author (but as far as I understand not
> > active in kernel development anymore), and I know that Andrei did some
> > substantial work on it in the past, so he's Cc'ed here as well.
> >
> > But we are following what CRIU (the userspace reference implementation)
> > does, and CRIU would be affected by this issue as well (depending on
> > usage).  
> 
> Before extracting the socket state, CRIU uses netfilter (iptables or
> nftables) to block all traffic associated with the specific TCP
> connection or, in the case of a container, the entire network namespace.

...by default, yes, by "depending on usage" I actually meant
'--network-lock skip', but I have to admit I'm not sure how commonly
that is used.

> This approach provides two main benefits:
> 
> During the dump, we don't need to leave all sockets in repair mode for
> the entire duration of the dump.  We enable and disable repair mode just
> to grab the state. It's simplify a roll back if something goes wrong.

Thanks for clarifying this aspect, I wasn't sure whether it was done to
make rollback easier.

That wouldn't really simplify things with passt in case of rollback
because, as we anyway keep track of flows internally (we implement
them, after all), we want to keep those sockets (frozen) in repair mode
until we're able to process traffic again (something CRIU doesn't need
to do), and explicitly take them out of repair mode at a rather precise
point in time.

> During restoration, it ensures that the destination kernel will not
> process any packets until the socket is fully reconstructed. This
> prevents the kernel from sending a Reset (RST) or getting out of sync
> before the connection is ready to be resumed.

Right, we don't have that issue in passt (anymore) because we keep
sockets open on the origin node anyway, and we block migration until
we're done restoring connection sockets and listening sockets on the
target.

> I haven't looked at the patch yet, but I have no objections to the idea
> itself.

Thanks for the explanation and for having a look! For passt (and
KubeVirt) that would save a lot of complexity, plus the security
aspects myself and David raised.

And, besides that and the usage CRIU and passt make of it, I still think
that allowing data to be queued while somebody can dump the queue is a
fundamental race condition.

-- 
Stefano


  reply	other threads:[~2026-05-19 16:12 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-17 18:41 [PATCH net 0/2] Fix race condition between TCP_REPAIR dump and data receive Stefano Brivio
2026-05-17 18:41 ` [PATCH net 1/2] tcp: Don't accept data when socket is in repair mode Stefano Brivio
2026-05-17 19:05   ` Kuniyuki Iwashima
2026-05-17 19:53     ` Stefano Brivio
2026-05-17 21:01       ` Kuniyuki Iwashima
2026-05-18  5:32         ` David Gibson
2026-05-18  7:57       ` Eric Dumazet
2026-05-18 11:28         ` Stefano Brivio
2026-05-19 15:36           ` Andrei Vagin
2026-05-19 16:11             ` Stefano Brivio [this message]
2026-05-19 18:42               ` Andrei Vagin
2026-05-17 18:41 ` [PATCH net 2/2] selftests: Add data path tests for TCP_REPAIR mode Stefano Brivio

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260519181157.1b0605f8@elisabeth \
    --to=sbrivio@redhat.com \
    --cc=avagin@google.com \
    --cc=davem@davemloft.net \
    --cc=david@gibson.dropbear.id.au \
    --cc=dima@arista.com \
    --cc=edumazet@google.com \
    --cc=horms@kernel.org \
    --cc=jmaloy@redhat.com \
    --cc=kuba@kernel.org \
    --cc=kuniyu@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-kselftest@vger.kernel.org \
    --cc=lvivier@redhat.com \
    --cc=ncardwell@google.com \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    --cc=shuah@kernel.org \
    --cc=xemul@scylladb.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox