From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from lindbergh.monkeyblade.net (lindbergh.monkeyblade.net [23.128.96.19]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id AA6453FDF for ; Tue, 6 Jun 2023 07:07:47 +0000 (UTC) Received: from mail-wm1-x334.google.com (mail-wm1-x334.google.com [IPv6:2a00:1450:4864:20::334]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id D6C3DE47 for ; Tue, 6 Jun 2023 00:07:44 -0700 (PDT) Received: by mail-wm1-x334.google.com with SMTP id 5b1f17b1804b1-3f7e4953107so29615e9.1 for ; Tue, 06 Jun 2023 00:07:44 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20221208; t=1686035263; x=1688627263; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=Mh/Q9ExSH8PH6rQafVFdrgOp1kcbGrC6Sma+EsrSOgA=; b=DYJEmx4a/tObJALEFG7EuEAwywJTGI6PVfoAUiPhzhBjyCopsRlgJAMUIrOBNrWuv2 A+XlsU1IvBStPTd/he2Qxd7KBqNwmvt+VVoa81GzKgtWg7KPrHh0q2EkBy7F1ABhUADo /k9jv/syrralrl0dOTe/zjzrHSGeFQpd99tRfnEw8ADGVuq9MbAde6IF1EXtR/8QRcn9 7N663ST82p7mgCLA4Bmd96NTJRiOVb4jHZ5pS5d4io1AeuYsL29riVxcnLI8OfGhRtJB WjAaWDFzIwQx69+0lROTDV0NanoWI3Coc3/6CkYHvir//FDx6UaeAKoIzxHZWQSDvX/9 H7SQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1686035263; x=1688627263; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=Mh/Q9ExSH8PH6rQafVFdrgOp1kcbGrC6Sma+EsrSOgA=; b=G8wk5qQXtcR0cx7Bgt0cbWTbkep4w8cjmGa3MhannLBH3vZyNiFI3gHY+kg238KYbj I487HX67wnHu7AORDJq7288AM0hMS7mWZLqJuKjjDALjqYkqi/cdVd2ZPf2s2rWzxxFv r8dA3E2tVC7Uuy8kHYtwjL/3eN/wG9cg4Ff2JLqA1uOFrdAc68HSWF2RZ/H4t5ZlyT1L Emw9GV3P1dGnilPrYjf+LcXvICTgGKlznbpcdoe4+AVq0GUe+Fqqqzl7TC6202VBbn65 8qdQDnnhVSW3AH4kj7RSykav600TmPadUMnSqs9cPbk9wD+7N7wFXq55eWVUjFVJeww9 CULA== X-Gm-Message-State: AC+VfDxV2CAUL6XKiBNGO9f6UtDzvtuuwUZEjqY6ThHopAlk+4dspg80 HgX4gXDU1Gips9/Gc0Gy9wRmLsMYDyiDtinYRnq87A== X-Google-Smtp-Source: ACHHUZ4C3vax/6KRqduuNBUvE/SN1iYEdb5YCsbljmN9bLu8N0lPogx1+KHVd03fzB07mISQjkS9fHSsxlxTPHAOeeg= X-Received: by 2002:a05:600c:1c16:b0:3f1:6fe9:4a95 with SMTP id j22-20020a05600c1c1600b003f16fe94a95mr137949wms.4.1686035263064; Tue, 06 Jun 2023 00:07:43 -0700 (PDT) Precedence: bulk X-Mailing-List: netdev@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 References: <20230606064306.9192-1-duanmuquan@baidu.com> In-Reply-To: <20230606064306.9192-1-duanmuquan@baidu.com> From: Eric Dumazet Date: Tue, 6 Jun 2023 09:07:30 +0200 Message-ID: Subject: Re: [PATCH v2] tcp: fix connection reset due to tw hashdance race. To: Duan Muquan Cc: davem@davemloft.net, dsahern@kernel.org, kuba@kernel.org, pabeni@redhat.com, netdev@vger.kernel.org, linux-kernel@vger.kernel.org Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Spam-Status: No, score=-17.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF, ENV_AND_HDR_SPF_MATCH,RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS, T_SCC_BODY_TEXT_LINE,USER_IN_DEF_DKIM_WL,USER_IN_DEF_SPF_WL autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net On Tue, Jun 6, 2023 at 8:43=E2=80=AFAM Duan Muquan w= rote: > > If the FIN from passive closer and the ACK for active closer's FIN are > processed on different CPUs concurrently, tw hashdance race may occur. > On loopback interface, transmit function queues a skb to current CPU's > softnet's input queue by default. Suppose active closer runs on CPU 0, > and passive closer runs on CPU 1. If the ACK for the active closer's > FIN is sent with no delay, it will be processed and tw hashdance will > be done on CPU 0; The passive closer's FIN will be sent in another > segment and processed on CPU 1, it may fail to find tw sock in the > ehash table due to tw hashdance on CPU 0, then get a RESET. > If application reconnects immediately with the same source port, it > will get reset because tw sock's tw_substate is still TCP_FIN_WAIT2. > > The dmesg to trace down this issue: > > .333516] tcp_send_fin: sk 0000000092105ad2 cookie 9 cpu 3 > .333524] rcv_state_process:FIN_WAIT2 sk 0000000092105ad2 cookie 9 cpu 3 > .333534] tcp_close: tcp_time_wait: sk 0000000092105ad2 cookie 9 cpu 3 > .333538] hashdance: tw 00000000690fdb7a added to ehash cookie 9 cpu 3 > .333541] hashdance: sk 0000000092105ad2 removed cookie 9 cpu 3 > .333544] __inet_lookup_established: Failed the refcount check: > !refcount_inc_not_zero 00000000690fdb7a ref 0 cookie 9 c= pu 0 > .333549] hashdance: tw 00000000690fdb7a before add ref 0 cookie 9 cpu 3 > .333552] rcv_state: RST for FIN listen 000000003c50afa6 cookie 0 cpu 0 > .333574] tcp_send_fin: sk 0000000066757bf8 ref 2 cookie 0 cpu 0 > .333611] timewait_state: TCP_TW_RST tw 00000000690fdb7a cookie 9 cpu 0 > .333626] tcp_connect: sk 0000000066757bf8 cpu 0 cookie 0 > > Here is the call trace map: > > CPU 0 CPU 1 > > -------- -------- > tcp_close() > tcp_send_fin() > loopback_xmit() > netif_rx() > tcp_v4_rcv() > tcp_ack_snd_check() > loopback_xmit > netif_rx() tcp_close() > ... tcp_send_fin() > = loopback_xmit() > = netif_rx() > = tcp_v4_rcv() > = ... > tcp_time_wait() > inet_twsk_hashdance() { > ... > <-__inet_lookup_established() > (bad case= (1), find sk, may fail tw_refcnt check) > inet_twsk_add_node_tail_rcu(tw, ...) > <-__inet_lookup_established() > (bad case= (1), find sk, may fail tw_refcnt check) > > __sk_nulls_del_node_init_rcu(sk) > <-__inet_lookup_established() > (bad case= (2), find tw, may fail tw_refcnt check) > refcount_set(&tw->tw_refcnt, 3) > <-__inet_lookup_established() > (good cas= e, find tw, tw_refcnt is not 0) > ... > } > > This issue occurs with a small probability on our application working > on loopback interface, client gets a connection refused error when it > reconnects. In reproducing environments on kernel 4.14,5.10 and > 6.4-rc1, modify tcp_ack_snd_check() to disable delay ack all the > time; Let client connect server and server sends a message to client > then close the connection; Repeat this process forever; Let the client > bind the same source port every time, it can be reproduced in about 20 > minutes. > > Brief of the scenario: > > 1. Server runs on CPU 0 and Client runs on CPU 1. Server closes > connection actively and sends a FIN to client. The lookback's driver > enqueues the FIN segment to backlog queue of CPU 0 via > loopback_xmit()->netif_rx(), one of the conditions for non-delay ack > meets in __tcp_ack_snd_check(), and the ACK is sent immediately. > > 2. On loopback interface, the ACK is received and processed on CPU 0, > the 'dance' from original sock to tw sock will perfrom, tw sock will > be inserted to ehash table, then the original sock will be removed. > > 3. On CPU 1, client closes the connection, a FIN segment is sent and > processed on CPU 1. When it is looking up sock in ehash table (with no > lock), tw hashdance race may occur, it fails to find the tw sock and > get a listener sock in the flowing 3 cases: > > (1) Original sock is found, but it has been destroyed and sk_refcnt > has become 0 when validating it. > (2) tw sock is found, but its tw_refcnt has not been set to 3, it is > still 0, validating for sk_refcnt will fail. > (3) For versions without Eric and Jason's commit(3f4ca5fafc08881d7a5 > 7daa20449d171f2887043), tw sock is added to the head of the lis= t. > It will be missed if the list is traversed before tw sock is > added. And if the original sock is removed before it is found, = no > established sock will be found. > > The listener sock will reset the FIN segment which has ack bit set. > > 4. If client reconnects immediately and is assigned with the same > source port as previous connection, the tw sock with tw_substate > TCP_FIN_WAIT2 will reset client's SYN and destroy itself in > inet_twsk_deschedule_put(). Application gets a connection refused > error. > > 5. If client reconnects again, it will succeed. > > Introduce the flowing 2 modifications to solve the above 3 bad cases: > > For bad case (2): > Set tw_refcnt to 3 before adding it into list. > > For bad case (1): > In function tcp_v4_rcv(), if __inet_lookup_skb() returns a listener > sock and this segment has FIN bit set, then retry the lookup process > one time. This fix can cover bad case (3) for the versions without > Eric and Jason's fix. > > There may be another bad case, if the original sock is found and passes > validation, but during further process for the passive closer's FIN on > CPU 1, the sock has been destroyed on CPU 0, then the FIN segment will > be dropped and retransmitted. This case does not hurt application as > much as resetting reconnection, and this case has less possibility than > the other bad cases, it does not occur on our product and in > experimental environment, so it is not considered in this patch. > > Could you please check whether this fix is OK, or any suggestions? > Looking forward for your precious comments! > > Signed-off-by: Duan Muquan > --- > net/ipv4/inet_timewait_sock.c | 15 +++++++-------- > net/ipv4/tcp_ipv4.c | 13 +++++++++++++ > 2 files changed, 20 insertions(+), 8 deletions(-) > > diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.= c > index 40052414c7c7..ed1f255c9aa8 100644 > --- a/net/ipv4/inet_timewait_sock.c > +++ b/net/ipv4/inet_timewait_sock.c > @@ -144,14 +144,6 @@ void inet_twsk_hashdance(struct inet_timewait_sock *= tw, struct sock *sk, > > spin_lock(lock); > > - inet_twsk_add_node_tail_rcu(tw, &ehead->chain); > - > - /* Step 3: Remove SK from hash chain */ > - if (__sk_nulls_del_node_init_rcu(sk)) > - sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1); > - > - spin_unlock(lock); > - > /* tw_refcnt is set to 3 because we have : > * - one reference for bhash chain. > * - one reference for ehash chain. > @@ -162,6 +154,13 @@ void inet_twsk_hashdance(struct inet_timewait_sock *= tw, struct sock *sk, > * so we are not allowed to use tw anymore. > */ > refcount_set(&tw->tw_refcnt, 3); > + inet_twsk_add_node_tail_rcu(tw, &ehead->chain); > + > + /* Step 3: Remove SK from hash chain */ > + if (__sk_nulls_del_node_init_rcu(sk)) > + sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1); > + > + spin_unlock(lock); > } > EXPORT_SYMBOL_GPL(inet_twsk_hashdance); > > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c > index 06d2573685ca..3e3cef202f76 100644 > --- a/net/ipv4/tcp_ipv4.c > +++ b/net/ipv4/tcp_ipv4.c > @@ -2018,6 +2018,19 @@ int tcp_v4_rcv(struct sk_buff *skb) > sk =3D __inet_lookup_skb(net->ipv4.tcp_death_row.hashinfo, > skb, __tcp_hdrlen(th), th->source, > th->dest, sdif, &refcounted); > + > + /* If tw "dance" is performed on another CPU, the lookup process = may find > + * no tw sock for the passive closer's FIN segment, but a listene= r sock, > + * which will reset the FIN segment. If application reconnects im= mediately > + * with the same source port, it will get reset because the tw so= ck's > + * tw_substate is still TCP_FIN_WAIT2. Try to get the tw sock in = another try. > + */ > + if (unlikely(th->fin && sk && sk->sk_state =3D=3D TCP_LISTEN)) { > + sk =3D __inet_lookup_skb(net->ipv4.tcp_death_row.hashinfo= , > + skb, __tcp_hdrlen(th), th->source, > + th->dest, sdif, &refcounted); > + } > + I do not think this fixes anything, there is no barrier between first and second lookup. This might reduce race a little bit, but at the expense of extra code in the fast path. If you want to fix this properly, I think we need to revisit handling for non established sockets, to hold the bucket spinlock over the whole thing. This will be slightly more complex than this...