Date: Tue, 7 Apr 2026 16:58:27 +0100
From: Lee Jones
To: Kuniyuki Iwashima, stable@vger.kernel.org
Cc: "David S. Miller", Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Kuniyuki Iwashima, Linus Torvalds,
	netdev@vger.kernel.org, Igor Ushakov
Subject: Re: [PATCH v3 net] af_unix: Give up GC if MSG_PEEK intervened.
Message-ID: <20260407155827.GA1993342@google.com>
References: <20260311054043.1231316-1-kuniyu@google.com>
In-Reply-To: <20260311054043.1231316-1-kuniyu@google.com>

INTENTIONAL TOP POST

I note that this was not sent to Stable, but it should be included please.

> Igor Ushakov reported that GC purged the receive queue of
> an alive socket due to a race with MSG_PEEK with a nice repro.
>
> This is the exact same issue previously fixed by commit
> cbcf01128d0a ("af_unix: fix garbage collect vs MSG_PEEK").
>
> After GC was replaced with the current algorithm, the cited
> commit removed the locking dance in unix_peek_fds() and
> reintroduced the same issue.
>
> The problem is that MSG_PEEK bumps a file refcount without
> interacting with GC.
>
> Consider an SCC containing sk-A and sk-B, where sk-A is
> close()d but can be recv()ed via sk-B.
>
> The bad thing happens if sk-A is recv()ed with MSG_PEEK from
> sk-B and sk-B is close()d while GC is checking unix_vertex_dead()
> for sk-A and sk-B.
>
>   GC thread                   User thread
>   ---------                   -----------
>   unix_vertex_dead(sk-A)
>   -> true  <------.
>                    \
>                     `------ recv(sk-B, MSG_PEEK)
>        invalidate !!        -> sk-A's file refcount : 1 -> 2
>
>                             close(sk-B)
>                             -> sk-B's file refcount : 2 -> 1
>   unix_vertex_dead(sk-B)
>   -> true
>
> Initially, sk-A's file refcount is 1 by the inflight fd in sk-B
> recvq. GC thinks sk-A is dead because the file refcount is the
> same as the number of its inflight fds.
>
> However, sk-A's file refcount is bumped silently by MSG_PEEK,
> which invalidates the previous evaluation.
>
> At this moment, sk-B's file refcount is 2; one by the open fd,
> and one by the inflight fd in sk-A. The subsequent close()
> releases one refcount by the former.
>
> Finally, GC incorrectly concludes that both sk-A and sk-B are dead.
>
> One option is to restore the locking dance in unix_peek_fds(),
> but we can resolve this more elegantly thanks to the new algorithm.
>
> The point is that the issue does not occur without the subsequent
> close() and we actually do not need to synchronise MSG_PEEK with
> the dead SCC detection.
>
> When the issue occurs, close() and GC touch the same file refcount.
> If GC sees the refcount being decremented by close(), it can just
> give up garbage-collecting the SCC.
>
> Therefore, we only need to signal the race during MSG_PEEK with
> a proper memory barrier to make it visible to the GC.
>
> Let's use seqcount_t to notify GC when MSG_PEEK occurs and let
> it defer the SCC to the next run.
>
> This way no locking is needed on the MSG_PEEK side, and we can
> avoid imposing a penalty on every MSG_PEEK unnecessarily.
>
> Note that we can retry within unix_scc_dead() if MSG_PEEK is
> detected, but we do not do so to avoid hung task splat from
> abusive MSG_PEEK calls.
>
> Fixes: 118f457da9ed ("af_unix: Remove lock dance in unix_peek_fds().")
> Reported-by: Igor Ushakov
> Signed-off-by: Kuniyuki Iwashima
> ---
> v3: Check !fpl and add spinlock in unix_peek_fpl()
> v2: https://lore.kernel.org/all/20260309185823.3502204-1-kuniyu@google.com/
>   * Use seqcount_t for proper memory barrier
> v1: https://lore.kernel.org/netdev/20260308030406.1825938-1-kuniyu@google.com/
> ---
>  net/unix/af_unix.c |  2 ++
>  net/unix/af_unix.h |  1 +
>  net/unix/garbage.c | 79 ++++++++++++++++++++++++++++++----------------
>  3 files changed, 54 insertions(+), 28 deletions(-)
>
> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
> index 7eaa5b187fef..b23c33df8b46 100644
> --- a/net/unix/af_unix.c
> +++ b/net/unix/af_unix.c
> @@ -1958,6 +1958,8 @@ static void unix_detach_fds(struct scm_cookie *scm, struct sk_buff *skb)
>  static void unix_peek_fds(struct scm_cookie *scm, struct sk_buff *skb)
>  {
>  	scm->fp = scm_fp_dup(UNIXCB(skb).fp);
> +
> +	unix_peek_fpl(scm->fp);
>  }
>  
>  static void unix_destruct_scm(struct sk_buff *skb)
> diff --git a/net/unix/af_unix.h b/net/unix/af_unix.h
> index c4f1b2da363d..8119dbeef3a3 100644
> --- a/net/unix/af_unix.h
> +++ b/net/unix/af_unix.h
> @@ -29,6 +29,7 @@ void unix_del_edges(struct scm_fp_list *fpl);
>  void unix_update_edges(struct unix_sock *receiver);
>  int unix_prepare_fpl(struct scm_fp_list *fpl);
>  void unix_destroy_fpl(struct scm_fp_list *fpl);
> +void unix_peek_fpl(struct scm_fp_list *fpl);
>  void unix_schedule_gc(struct user_struct *user);
>  
>  /* SOCK_DIAG */
> diff --git a/net/unix/garbage.c b/net/unix/garbage.c
> index 816e8fa2b062..a7967a345827 100644
> --- a/net/unix/garbage.c
> +++ b/net/unix/garbage.c
> @@ -318,6 +318,25 @@ void unix_destroy_fpl(struct scm_fp_list *fpl)
>  	unix_free_vertices(fpl);
>  }
>  
> +static bool gc_in_progress;
> +static seqcount_t unix_peek_seq = SEQCNT_ZERO(unix_peek_seq);
> +
> +void unix_peek_fpl(struct scm_fp_list *fpl)
> +{
> +	static DEFINE_SPINLOCK(unix_peek_lock);
> +
> +	if (!fpl || !fpl->count_unix)
> +		return;
> +
> +	if (!READ_ONCE(gc_in_progress))
> +		return;
> +
> +	/* Invalidate the final refcnt check in unix_vertex_dead(). */
> +	spin_lock(&unix_peek_lock);
> +	raw_write_seqcount_barrier(&unix_peek_seq);
> +	spin_unlock(&unix_peek_lock);
> +}
> +
>  static bool unix_vertex_dead(struct unix_vertex *vertex)
>  {
>  	struct unix_edge *edge;
> @@ -351,6 +370,36 @@ static bool unix_vertex_dead(struct unix_vertex *vertex)
>  	return true;
>  }
>  
> +static LIST_HEAD(unix_visited_vertices);
> +static unsigned long unix_vertex_grouped_index = UNIX_VERTEX_INDEX_MARK2;
> +
> +static bool unix_scc_dead(struct list_head *scc, bool fast)
> +{
> +	struct unix_vertex *vertex;
> +	bool scc_dead = true;
> +	unsigned int seq;
> +
> +	seq = read_seqcount_begin(&unix_peek_seq);
> +
> +	list_for_each_entry_reverse(vertex, scc, scc_entry) {
> +		/* Don't restart DFS from this vertex. */
> +		list_move_tail(&vertex->entry, &unix_visited_vertices);
> +
> +		/* Mark vertex as off-stack for __unix_walk_scc(). */
> +		if (!fast)
> +			vertex->index = unix_vertex_grouped_index;
> +
> +		if (scc_dead)
> +			scc_dead = unix_vertex_dead(vertex);
> +	}
> +
> +	/* If MSG_PEEK intervened, defer this SCC to the next round. */
> +	if (read_seqcount_retry(&unix_peek_seq, seq))
> +		return false;
> +
> +	return scc_dead;
> +}
> +
>  static void unix_collect_skb(struct list_head *scc, struct sk_buff_head *hitlist)
>  {
>  	struct unix_vertex *vertex;
> @@ -404,9 +453,6 @@ static bool unix_scc_cyclic(struct list_head *scc)
>  	return false;
>  }
>  
> -static LIST_HEAD(unix_visited_vertices);
> -static unsigned long unix_vertex_grouped_index = UNIX_VERTEX_INDEX_MARK2;
> -
>  static unsigned long __unix_walk_scc(struct unix_vertex *vertex,
>  				     unsigned long *last_index,
>  				     struct sk_buff_head *hitlist)
> @@ -474,9 +520,7 @@ static unsigned long __unix_walk_scc(struct unix_vertex *vertex,
>  	}
>  
>  	if (vertex->index == vertex->scc_index) {
> -		struct unix_vertex *v;
>  		struct list_head scc;
> -		bool scc_dead = true;
>  
>  		/* SCC finalised.
>  		 *
> @@ -485,18 +529,7 @@ static unsigned long __unix_walk_scc(struct unix_vertex *vertex,
>  		 */
>  		__list_cut_position(&scc, &vertex_stack, &vertex->scc_entry);
>  
> -		list_for_each_entry_reverse(v, &scc, scc_entry) {
> -			/* Don't restart DFS from this vertex in unix_walk_scc(). */
> -			list_move_tail(&v->entry, &unix_visited_vertices);
> -
> -			/* Mark vertex as off-stack. */
> -			v->index = unix_vertex_grouped_index;
> -
> -			if (scc_dead)
> -				scc_dead = unix_vertex_dead(v);
> -		}
> -
> -		if (scc_dead) {
> +		if (unix_scc_dead(&scc, false)) {
>  			unix_collect_skb(&scc, hitlist);
>  		} else {
>  			if (unix_vertex_max_scc_index < vertex->scc_index)
> @@ -550,19 +583,11 @@ static void unix_walk_scc_fast(struct sk_buff_head *hitlist)
>  	while (!list_empty(&unix_unvisited_vertices)) {
>  		struct unix_vertex *vertex;
>  		struct list_head scc;
> -		bool scc_dead = true;
>  
>  		vertex = list_first_entry(&unix_unvisited_vertices, typeof(*vertex), entry);
>  		list_add(&scc, &vertex->scc_entry);
>  
> -		list_for_each_entry_reverse(vertex, &scc, scc_entry) {
> -			list_move_tail(&vertex->entry, &unix_visited_vertices);
> -
> -			if (scc_dead)
> -				scc_dead = unix_vertex_dead(vertex);
> -		}
> -
> -		if (scc_dead) {
> +		if (unix_scc_dead(&scc, true)) {
>  			cyclic_sccs--;
>  			unix_collect_skb(&scc, hitlist);
>  		}
> @@ -577,8 +602,6 @@ static void unix_walk_scc_fast(struct sk_buff_head *hitlist)
>  		cyclic_sccs ? UNIX_GRAPH_CYCLIC : UNIX_GRAPH_NOT_CYCLIC);
>  }
>  
> -static bool gc_in_progress;
> -
>  static void unix_gc(struct work_struct *work)
>  {
>  	struct sk_buff_head hitlist;
> -- 
> 2.53.0.473.g4a7958ca14-goog
>

-- 
Lee Jones [李琼斯]