Date: Tue, 7 Apr 2026 16:58:27 +0100
From: Lee Jones
To: Kuniyuki Iwashima, stable@vger.kernel.org
Cc: "David S. Miller", Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Kuniyuki Iwashima, Linus Torvalds,
	netdev@vger.kernel.org, Igor Ushakov
Subject: Re: [PATCH v3 net] af_unix: Give up GC if MSG_PEEK intervened.
Message-ID: <20260407155827.GA1993342@google.com>
References: <20260311054043.1231316-1-kuniyu@google.com>
In-Reply-To: <20260311054043.1231316-1-kuniyu@google.com>

INTENTIONAL TOP POST

I note that this was not sent to Stable, but it should be included please.

> Igor Ushakov reported that GC purged the receive queue of
> an alive socket due to a race with MSG_PEEK with a nice repro.
>
> This is the exact same issue previously fixed by commit
> cbcf01128d0a ("af_unix: fix garbage collect vs MSG_PEEK").
>
> After GC was replaced with the current algorithm, the cited
> commit removed the locking dance in unix_peek_fds() and
> reintroduced the same issue.
>
> The problem is that MSG_PEEK bumps a file refcount without
> interacting with GC.
>
> Consider an SCC containing sk-A and sk-B, where sk-A is
> close()d but can be recv()ed via sk-B.
>
> The bad thing happens if sk-A is recv()ed with MSG_PEEK from
> sk-B and sk-B is close()d while GC is checking unix_vertex_dead()
> for sk-A and sk-B.
>
>   GC thread                   User thread
>   ---------                   -----------
>   unix_vertex_dead(sk-A)
>   -> true  <------.
>                    \
>                     `------ recv(sk-B, MSG_PEEK)
>        invalidate !!        -> sk-A's file refcount : 1 -> 2
>
>                             close(sk-B)
>                             -> sk-B's file refcount : 2 -> 1
>   unix_vertex_dead(sk-B)
>   -> true
>
> Initially, sk-A's file refcount is 1 by the inflight fd in sk-B
> recvq. GC thinks sk-A is dead because the file refcount is the
> same as the number of its inflight fds.
>
> However, sk-A's file refcount is bumped silently by MSG_PEEK,
> which invalidates the previous evaluation.
>
> At this moment, sk-B's file refcount is 2; one by the open fd,
> and one by the inflight fd in sk-A. The subsequent close()
> releases one refcount by the former.
>
> Finally, GC incorrectly concludes that both sk-A and sk-B are dead.
>
> One option is to restore the locking dance in unix_peek_fds(),
> but we can resolve this more elegantly thanks to the new algorithm.
>
> The point is that the issue does not occur without the subsequent
> close() and we actually do not need to synchronise MSG_PEEK with
> the dead SCC detection.
>
> When the issue occurs, close() and GC touch the same file refcount.
> If GC sees the refcount being decremented by close(), it can just
> give up garbage-collecting the SCC.
>
> Therefore, we only need to signal the race during MSG_PEEK with
> a proper memory barrier to make it visible to the GC.
>
> Let's use seqcount_t to notify GC when MSG_PEEK occurs and let
> it defer the SCC to the next run.
>
> This way no locking is needed on the MSG_PEEK side, and we can
> avoid imposing a penalty on every MSG_PEEK unnecessarily.
>
> Note that we can retry within unix_scc_dead() if MSG_PEEK is
> detected, but we do not do so to avoid hung task splat from
> abusive MSG_PEEK calls.
>
> Fixes: 118f457da9ed ("af_unix: Remove lock dance in unix_peek_fds().")
> Reported-by: Igor Ushakov
> Signed-off-by: Kuniyuki Iwashima
> ---
> v3: Check !fpl and add spinlock in unix_peek_fpl()
> v2: https://lore.kernel.org/all/20260309185823.3502204-1-kuniyu@google.com/
>   * Use seqcount_t for proper memory barrier
> v1: https://lore.kernel.org/netdev/20260308030406.1825938-1-kuniyu@google.com/
> ---
>  net/unix/af_unix.c |  2 ++
>  net/unix/af_unix.h |  1 +
>  net/unix/garbage.c | 79 ++++++++++++++++++++++++++++++----------------
>  3 files changed, 54 insertions(+), 28 deletions(-)
>
> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
> index 7eaa5b187fef..b23c33df8b46 100644
> --- a/net/unix/af_unix.c
> +++ b/net/unix/af_unix.c
> @@ -1958,6 +1958,8 @@ static void unix_detach_fds(struct scm_cookie *scm, struct sk_buff *skb)
>  static void unix_peek_fds(struct scm_cookie *scm, struct sk_buff *skb)
>  {
>  	scm->fp = scm_fp_dup(UNIXCB(skb).fp);
> +
> +	unix_peek_fpl(scm->fp);
>  }
>  
>  static void unix_destruct_scm(struct sk_buff *skb)
> diff --git a/net/unix/af_unix.h b/net/unix/af_unix.h
> index c4f1b2da363d..8119dbeef3a3 100644
> --- a/net/unix/af_unix.h
> +++ b/net/unix/af_unix.h
> @@ -29,6 +29,7 @@ void unix_del_edges(struct scm_fp_list *fpl);
>  void unix_update_edges(struct unix_sock *receiver);
>  int unix_prepare_fpl(struct scm_fp_list *fpl);
>  void unix_destroy_fpl(struct scm_fp_list *fpl);
> +void unix_peek_fpl(struct scm_fp_list *fpl);
>  void unix_schedule_gc(struct user_struct *user);
>  
>  /* SOCK_DIAG */
> diff --git a/net/unix/garbage.c b/net/unix/garbage.c
> index 816e8fa2b062..a7967a345827 100644
> --- a/net/unix/garbage.c
> +++ b/net/unix/garbage.c
> @@ -318,6 +318,25 @@ void unix_destroy_fpl(struct scm_fp_list *fpl)
>  	unix_free_vertices(fpl);
>  }
>  
> +static bool gc_in_progress;
> +static seqcount_t unix_peek_seq = SEQCNT_ZERO(unix_peek_seq);
> +
> +void unix_peek_fpl(struct scm_fp_list *fpl)
> +{
> +	static DEFINE_SPINLOCK(unix_peek_lock);
> +
> +	if (!fpl || !fpl->count_unix)
> +		return;
> +
> +	if (!READ_ONCE(gc_in_progress))
> +		return;
> +
> +	/* Invalidate the final refcnt check in unix_vertex_dead(). */
> +	spin_lock(&unix_peek_lock);
> +	raw_write_seqcount_barrier(&unix_peek_seq);
> +	spin_unlock(&unix_peek_lock);
> +}
> +
>  static bool unix_vertex_dead(struct unix_vertex *vertex)
>  {
>  	struct unix_edge *edge;
> @@ -351,6 +370,36 @@ static bool unix_vertex_dead(struct unix_vertex *vertex)
>  	return true;
>  }
>  
> +static LIST_HEAD(unix_visited_vertices);
> +static unsigned long unix_vertex_grouped_index = UNIX_VERTEX_INDEX_MARK2;
> +
> +static bool unix_scc_dead(struct list_head *scc, bool fast)
> +{
> +	struct unix_vertex *vertex;
> +	bool scc_dead = true;
> +	unsigned int seq;
> +
> +	seq = read_seqcount_begin(&unix_peek_seq);
> +
> +	list_for_each_entry_reverse(vertex, scc, scc_entry) {
> +		/* Don't restart DFS from this vertex. */
> +		list_move_tail(&vertex->entry, &unix_visited_vertices);
> +
> +		/* Mark vertex as off-stack for __unix_walk_scc(). */
> +		if (!fast)
> +			vertex->index = unix_vertex_grouped_index;
> +
> +		if (scc_dead)
> +			scc_dead = unix_vertex_dead(vertex);
> +	}
> +
> +	/* If MSG_PEEK intervened, defer this SCC to the next round. */
> +	if (read_seqcount_retry(&unix_peek_seq, seq))
> +		return false;
> +
> +	return scc_dead;
> +}
> +
>  static void unix_collect_skb(struct list_head *scc, struct sk_buff_head *hitlist)
>  {
>  	struct unix_vertex *vertex;
> @@ -404,9 +453,6 @@ static bool unix_scc_cyclic(struct list_head *scc)
>  	return false;
>  }
>  
> -static LIST_HEAD(unix_visited_vertices);
> -static unsigned long unix_vertex_grouped_index = UNIX_VERTEX_INDEX_MARK2;
> -
>  static unsigned long __unix_walk_scc(struct unix_vertex *vertex,
>  				     unsigned long *last_index,
>  				     struct sk_buff_head *hitlist)
> @@ -474,9 +520,7 @@ static unsigned long __unix_walk_scc(struct unix_vertex *vertex,
>  	}
>  
>  	if (vertex->index == vertex->scc_index) {
> -		struct unix_vertex *v;
>  		struct list_head scc;
> -		bool scc_dead = true;
>  
>  		/* SCC finalised.
>  		 *
> @@ -485,18 +529,7 @@ static unsigned long __unix_walk_scc(struct unix_vertex *vertex,
>  		 */
>  		__list_cut_position(&scc, &vertex_stack, &vertex->scc_entry);
>  
> -		list_for_each_entry_reverse(v, &scc, scc_entry) {
> -			/* Don't restart DFS from this vertex in unix_walk_scc(). */
> -			list_move_tail(&v->entry, &unix_visited_vertices);
> -
> -			/* Mark vertex as off-stack. */
> -			v->index = unix_vertex_grouped_index;
> -
> -			if (scc_dead)
> -				scc_dead = unix_vertex_dead(v);
> -		}
> -
> -		if (scc_dead) {
> +		if (unix_scc_dead(&scc, false)) {
>  			unix_collect_skb(&scc, hitlist);
>  		} else {
>  			if (unix_vertex_max_scc_index < vertex->scc_index)
> @@ -550,19 +583,11 @@ static void unix_walk_scc_fast(struct sk_buff_head *hitlist)
>  	while (!list_empty(&unix_unvisited_vertices)) {
>  		struct unix_vertex *vertex;
>  		struct list_head scc;
> -		bool scc_dead = true;
>  
>  		vertex = list_first_entry(&unix_unvisited_vertices, typeof(*vertex), entry);
>  		list_add(&scc, &vertex->scc_entry);
>  
> -		list_for_each_entry_reverse(vertex, &scc, scc_entry) {
> -			list_move_tail(&vertex->entry, &unix_visited_vertices);
> -
> -			if (scc_dead)
> -				scc_dead = unix_vertex_dead(vertex);
> -		}
> -
> -		if (scc_dead) {
> +		if (unix_scc_dead(&scc, true)) {
>  			cyclic_sccs--;
>  			unix_collect_skb(&scc, hitlist);
>  		}
> @@ -577,8 +602,6 @@ static void unix_walk_scc_fast(struct sk_buff_head *hitlist)
>  		cyclic_sccs ? UNIX_GRAPH_CYCLIC : UNIX_GRAPH_NOT_CYCLIC);
>  }
>  
> -static bool gc_in_progress;
> -
>  static void unix_gc(struct work_struct *work)
>  {
>  	struct sk_buff_head hitlist;
> -- 
> 2.53.0.473.g4a7958ca14-goog
>

-- 
Lee Jones [李琼斯]