From: Lee Jones <lee@kernel.org>
To: Kuniyuki Iwashima <kuniyu@google.com>, stable@vger.kernel.org
Cc: "David S. Miller" <davem@davemloft.net>,
Eric Dumazet <edumazet@google.com>,
Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
Simon Horman <horms@kernel.org>,
Kuniyuki Iwashima <kuni1840@gmail.com>,
Linus Torvalds <torvalds@linuxfoundation.org>,
netdev@vger.kernel.org, Igor Ushakov <sysroot314@gmail.com>
Subject: Re: [PATCH v3 net] af_unix: Give up GC if MSG_PEEK intervened.
Date: Tue, 7 Apr 2026 16:58:27 +0100
Message-ID: <20260407155827.GA1993342@google.com>
In-Reply-To: <20260311054043.1231316-1-kuniyu@google.com>
INTENTIONAL TOP POST
I note that this was not sent to Stable, but please include it.
> Igor Ushakov reported, with a nice repro, that GC purged the
> receive queue of a live socket due to a race with MSG_PEEK.
>
> This is the exact same issue previously fixed by commit
> cbcf01128d0a ("af_unix: fix garbage collect vs MSG_PEEK").
>
> After GC was replaced with the current algorithm, the commit in
> the Fixes tag removed the locking dance in unix_peek_fds() and
> thus reintroduced the same issue.
>
> The problem is that MSG_PEEK bumps a file refcount without
> interacting with GC.
>
> Consider an SCC containing sk-A and sk-B, where sk-A is
> close()d but can be recv()ed via sk-B.
>
> The bad thing happens if sk-A is recv()ed with MSG_PEEK from
> sk-B and sk-B is close()d while GC is checking unix_vertex_dead()
> for sk-A and sk-B.
>
> GC thread User thread
> --------- -----------
> unix_vertex_dead(sk-A)
> -> true <------.
> \
> `------ recv(sk-B, MSG_PEEK)
> invalidate !! -> sk-A's file refcount : 1 -> 2
>
> close(sk-B)
> -> sk-B's file refcount : 2 -> 1
> unix_vertex_dead(sk-B)
> -> true
>
> Initially, sk-A's file refcount is 1, held by the inflight fd in
> sk-B's receive queue. GC thinks sk-A is dead because the file
> refcount equals the number of its inflight fds.
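The liveness criterion above can be sketched in a few lines (illustrative only; the helper name is mine, and the real unix_vertex_dead() performs additional checks):

```python
# Illustrative sketch of the liveness criterion (hypothetical helper;
# the real unix_vertex_dead() performs additional checks): a socket
# looks dead when every reference to its file comes from inflight fds.
def vertex_dead(file_refcount, nr_inflight_fds):
    return file_refcount == nr_inflight_fds

assert vertex_dead(1, 1)      # sk-A before the race: only the inflight fd
assert not vertex_dead(2, 1)  # after MSG_PEEK silently bumps the refcount
```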
>
> However, sk-A's file refcount is bumped silently by MSG_PEEK,
> which invalidates the previous evaluation.
>
> At this moment, sk-B's file refcount is 2: one held by the open fd
> and one by the inflight fd in sk-A. The subsequent close() releases
> the refcount held by the open fd.
>
> Finally, GC incorrectly concludes that both sk-A and sk-B are dead.
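For anyone backporting this, the refcount bump at the heart of the race can be demonstrated from userspace. This is a sketch (the helper name peek_fd is mine), showing that each MSG_PEEK of a queued SCM_RIGHTS message installs a fresh descriptor for the inflight file:

```python
# Userspace sketch (not kernel code): each MSG_PEEK of a queued
# SCM_RIGHTS message installs a new descriptor for the inflight
# file, bumping its struct file refcount behind GC's back.
import array
import os
import socket

a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
inflight_r, inflight_w = os.pipe()  # the file we put "in flight"

# Queue the pipe's read end as an inflight fd in b's receive queue.
a.sendmsg([b"x"], [(socket.SOL_SOCKET, socket.SCM_RIGHTS,
                    array.array("i", [inflight_r]))])

def peek_fd(sock):
    # Hypothetical helper: peek the message without consuming it;
    # the kernel dups the inflight file into a fresh local fd.
    _, anc, _, _ = sock.recvmsg(1, socket.CMSG_SPACE(
        array.array("i").itemsize), socket.MSG_PEEK)
    fds = array.array("i")
    for level, typ, data in anc:
        if level == socket.SOL_SOCKET and typ == socket.SCM_RIGHTS:
            fds.frombytes(data[:len(data) - len(data) % fds.itemsize])
    return fds[0]

fd1 = peek_fd(b)  # first peek: new fd, file refcount bumped
fd2 = peek_fd(b)  # message still queued: yet another new fd
assert fd1 != fd2
```

Both descriptors refer to the same underlying file; only the fd numbers differ.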
>
> One option is to restore the locking dance in unix_peek_fds(),
> but we can resolve this more elegantly thanks to the new algorithm.
>
> The point is that the issue does not occur without the subsequent
> close() and we actually do not need to synchronise MSG_PEEK with
> the dead SCC detection.
>
> When the issue occurs, close() and GC touch the same file refcount.
> If GC sees the refcount being decremented by close(), it can just
> give up garbage-collecting the SCC.
>
> Therefore, we only need to signal the race during MSG_PEEK with
> a proper memory barrier to make it visible to the GC.
>
> Let's use seqcount_t to notify GC when MSG_PEEK occurs and let
> it defer the SCC to the next run.
>
> This way, the MSG_PEEK side takes no lock unless GC is in progress,
> so we avoid imposing a penalty on every MSG_PEEK unnecessarily.
>
> Note that we could retry within unix_scc_dead() if MSG_PEEK is
> detected, but we do not do so, to avoid a hung-task splat caused
> by abusive MSG_PEEK calls.
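The seqcount usage can be illustrated with a simplified userspace analogue (not the kernel implementation: the real seqcount_t uses memory barriers, and raw_write_seqcount_barrier() bumps the count so that any reader whose critical section spans it must retry; here a plain counter and Python's GIL stand in):

```python
# Simplified analogue of the seqcount pattern (illustrative only).
import threading

class PeekSeq:
    def __init__(self):
        self.seq = 0
        self.lock = threading.Lock()  # stands in for unix_peek_lock

    def write_barrier(self):
        # MSG_PEEK side: signal that a peek happened.
        with self.lock:
            self.seq += 2  # stays even: no writer-in-progress state here

    def read_begin(self):
        # GC side: sample the sequence before evaluating the SCC.
        return self.seq

    def read_retry(self, start):
        # GC side: True if a peek intervened; defer the SCC.
        return self.seq != start

ps = PeekSeq()
start = ps.read_begin()      # GC: read_seqcount_begin()
ps.write_barrier()           # user thread: recv(..., MSG_PEEK)
assert ps.read_retry(start)  # GC gives up this SCC for this round
```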
>
> Fixes: 118f457da9ed ("af_unix: Remove lock dance in unix_peek_fds().")
> Reported-by: Igor Ushakov <sysroot314@gmail.com>
> Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
> ---
> v3: Check !fpl and add spinlock in unix_peek_fpl()
> v2: https://lore.kernel.org/all/20260309185823.3502204-1-kuniyu@google.com/
> * Use seqcount_t for proper memory barrier
> v1: https://lore.kernel.org/netdev/20260308030406.1825938-1-kuniyu@google.com/
> ---
> net/unix/af_unix.c | 2 ++
> net/unix/af_unix.h | 1 +
> net/unix/garbage.c | 79 ++++++++++++++++++++++++++++++----------------
> 3 files changed, 54 insertions(+), 28 deletions(-)
>
> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
> index 7eaa5b187fef..b23c33df8b46 100644
> --- a/net/unix/af_unix.c
> +++ b/net/unix/af_unix.c
> @@ -1958,6 +1958,8 @@ static void unix_detach_fds(struct scm_cookie *scm, struct sk_buff *skb)
> static void unix_peek_fds(struct scm_cookie *scm, struct sk_buff *skb)
> {
> scm->fp = scm_fp_dup(UNIXCB(skb).fp);
> +
> + unix_peek_fpl(scm->fp);
> }
>
> static void unix_destruct_scm(struct sk_buff *skb)
> diff --git a/net/unix/af_unix.h b/net/unix/af_unix.h
> index c4f1b2da363d..8119dbeef3a3 100644
> --- a/net/unix/af_unix.h
> +++ b/net/unix/af_unix.h
> @@ -29,6 +29,7 @@ void unix_del_edges(struct scm_fp_list *fpl);
> void unix_update_edges(struct unix_sock *receiver);
> int unix_prepare_fpl(struct scm_fp_list *fpl);
> void unix_destroy_fpl(struct scm_fp_list *fpl);
> +void unix_peek_fpl(struct scm_fp_list *fpl);
> void unix_schedule_gc(struct user_struct *user);
>
> /* SOCK_DIAG */
> diff --git a/net/unix/garbage.c b/net/unix/garbage.c
> index 816e8fa2b062..a7967a345827 100644
> --- a/net/unix/garbage.c
> +++ b/net/unix/garbage.c
> @@ -318,6 +318,25 @@ void unix_destroy_fpl(struct scm_fp_list *fpl)
> unix_free_vertices(fpl);
> }
>
> +static bool gc_in_progress;
> +static seqcount_t unix_peek_seq = SEQCNT_ZERO(unix_peek_seq);
> +
> +void unix_peek_fpl(struct scm_fp_list *fpl)
> +{
> + static DEFINE_SPINLOCK(unix_peek_lock);
> +
> + if (!fpl || !fpl->count_unix)
> + return;
> +
> + if (!READ_ONCE(gc_in_progress))
> + return;
> +
> + /* Invalidate the final refcnt check in unix_vertex_dead(). */
> + spin_lock(&unix_peek_lock);
> + raw_write_seqcount_barrier(&unix_peek_seq);
> + spin_unlock(&unix_peek_lock);
> +}
> +
> static bool unix_vertex_dead(struct unix_vertex *vertex)
> {
> struct unix_edge *edge;
> @@ -351,6 +370,36 @@ static bool unix_vertex_dead(struct unix_vertex *vertex)
> return true;
> }
>
> +static LIST_HEAD(unix_visited_vertices);
> +static unsigned long unix_vertex_grouped_index = UNIX_VERTEX_INDEX_MARK2;
> +
> +static bool unix_scc_dead(struct list_head *scc, bool fast)
> +{
> + struct unix_vertex *vertex;
> + bool scc_dead = true;
> + unsigned int seq;
> +
> + seq = read_seqcount_begin(&unix_peek_seq);
> +
> + list_for_each_entry_reverse(vertex, scc, scc_entry) {
> + /* Don't restart DFS from this vertex. */
> + list_move_tail(&vertex->entry, &unix_visited_vertices);
> +
> + /* Mark vertex as off-stack for __unix_walk_scc(). */
> + if (!fast)
> + vertex->index = unix_vertex_grouped_index;
> +
> + if (scc_dead)
> + scc_dead = unix_vertex_dead(vertex);
> + }
> +
> + /* If MSG_PEEK intervened, defer this SCC to the next round. */
> + if (read_seqcount_retry(&unix_peek_seq, seq))
> + return false;
> +
> + return scc_dead;
> +}
> +
> static void unix_collect_skb(struct list_head *scc, struct sk_buff_head *hitlist)
> {
> struct unix_vertex *vertex;
> @@ -404,9 +453,6 @@ static bool unix_scc_cyclic(struct list_head *scc)
> return false;
> }
>
> -static LIST_HEAD(unix_visited_vertices);
> -static unsigned long unix_vertex_grouped_index = UNIX_VERTEX_INDEX_MARK2;
> -
> static unsigned long __unix_walk_scc(struct unix_vertex *vertex,
> unsigned long *last_index,
> struct sk_buff_head *hitlist)
> @@ -474,9 +520,7 @@ static unsigned long __unix_walk_scc(struct unix_vertex *vertex,
> }
>
> if (vertex->index == vertex->scc_index) {
> - struct unix_vertex *v;
> struct list_head scc;
> - bool scc_dead = true;
>
> /* SCC finalised.
> *
> @@ -485,18 +529,7 @@ static unsigned long __unix_walk_scc(struct unix_vertex *vertex,
> */
> __list_cut_position(&scc, &vertex_stack, &vertex->scc_entry);
>
> - list_for_each_entry_reverse(v, &scc, scc_entry) {
> - /* Don't restart DFS from this vertex in unix_walk_scc(). */
> - list_move_tail(&v->entry, &unix_visited_vertices);
> -
> - /* Mark vertex as off-stack. */
> - v->index = unix_vertex_grouped_index;
> -
> - if (scc_dead)
> - scc_dead = unix_vertex_dead(v);
> - }
> -
> - if (scc_dead) {
> + if (unix_scc_dead(&scc, false)) {
> unix_collect_skb(&scc, hitlist);
> } else {
> if (unix_vertex_max_scc_index < vertex->scc_index)
> @@ -550,19 +583,11 @@ static void unix_walk_scc_fast(struct sk_buff_head *hitlist)
> while (!list_empty(&unix_unvisited_vertices)) {
> struct unix_vertex *vertex;
> struct list_head scc;
> - bool scc_dead = true;
>
> vertex = list_first_entry(&unix_unvisited_vertices, typeof(*vertex), entry);
> list_add(&scc, &vertex->scc_entry);
>
> - list_for_each_entry_reverse(vertex, &scc, scc_entry) {
> - list_move_tail(&vertex->entry, &unix_visited_vertices);
> -
> - if (scc_dead)
> - scc_dead = unix_vertex_dead(vertex);
> - }
> -
> - if (scc_dead) {
> + if (unix_scc_dead(&scc, true)) {
> cyclic_sccs--;
> unix_collect_skb(&scc, hitlist);
> }
> @@ -577,8 +602,6 @@ static void unix_walk_scc_fast(struct sk_buff_head *hitlist)
> cyclic_sccs ? UNIX_GRAPH_CYCLIC : UNIX_GRAPH_NOT_CYCLIC);
> }
>
> -static bool gc_in_progress;
> -
> static void unix_gc(struct work_struct *work)
> {
> struct sk_buff_head hitlist;
> --
> 2.53.0.473.g4a7958ca14-goog
>
--
Lee Jones [李琼斯]
Thread overview: 5+ messages
2026-03-11 5:40 [PATCH v3 net] af_unix: Give up GC if MSG_PEEK intervened Kuniyuki Iwashima
2026-03-12 20:40 ` patchwork-bot+netdevbpf
2026-04-07 15:58 ` Lee Jones [this message]
2026-04-07 16:03 ` Kuniyuki Iwashima
2026-04-07 17:01 ` Greg KH