All of lore.kernel.org
 help / color / mirror / Atom feed
From: Martin KaFai Lau <martin.lau@linux.dev>
To: Kuniyuki Iwashima <kuniyu@google.com>,
	Michal Luczaj <mhal@rbox.co>,
	Jakub Sitnicki <jakub@cloudflare.com>,
	John Fastabend <john.fastabend@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>,
	Eric Dumazet <edumazet@google.com>,
	Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
	Simon Horman <horms@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	netdev@vger.kernel.org, bpf@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH bpf] bpf, sockmap: Fix af_unix null-ptr-deref in proto update
Date: Mon, 2 Feb 2026 11:15:25 -0800	[thread overview]
Message-ID: <7603c0e6-cd5b-452b-b710-73b64bd9de26@linux.dev> (raw)
In-Reply-To: <CAAVpQUAc3S+Ebx2qp-Pbg9qH-Zc=yHdzVJbchCu++V_4XF3DLg@mail.gmail.com>

On 1/31/26 2:06 AM, Kuniyuki Iwashima wrote:
> On Fri, Jan 30, 2026 at 1:30 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>
>> On 1/30/26 3:00 AM, Michal Luczaj wrote:
>>>>> Follow-up to discussion at
>>>>> https://lore.kernel.org/netdev/20240610174906.32921-1-kuniyu@amazon.com/.
>>>>
>>>> It is a long thread to dig. Please summarize the discussion in the
>>>> commit message.
>>>
>>> OK, there we go:
>>>
>>> The root cause of the null-ptr-deref is that unix_stream_connect() sets
>>> sk_state (`WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED)`) _before_ it assigns
>>> a peer (`unix_peer(sk) = newsk`). sk_state == TCP_ESTABLISHED makes
>>> sock_map_sk_state_allowed() believe that socket is properly set up, which
>>> would include having a defined peer.
>>>
>>> In other words, there's a window when you can call
>>> unix_stream_bpf_update_proto() on socket which still has unix_peer(sk) == NULL.
>>>
>>> My initial idea was to simply move peer assignment _before_ the sk_state
>>> update, but the maintainer wasn't interested in changing the
>>> unix_stream_connect() hot path. He suggested taking care of it in the
>>> sockmap code.
> 
> Yes, we already have a memory barrier for unix_peer(sk) there
> (to save sock_hold()/sock_put() in sendmsg(), see 830a1e5c212fb)
> and another one just for sk->sk_state is not worth the unlikely
> case in sockmap by a buggy user.
> 
> 
>>>
>>> My understanding is that users are not supposed to put sockets in a sockmap
>>> when said socket is only half-way through connect() call. Hence `return
>>> -EINVAL` on a missing peer. Now, if users should be allowed to legally race
>>> connect() vs. sockmap update, then I guess we can wait for connect() to
>>> "finalize" e.g. by taking the unix_state_lock(), as discussed below.
> 
> If a user hit the issue, the user must have updated sockmap while the
> user knows connect() had not returned.  Such a user must prepare
> for failures since it could occur before sock_map_sk_state_allowed() too.
> 
> This is a subtle timing issue and I don't think the kernel should be
> friendly to such buggy users by waiting for connect() etc.

I don't have a use case for parallel connect and map update either. 
Also, I have no concern about returning -EINVAL in map_update for the 
not-yet-completed unix_stream_connect().

However, TCP/UDP (and probably vsock also?) do not have this racing 
issue because sock_map follows the lock usage in TCP/UDP as in other 
parts of the kernel. Why AF_UNIX is an exception and unix_state_lock() 
is not used in sock_map.

John and Jakub Stinicki, could this be an oversight?

>>>
>>>>    From looking at this commit message, if the existing lock_sock held by
>>>> update_elem is not useful for af_unix,
>>>
>>> Right, the existing lock_sock is not useful. update's lock_sock holds
>>> sock::sk_lock, while unix_state_lock() holds unix_sock::lock.
>>
>> It sounds like lock_sock is the incorrect lock to hold for af_unix. Is
>> taking lock_sock in sock_map doing anything useful for af_unix? Should
>> sock_map hold the unix_state_lock instead of lock_sock?
> 
> If sockmap code does not sleep, unix_state_lock can be used there.

afaik, bpf prog using the sockmap cannot sleep. Search for "if 
(prog->sleepable)" in "check_map_prog_compatibility()". The bpf prog 
attached to sock_map cannot sleep either, there is "can_be_sleepable()" 
check in the verifier.

>>>
>>>           if (!psock->sk_pair) {
>>> +               unix_state_lock(sk);
>>> +               unix_state_unlock(sk);
> 
> I don't like this... we had a similar one in recvmsg(MSG_PEEK) path
> for GC with a biiiiiig comment, which I removed in 118f457da9ed .

Me neither, for both empty critical section and big fat comment. This 
usually suggests something else is incorrect to begin with. I believe 
the wrong lock is taken in this case. Unless there is something 
prohibiting taking unix_state_lock in sock_map, fix it properly instead.

> 
> 
>>>                   sk_pair = unix_peer(sk);
>>>                   sock_hold(sk_pair);
>>
>> I don't have a strong opinion on waiting or checking NULL. imo, both are
>> not easy to understand. One is sk_state had already been checked earlier
>> under a lock_sock but still needs to check NULL on unix_peer(). Another
>> one is an empty unix_state_[un]lock(). If taking unix_state_lock, may as
>> well just use the existing unix_peer_get(sk).
> 
> Yes, unix_peer_get() can be safely used there (with extra sock_put()).

The sock_map needs to sock_hold(sk_pair) anyway and stores it in 
psock->sk_pair, so I don't think it needs an extra sock_put().

> 
> 
>> If its return value cannot
>> (?) be NULL, WARN_ON_ONCE() instead of having a special empty
> 
> I suggested WARN_ON_ONCE() because Michal reproduced it with
> mdelay() and I did not think it could occur in real life, but considering
> PREEMPT_RT, it could be real.  So, the current form in this patch looks
> good to me.

hmm... If unix_peer_get(sk) is used and it takes (and waits for) the 
unix_state_lock, it shouldn't be NULL? The above empty [un]lock idea 
does not check for NULL on unix_peer() either. Or am I missing something?

Regardless, if the proper lock is held, all this complication and 
reasoning will go away.


  parent reply	other threads:[~2026-02-02 19:15 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-01-29 16:47 [PATCH bpf] bpf, sockmap: Fix af_unix null-ptr-deref in proto update Michal Luczaj
2026-01-29 19:41 ` Martin KaFai Lau
2026-01-30 11:00   ` Michal Luczaj
2026-01-30 21:29     ` Martin KaFai Lau
2026-01-31 10:06       ` Kuniyuki Iwashima
2026-02-02 15:10         ` Michal Luczaj
2026-02-03  3:53           ` Martin KaFai Lau
2026-02-03  9:57             ` Michal Luczaj
2026-02-03 19:47               ` Kuniyuki Iwashima
2026-02-04  7:15                 ` Martin KaFai Lau
2026-02-04  7:58                   ` Kuniyuki Iwashima
2026-02-04 15:41                     ` Michal Luczaj
2026-02-04 19:16                       ` Kuniyuki Iwashima
2026-02-04 20:18                         ` Martin KaFai Lau
2026-02-04 19:34                       ` Martin KaFai Lau
2026-02-04 21:09                         ` Kuniyuki Iwashima
2026-02-05  0:55                           ` Martin KaFai Lau
2026-02-05  2:00                             ` Martin KaFai Lau
2026-02-05  7:39                               ` Kuniyuki Iwashima
2026-02-04 23:25                         ` Michal Luczaj
2026-02-05  0:27                           ` Kuniyuki Iwashima
2026-02-05  0:31                           ` Martin KaFai Lau
2026-02-02 19:15         ` Martin KaFai Lau [this message]
2026-02-07 14:37           ` Michal Luczaj

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=7603c0e6-cd5b-452b-b710-73b64bd9de26@linux.dev \
    --to=martin.lau@linux.dev \
    --cc=bpf@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=davem@davemloft.net \
    --cc=edumazet@google.com \
    --cc=horms@kernel.org \
    --cc=jakub@cloudflare.com \
    --cc=john.fastabend@gmail.com \
    --cc=kuba@kernel.org \
    --cc=kuniyu@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mhal@rbox.co \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.