public inbox for netdev@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH net] net: Drop the lock in skb_may_tx_timestamp()
@ 2026-02-14 23:24 Sebastian Andrzej Siewior
  2026-02-15  4:05 ` Willem de Bruijn
  2026-02-17  8:01 ` Eric Dumazet
  0 siblings, 2 replies; 12+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-02-14 23:24 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Kurt Kanzenbach,
	Paolo Abeni, Simon Horman, Willem de Bruijn

skb_may_tx_timestamp() may acquire sock::sk_callback_lock. The lock must
not be taken in IRQ context, only softirq is okay. A few drivers receive
the timestamp via a dedicated interrupt and complete the TX timestamp
from that handler. This will lead to a deadlock if the lock is already
write-locked on the same CPU.

Taking the lock can be avoided. The socket (pointed by the skb) will
remain valid until the skb is released. The ->sk_socket and ->file
member will be set to NULL once the user closes the socket which may
happen before the timestamp arrives.
If we happen to observe the pointer while the socket is closing but
before the pointer is set to NULL then we may use it because both
pointer (and the file's cred member) are RCU freed and the IRQ and
softirq handler qualify as a RCU-read section.

Fixes: b245be1f4db1a ("net-timestamp: no-payload only sysctl")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 net/core/skbuff.c | 21 +++++++++++++++------
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 61746c2b95f63..a174010e334a1 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -5555,16 +5555,25 @@ static void __skb_complete_tx_timestamp(struct sk_buff *skb,
 
 static bool skb_may_tx_timestamp(struct sock *sk, bool tsonly)
 {
-	bool ret;
+	struct socket *sock;
+	struct file *file;
 
 	if (likely(tsonly || READ_ONCE(sock_net(sk)->core.sysctl_tstamp_allow_data)))
 		return true;
 
-	read_lock_bh(&sk->sk_callback_lock);
-	ret = sk->sk_socket && sk->sk_socket->file &&
-	      file_ns_capable(sk->sk_socket->file, &init_user_ns, CAP_NET_RAW);
-	read_unlock_bh(&sk->sk_callback_lock);
-	return ret;
+	/* The sk pointer remains valid as long as the skb is. The sk_socket and
+	 * file pointer may become NULL if the socket is closed. Both structures
+	 * (including file->cred) are RCU freed which means they can be accessed
+	 * within a RCU read section.
+	 */
+	lockdep_assert_in_rcu_reader();
+	sock = READ_ONCE(sk->sk_socket);
+	if (!sock)
+		return false;
+	file = READ_ONCE(sock->file);
+	if (!file)
+		return false;
+	return file_ns_capable(file, &init_user_ns, CAP_NET_RAW);
 }
 
 void skb_complete_tx_timestamp(struct sk_buff *skb,
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH net] net: Drop the lock in skb_may_tx_timestamp()
  2026-02-14 23:24 [PATCH net] net: Drop the lock in skb_may_tx_timestamp() Sebastian Andrzej Siewior
@ 2026-02-15  4:05 ` Willem de Bruijn
  2026-02-15 18:46   ` Sebastian Andrzej Siewior
  2026-02-17  8:01 ` Eric Dumazet
  1 sibling, 1 reply; 12+ messages in thread
From: Willem de Bruijn @ 2026-02-15  4:05 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, netdev
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Kurt Kanzenbach,
	Paolo Abeni, Simon Horman, Willem de Bruijn

Sebastian Andrzej Siewior wrote:
> skb_may_tx_timestamp() may acquire sock::sk_callback_lock. The lock must
> not be taken in IRQ context, only softirq is okay. A few drivers receive
> the timestamp via a dedicated interrupt and complete the TX timestamp
> from that handler. This will lead to a deadlock if the lock is already
> write-locked on the same CPU.
> 
> Taking the lock can be avoided. The socket (pointed by the skb) will
> remain valid until the skb is released. The ->sk_socket and ->file
> member will be set to NULL once the user closes the socket which may
> happen before the timestamp arrives.
> If we happen to observe the pointer while the socket is closing but
> before the pointer is set to NULL then we may use it because both
> pointer (and the file's cred member) are RCU freed and the IRQ and
> softirq handler qualify as a RCU-read section.
> 
> Fixes: b245be1f4db1a ("net-timestamp: no-payload only sysctl")
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
>  net/core/skbuff.c | 21 +++++++++++++++------
>  1 file changed, 15 insertions(+), 6 deletions(-)
> 
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 61746c2b95f63..a174010e334a1 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -5555,16 +5555,25 @@ static void __skb_complete_tx_timestamp(struct sk_buff *skb,
>  
>  static bool skb_may_tx_timestamp(struct sock *sk, bool tsonly)
>  {
> -	bool ret;
> +	struct socket *sock;
> +	struct file *file;
>  
>  	if (likely(tsonly || READ_ONCE(sock_net(sk)->core.sysctl_tstamp_allow_data)))
>  		return true;
>  
> -	read_lock_bh(&sk->sk_callback_lock);
> -	ret = sk->sk_socket && sk->sk_socket->file &&
> -	      file_ns_capable(sk->sk_socket->file, &init_user_ns, CAP_NET_RAW);
> -	read_unlock_bh(&sk->sk_callback_lock);
> -	return ret;
> +	/* The sk pointer remains valid as long as the skb is. The sk_socket and
> +	 * file pointer may become NULL if the socket is closed. Both structures
> +	 * (including file->cred) are RCU freed which means they can be accessed
> +	 * within a RCU read section.

The fields are not __rcu annotated. Is this common knowledge or is
there clear documentation of this behavior? Not disputing it, just not
familiar with the rules around these fields.

It does seem that updates to sk_socket, in sock_orphan and
sock_graft, are protected by sk_callback_lock.
> +	 */
> +	lockdep_assert_in_rcu_reader();

This function can be called from process context outside an RCU read
side section, I think.

Such as a packet socket sending and hitting the __skb_tstamp_tx
SCM_TSTAMP_SCHED at the top of __dev_queue_xmit.

> +	sock = READ_ONCE(sk->sk_socket);
> +	if (!sock)
> +		return false;
> +	file = READ_ONCE(sock->file);
> +	if (!file)
> +		return false;
> +	return file_ns_capable(file, &init_user_ns, CAP_NET_RAW);
>  }
>  
>  void skb_complete_tx_timestamp(struct sk_buff *skb,
> -- 
> 2.51.0
> 



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH net] net: Drop the lock in skb_may_tx_timestamp()
  2026-02-15  4:05 ` Willem de Bruijn
@ 2026-02-15 18:46   ` Sebastian Andrzej Siewior
  2026-02-16 16:47     ` Vadim Fedorenko
  2026-02-17  7:48     ` Jason Xing
  0 siblings, 2 replies; 12+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-02-15 18:46 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Kurt Kanzenbach, Paolo Abeni, Simon Horman, Willem de Bruijn

On 2026-02-14 23:05:27 [-0500], Willem de Bruijn wrote:
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> > @@ -5555,16 +5555,25 @@ static void __skb_complete_tx_timestamp(struct sk_buff *skb,
> >  
> >  static bool skb_may_tx_timestamp(struct sock *sk, bool tsonly)
> >  {
> > -	bool ret;
> > +	struct socket *sock;
> > +	struct file *file;
> >  
> >  	if (likely(tsonly || READ_ONCE(sock_net(sk)->core.sysctl_tstamp_allow_data)))
> >  		return true;
> >  
> > -	read_lock_bh(&sk->sk_callback_lock);
> > -	ret = sk->sk_socket && sk->sk_socket->file &&
> > -	      file_ns_capable(sk->sk_socket->file, &init_user_ns, CAP_NET_RAW);
> > -	read_unlock_bh(&sk->sk_callback_lock);
> > -	return ret;
> > +	/* The sk pointer remains valid as long as the skb is. The sk_socket and
> > +	 * file pointer may become NULL if the socket is closed. Both structures
> > +	 * (including file->cred) are RCU freed which means they can be accessed
> > +	 * within a RCU read section.
> 
> The fields are not __rcu annotated. Is this common knowledge or is
> there clear documentation of this behavior? Not disputing it, just not
> familiar with the rules around these fields.

The pointer is not RCU annotated because normally, within a syscall, the
file descriptor does not vanish and so the pointer remains valid. The
interrupt callback is an exception because we don't hold a reference on
the fd. But since the underlying data structures have a RCU lifetime we
could use it.

sock_alloc() allocates a struct socket_alloc which contains the struct
socket and struct inode (sock_alloc_inode()). This the sk_socket in
struct sock. Once you close the socket via the close syscall and the
last reference is gone then you end up in __fput(). The invoked
->release() callback is sock_close(). Here socket::file gets NULLed and
proto::release should sock_orphan() the sock where sock::sk_socket gets
NULLed (I don't see where else and packet_release() does so). The
"struct sock" remains around because the skb has a reference on.
Otherwise it would go, too.

Once that is done __fput() does dput() -> __dentry_kill() -> iput() ->
destroy_inode() which does RCU free the inode.
That means if we observe the pointer non-NULL we can use it under the
RCU rules because the inode is released via RCU. 

->file. At the end of __fput() there is file_free() which releases the
pointer. file goes to filp_cachep which is SLAB_TYPESAFE_BY_RCU. There
is a put_cred for file::f_cred before that and it does call_rcu() at the
end, too.

This is me reading the code on FRI/ SAT and double checking it with
ptp4l. I might have missed something but so far, it makes sense.

> It does seem that updates to sk_socket, in sock_orphan and
> sock_graft, are protected by sk_callback_lock.
> > +	 */
> > +	lockdep_assert_in_rcu_reader();
> 
> This function can be called from process context outside an RCU read
> side section, I think.
> 
> Such as a packet socket sending and hitting the __skb_tstamp_tx
> SCM_TSTAMP_SCHED at the top of __dev_queue_xmit.

If that is the case, then we could create a RCU read section here.

> > +	sock = READ_ONCE(sk->sk_socket);
> > +	if (!sock)
> > +		return false;
> > +	file = READ_ONCE(sock->file);
> > +	if (!file)
> > +		return false;
> > +	return file_ns_capable(file, &init_user_ns, CAP_NET_RAW);
> >  }
> >  
> >  void skb_complete_tx_timestamp(struct sk_buff *skb,

Sebastian

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH net] net: Drop the lock in skb_may_tx_timestamp()
  2026-02-15 18:46   ` Sebastian Andrzej Siewior
@ 2026-02-16 16:47     ` Vadim Fedorenko
  2026-02-16 18:08       ` Willem de Bruijn
  2026-02-17 12:13       ` Sebastian Andrzej Siewior
  2026-02-17  7:48     ` Jason Xing
  1 sibling, 2 replies; 12+ messages in thread
From: Vadim Fedorenko @ 2026-02-16 16:47 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, Willem de Bruijn
  Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Kurt Kanzenbach, Paolo Abeni, Simon Horman, Willem de Bruijn

On 15/02/2026 18:46, Sebastian Andrzej Siewior wrote:
> On 2026-02-14 23:05:27 [-0500], Willem de Bruijn wrote:
>>> --- a/net/core/skbuff.c
>>> +++ b/net/core/skbuff.c
>>> @@ -5555,16 +5555,25 @@ static void __skb_complete_tx_timestamp(struct sk_buff *skb,
>>>   
>>>   static bool skb_may_tx_timestamp(struct sock *sk, bool tsonly)
>>>   {
>>> -	bool ret;
>>> +	struct socket *sock;
>>> +	struct file *file;
>>>   
>>>   	if (likely(tsonly || READ_ONCE(sock_net(sk)->core.sysctl_tstamp_allow_data)))
>>>   		return true;
>>>   
>>> -	read_lock_bh(&sk->sk_callback_lock);
>>> -	ret = sk->sk_socket && sk->sk_socket->file &&
>>> -	      file_ns_capable(sk->sk_socket->file, &init_user_ns, CAP_NET_RAW);
>>> -	read_unlock_bh(&sk->sk_callback_lock);
>>> -	return ret;
>>> +	/* The sk pointer remains valid as long as the skb is. The sk_socket and
>>> +	 * file pointer may become NULL if the socket is closed. Both structures
>>> +	 * (including file->cred) are RCU freed which means they can be accessed
>>> +	 * within a RCU read section.
>>
>> The fields are not __rcu annotated. Is this common knowledge or is
>> there clear documentation of this behavior? Not disputing it, just not
>> familiar with the rules around these fields.
> 
> The pointer is not RCU annotated because normally, within a syscall, the
> file descriptor does not vanish and so the pointer remains valid. The
> interrupt callback is an exception because we don't hold a reference on
> the fd. But since the underlying data structures have a RCU lifetime we
> could use it.
> 
> sock_alloc() allocates a struct socket_alloc which contains the struct
> socket and struct inode (sock_alloc_inode()). This the sk_socket in
> struct sock. Once you close the socket via the close syscall and the
> last reference is gone then you end up in __fput(). The invoked
> ->release() callback is sock_close(). Here socket::file gets NULLed and
> proto::release should sock_orphan() the sock where sock::sk_socket gets
> NULLed (I don't see where else and packet_release() does so). The
> "struct sock" remains around because the skb has a reference on.
> Otherwise it would go, too.
> 
> Once that is done __fput() does dput() -> __dentry_kill() -> iput() ->
> destroy_inode() which does RCU free the inode.
> That means if we observe the pointer non-NULL we can use it under the
> RCU rules because the inode is released via RCU.
> 
> ->file. At the end of __fput() there is file_free() which releases the
> pointer. file goes to filp_cachep which is SLAB_TYPESAFE_BY_RCU. There
> is a put_cred for file::f_cred before that and it does call_rcu() at the
> end, too.
> 
> This is me reading the code on FRI/ SAT and double checking it with
> ptp4l. I might have missed something but so far, it makes sense.
> 
>> It does seem that updates to sk_socket, in sock_orphan and
>> sock_graft, are protected by sk_callback_lock.
>>> +	 */
>>> +	lockdep_assert_in_rcu_reader();
>>
>> This function can be called from process context outside an RCU read
>> side section, I think.
>>
>> Such as a packet socket sending and hitting the __skb_tstamp_tx
>> SCM_TSTAMP_SCHED at the top of __dev_queue_xmit.
> 
> If that is the case, then we could create a RCU read section here.

well, __dev_queue_xmit() grabs rcu_read_lock_bh() already, we just have
to move it right before timestamping check:

iff --git a/net/core/dev.c b/net/core/dev.c
index ac6bcb2a0784..e9a72da5a214 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4744,15 +4744,15 @@ int __dev_queue_xmit(struct sk_buff *skb, struct 
net_device *sb_dev)
         skb_reset_mac_header(skb);
         skb_assert_len(skb);

-       if (unlikely(skb_shinfo(skb)->tx_flags &
-                    (SKBTX_SCHED_TSTAMP | SKBTX_BPF)))
-               __skb_tstamp_tx(skb, NULL, NULL, skb->sk, SCM_TSTAMP_SCHED);
-
         /* Disable soft irqs for various locks below. Also
          * stops preemption for RCU.
          */
         rcu_read_lock_bh();

+       if (unlikely(skb_shinfo(skb)->tx_flags &
+                    (SKBTX_SCHED_TSTAMP | SKBTX_BPF)))
+               __skb_tstamp_tx(skb, NULL, NULL, skb->sk, SCM_TSTAMP_SCHED);
+
         skb_update_prio(skb);

         qdisc_pkt_len_segs_init(skb);

> 
>>> +	sock = READ_ONCE(sk->sk_socket);
>>> +	if (!sock)
>>> +		return false;
>>> +	file = READ_ONCE(sock->file);
>>> +	if (!file)
>>> +		return false;
>>> +	return file_ns_capable(file, &init_user_ns, CAP_NET_RAW);
>>>   }
>>>   
>>>   void skb_complete_tx_timestamp(struct sk_buff *skb,
> 
> Sebastian


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH net] net: Drop the lock in skb_may_tx_timestamp()
  2026-02-16 16:47     ` Vadim Fedorenko
@ 2026-02-16 18:08       ` Willem de Bruijn
  2026-02-17 13:28         ` Sebastian Andrzej Siewior
  2026-02-17 12:13       ` Sebastian Andrzej Siewior
  1 sibling, 1 reply; 12+ messages in thread
From: Willem de Bruijn @ 2026-02-16 18:08 UTC (permalink / raw)
  To: Vadim Fedorenko, Sebastian Andrzej Siewior, Willem de Bruijn
  Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Kurt Kanzenbach, Paolo Abeni, Simon Horman, Willem de Bruijn

Vadim Fedorenko wrote:
> On 15/02/2026 18:46, Sebastian Andrzej Siewior wrote:
> > On 2026-02-14 23:05:27 [-0500], Willem de Bruijn wrote:
> >>> --- a/net/core/skbuff.c
> >>> +++ b/net/core/skbuff.c
> >>> @@ -5555,16 +5555,25 @@ static void __skb_complete_tx_timestamp(struct sk_buff *skb,
> >>>   
> >>>   static bool skb_may_tx_timestamp(struct sock *sk, bool tsonly)
> >>>   {
> >>> -	bool ret;
> >>> +	struct socket *sock;
> >>> +	struct file *file;
> >>>   
> >>>   	if (likely(tsonly || READ_ONCE(sock_net(sk)->core.sysctl_tstamp_allow_data)))
> >>>   		return true;
> >>>   
> >>> -	read_lock_bh(&sk->sk_callback_lock);
> >>> -	ret = sk->sk_socket && sk->sk_socket->file &&
> >>> -	      file_ns_capable(sk->sk_socket->file, &init_user_ns, CAP_NET_RAW);
> >>> -	read_unlock_bh(&sk->sk_callback_lock);
> >>> -	return ret;
> >>> +	/* The sk pointer remains valid as long as the skb is. The sk_socket and
> >>> +	 * file pointer may become NULL if the socket is closed. Both structures
> >>> +	 * (including file->cred) are RCU freed which means they can be accessed
> >>> +	 * within a RCU read section.
> >>
> >> The fields are not __rcu annotated. Is this common knowledge or is
> >> there clear documentation of this behavior? Not disputing it, just not
> >> familiar with the rules around these fields.
> > 
> > The pointer is not RCU annotated because normally, within a syscall, the
> > file descriptor does not vanish and so the pointer remains valid. The
> > interrupt callback is an exception because we don't hold a reference on
> > the fd. But since the underlying data structures have a RCU lifetime we
> > could use it.
> > 
> > sock_alloc() allocates a struct socket_alloc which contains the struct
> > socket and struct inode (sock_alloc_inode()). This the sk_socket in
> > struct sock. Once you close the socket via the close syscall and the
> > last reference is gone then you end up in __fput(). The invoked
> > ->release() callback is sock_close(). Here socket::file gets NULLed and
> > proto::release should sock_orphan() the sock where sock::sk_socket gets
> > NULLed (I don't see where else and packet_release() does so). The
> > "struct sock" remains around because the skb has a reference on.
> > Otherwise it would go, too.
> > 
> > Once that is done __fput() does dput() -> __dentry_kill() -> iput() ->
> > destroy_inode() which does RCU free the inode.
> > That means if we observe the pointer non-NULL we can use it under the
> > RCU rules because the inode is released via RCU.
> > 
> > ->file. At the end of __fput() there is file_free() which releases the
> > pointer. file goes to filp_cachep which is SLAB_TYPESAFE_BY_RCU. There
> > is a put_cred for file::f_cred before that and it does call_rcu() at the
> > end, too.
> > 
> > This is me reading the code on FRI/ SAT and double checking it with
> > ptp4l. I might have missed something but so far, it makes sense.
> > 
> >> It does seem that updates to sk_socket, in sock_orphan and
> >> sock_graft, are protected by sk_callback_lock.
> >>> +	 */
> >>> +	lockdep_assert_in_rcu_reader();

It will be great to have this additional context in the commit
message.

And a Link: to the original discussion thread.

Btw, the commit states that a few drivers already call this from their
hard interrupt handler. Can you point to one or two?

If this is only for new drivers, it would be vastly preferable if this
can go through net-next.

> >>
> >> This function can be called from process context outside an RCU read
> >> side section, I think.
> >>
> >> Such as a packet socket sending and hitting the __skb_tstamp_tx
> >> SCM_TSTAMP_SCHED at the top of __dev_queue_xmit.
> > 
> > If that is the case, then we could create a RCU read section here.
> 
> well, __dev_queue_xmit() grabs rcu_read_lock_bh() already, we just have
> to move it right before timestamping check:

True. That was just the first example I found though.

I did not check tcp_ack_tstamp, or whether any drivers defer the
operation to a process context kthread without RCU protection.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH net] net: Drop the lock in skb_may_tx_timestamp()
  2026-02-15 18:46   ` Sebastian Andrzej Siewior
  2026-02-16 16:47     ` Vadim Fedorenko
@ 2026-02-17  7:48     ` Jason Xing
  2026-02-17 14:44       ` Sebastian Andrzej Siewior
  1 sibling, 1 reply; 12+ messages in thread
From: Jason Xing @ 2026-02-17  7:48 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Willem de Bruijn, netdev, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Kurt Kanzenbach, Paolo Abeni, Simon Horman,
	Willem de Bruijn

Hi Sebastian,

On Mon, Feb 16, 2026 at 2:46 AM Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
>
> On 2026-02-14 23:05:27 [-0500], Willem de Bruijn wrote:
> > > --- a/net/core/skbuff.c
> > > +++ b/net/core/skbuff.c
> > > @@ -5555,16 +5555,25 @@ static void __skb_complete_tx_timestamp(struct sk_buff *skb,
> > >
> > >  static bool skb_may_tx_timestamp(struct sock *sk, bool tsonly)
> > >  {
> > > -   bool ret;
> > > +   struct socket *sock;
> > > +   struct file *file;
> > >
> > >     if (likely(tsonly || READ_ONCE(sock_net(sk)->core.sysctl_tstamp_allow_data)))
> > >             return true;
> > >
> > > -   read_lock_bh(&sk->sk_callback_lock);
> > > -   ret = sk->sk_socket && sk->sk_socket->file &&
> > > -         file_ns_capable(sk->sk_socket->file, &init_user_ns, CAP_NET_RAW);
> > > -   read_unlock_bh(&sk->sk_callback_lock);
> > > -   return ret;
> > > +   /* The sk pointer remains valid as long as the skb is. The sk_socket and
> > > +    * file pointer may become NULL if the socket is closed. Both structures
> > > +    * (including file->cred) are RCU freed which means they can be accessed
> > > +    * within a RCU read section.
> >
> > The fields are not __rcu annotated. Is this common knowledge or is
> > there clear documentation of this behavior? Not disputing it, just not
> > familiar with the rules around these fields.
>
> The pointer is not RCU annotated because normally, within a syscall, the
> file descriptor does not vanish and so the pointer remains valid. The
> interrupt callback is an exception because we don't hold a reference on
> the fd. But since the underlying data structures have a RCU lifetime we
> could use it.
>
> sock_alloc() allocates a struct socket_alloc which contains the struct
> socket and struct inode (sock_alloc_inode()). This the sk_socket in
> struct sock. Once you close the socket via the close syscall and the
> last reference is gone then you end up in __fput(). The invoked
> ->release() callback is sock_close(). Here socket::file gets NULLed and
> proto::release should sock_orphan() the sock where sock::sk_socket gets
> NULLed (I don't see where else and packet_release() does so). The
> "struct sock" remains around because the skb has a reference on.
> Otherwise it would go, too.

If skb still exists, there is no way to do the real destruction of its
socket, at least I know it works for TCP. That means IIUC TCP doesn't
have such an issue?

I roughly know what you meant here, but still wondering if you can
provide a specific call trace or case where the socket using socket
timestamping feature leads to this NULL-pointer problem?

Thanks,
Jason

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH net] net: Drop the lock in skb_may_tx_timestamp()
  2026-02-14 23:24 [PATCH net] net: Drop the lock in skb_may_tx_timestamp() Sebastian Andrzej Siewior
  2026-02-15  4:05 ` Willem de Bruijn
@ 2026-02-17  8:01 ` Eric Dumazet
  2026-02-17 14:45   ` Sebastian Andrzej Siewior
  1 sibling, 1 reply; 12+ messages in thread
From: Eric Dumazet @ 2026-02-17  8:01 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: netdev, David S. Miller, Jakub Kicinski, Kurt Kanzenbach,
	Paolo Abeni, Simon Horman, Willem de Bruijn

On Sun, Feb 15, 2026 at 12:25 AM Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
>
> skb_may_tx_timestamp() may acquire sock::sk_callback_lock. The lock must
> not be taken in IRQ context, only softirq is okay. A few drivers receive
> the timestamp via a dedicated interrupt and complete the TX timestamp
> from that handler. This will lead to a deadlock if the lock is already
> write-locked on the same CPU.
>
> Taking the lock can be avoided. The socket (pointed by the skb) will
> remain valid until the skb is released. The ->sk_socket and ->file
> member will be set to NULL once the user closes the socket which may
> happen before the timestamp arrives.
> If we happen to observe the pointer while the socket is closing but
> before the pointer is set to NULL then we may use it because both
> pointer (and the file's cred member) are RCU freed and the IRQ and
> softirq handler qualify as a RCU-read section.
>
> Fixes: b245be1f4db1a ("net-timestamp: no-payload only sysctl")
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
>  net/core/skbuff.c | 21 +++++++++++++++------
>  1 file changed, 15 insertions(+), 6 deletions(-)
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 61746c2b95f63..a174010e334a1 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -5555,16 +5555,25 @@ static void __skb_complete_tx_timestamp(struct sk_buff *skb,
>
>  static bool skb_may_tx_timestamp(struct sock *sk, bool tsonly)
>  {
> -       bool ret;
> +       struct socket *sock;
> +       struct file *file;
>
>         if (likely(tsonly || READ_ONCE(sock_net(sk)->core.sysctl_tstamp_allow_data)))
>                 return true;
>
> -       read_lock_bh(&sk->sk_callback_lock);
> -       ret = sk->sk_socket && sk->sk_socket->file &&
> -             file_ns_capable(sk->sk_socket->file, &init_user_ns, CAP_NET_RAW);
> -       read_unlock_bh(&sk->sk_callback_lock);
> -       return ret;
> +       /* The sk pointer remains valid as long as the skb is. The sk_socket and
> +        * file pointer may become NULL if the socket is closed. Both structures
> +        * (including file->cred) are RCU freed which means they can be accessed
> +        * within a RCU read section.
> +        */
> +       lockdep_assert_in_rcu_reader();
> +       sock = READ_ONCE(sk->sk_socket);
> +       if (!sock)
> +               return false;
> +       file = READ_ONCE(sock->file);
> +       if (!file)
> +               return false;
> +       return file_ns_capable(file, &init_user_ns, CAP_NET_RAW);
>  }

I think this should work, but can you also add this in your next version:

Thanks !

diff --git a/include/net/sock.h b/include/net/sock.h
index 66b56288c1d3850439b2a0bed00be801d5770efa..6c9a83016e9551ed2e2a0d7edf32300b8a4327e7
100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2098,7 +2098,7 @@ static inline int sk_rx_queue_get(const struct sock *sk)

 static inline void sk_set_socket(struct sock *sk, struct socket *sock)
 {
-       sk->sk_socket = sock;
+       WRITE_ONCE(sk->sk_socket, sock);
        if (sock) {
                WRITE_ONCE(sk->sk_uid, SOCK_INODE(sock)->i_uid);
                WRITE_ONCE(sk->sk_ino, SOCK_INODE(sock)->i_ino);
diff --git a/net/socket.c b/net/socket.c
index 136b98c54fb375e2065df861dd670bdb014e3da4..05952188127f5bce41494e99e86418249e9022f2
100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -674,7 +674,7 @@ static void __sock_release(struct socket *sock,
struct inode *inode)
                iput(SOCK_INODE(sock));
                return;
        }
-       sock->file = NULL;
+       WRITE_ONCE(sock->file, NULL);
 }

 /**

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH net] net: Drop the lock in skb_may_tx_timestamp()
  2026-02-16 16:47     ` Vadim Fedorenko
  2026-02-16 18:08       ` Willem de Bruijn
@ 2026-02-17 12:13       ` Sebastian Andrzej Siewior
  1 sibling, 0 replies; 12+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-02-17 12:13 UTC (permalink / raw)
  To: Vadim Fedorenko
  Cc: Willem de Bruijn, netdev, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Kurt Kanzenbach, Paolo Abeni, Simon Horman,
	Willem de Bruijn

On 2026-02-16 16:47:03 [+0000], Vadim Fedorenko wrote:
> > > Such as a packet socket sending and hitting the __skb_tstamp_tx
> > > SCM_TSTAMP_SCHED at the top of __dev_queue_xmit.
> > 
> > If that is the case, then we could create a RCU read section here.
> 
> well, __dev_queue_xmit() grabs rcu_read_lock_bh() already, we just have
> to move it right before timestamping check:

For __skb_tstamp_tx() there is:
- hci_conn_tx_queue() -> __skb_tstamp_tx() which appears to be called
  from a worker and has a GPF_KERNEL allocation later.
- hci_num_comp_pkts_evt() (does acquire a mutex) ->
  hci_conn_tx_dequeue() -> __skb_tstamp_tx()
- prueth_rx_mgm_ts_thread_sr1() (explicit threaded IRQ) ->
  prueth_tx_ts_sr1() -> skb_tstamp_tx()

There is also skb_complete_tx_timestamp():
- ines_txtstamp_work() -> skb_complete_tx_timestamp()
- nxp_c45_do_aux_work() -> nxp_c45_process_txts() ->
  skb_complete_tx_timestamp()
- ksz_ptp_txtstamp_skb() -> skb_complete_tx_timestamp()
  and ksz_ptp_txtstamp_skb() has a wait_for_completion_timeout() so no
  RCU here either.
- hellcreek_txtstamp_work() -> skb_complete_tx_timestamp()

so maybe an explicit RCU region is easier.

Sebastian

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH net] net: Drop the lock in skb_may_tx_timestamp()
  2026-02-16 18:08       ` Willem de Bruijn
@ 2026-02-17 13:28         ` Sebastian Andrzej Siewior
  2026-02-17 14:14           ` Willem de Bruijn
  0 siblings, 1 reply; 12+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-02-17 13:28 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Vadim Fedorenko, netdev, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Kurt Kanzenbach, Paolo Abeni, Simon Horman,
	Willem de Bruijn

On 2026-02-16 13:08:43 [-0500], Willem de Bruijn wrote:
> It will be great to have this additional context in the commit
> message.
> 
> And a Link: to the original discussion thread.

Okay.

> Btw, the commit states that a few drivers already call this from their
> hard interrupt handler. Can you point to one or two?

- bxcan_tx_isr() -> can_get_echo_skb() -> __can_get_echo_skb() ->
  skb_tstamp_tx()

- at91_irq() -> at91_irq_tx() -> can_get_echo_skb() -> … ->
  skb_tstamp_tx()

- hclge_misc_irq_handle() -> hclge_ptp_clean_tx_hwts() ->
  skb_tstamp_tx()

- igc_intr_msi() -> igc_tsync_interrupt() -> igc_ptp_tx_tstamp_event()
  -> igc_ptp_tx_hwtstamp() -> igc_ptp_tx_reg_to_stamp() ->
  skb_tstamp_tx()

> If this is only for new drivers, it would be vastly preferable if this
> can go through net-next.

I found three, I knew about igc. I don't mind net-next but if we keep
the stable tag, it will be backported anyway, it will just take longer.
But if we want to do this in order to catch problems with it before it
hits stable then it makes sense.

Sebastian

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH net] net: Drop the lock in skb_may_tx_timestamp()
  2026-02-17 13:28         ` Sebastian Andrzej Siewior
@ 2026-02-17 14:14           ` Willem de Bruijn
  0 siblings, 0 replies; 12+ messages in thread
From: Willem de Bruijn @ 2026-02-17 14:14 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, Willem de Bruijn
  Cc: Vadim Fedorenko, netdev, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Kurt Kanzenbach, Paolo Abeni, Simon Horman,
	Willem de Bruijn

Sebastian Andrzej Siewior wrote:
> On 2026-02-16 13:08:43 [-0500], Willem de Bruijn wrote:
> > It will be great to have this additional context in the commit
> > message.
> > 
> > And a Link: to the original discussion thread.
> 
> Okay.
> 
> > Btw, the commit states that a few drivers already call this from their
> > hard interrupt handler. Can you point to one or two?
> 
> - bxcan_tx_isr() -> can_get_echo_skb() -> __can_get_echo_skb() ->
>   skb_tstamp_tx()
> 
> - at91_irq() -> at91_irq_tx() -> can_get_echo_skb() -> … ->
>   skb_tstamp_tx()
> 
> - hclge_misc_irq_handle() -> hclge_ptp_clean_tx_hwts() ->
>   skb_tstamp_tx()
> 
> - igc_intr_msi() -> igc_tsync_interrupt() -> igc_ptp_tx_tstamp_event()
>   -> igc_ptp_tx_hwtstamp() -> igc_ptp_tx_reg_to_stamp() ->
>   skb_tstamp_tx()
> 
> > If this is only for new drivers, it would be vastly preferable if this
> > can go through net-next.
> 
> I found three, I knew about igc. I don't mind net-next but if we keep
> the stable tag, it will be backported anyway, it will just take longer.
> But if we want to do this in order to catch problems with it before it
> hits stable then it makes sense.

These are more than enough cases to warrant net. Thanks.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH net] net: Drop the lock in skb_may_tx_timestamp()
  2026-02-17  7:48     ` Jason Xing
@ 2026-02-17 14:44       ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 12+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-02-17 14:44 UTC (permalink / raw)
  To: Jason Xing
  Cc: Willem de Bruijn, netdev, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Kurt Kanzenbach, Paolo Abeni, Simon Horman,
	Willem de Bruijn

On 2026-02-17 15:48:11 [+0800], Jason Xing wrote:
> Hi Sebastian,
Hi,

> > sock_alloc() allocates a struct socket_alloc which contains the struct
> > socket and struct inode (sock_alloc_inode()). This the sk_socket in
> > struct sock. Once you close the socket via the close syscall and the
> > last reference is gone then you end up in __fput(). The invoked
> > ->release() callback is sock_close(). Here socket::file gets NULLed and
> > proto::release should sock_orphan() the sock where sock::sk_socket gets
> > NULLed (I don't see where else and packet_release() does so). The
> > "struct sock" remains around because the skb has a reference on.
> > Otherwise it would go, too.
> 
> If skb still exists, there is no way to do the real destruction of its
> socket, at least I know it works for TCP. That means IIUC TCP doesn't
> have such an issue?

There is struct sock and struct socket. socket will be removed once the
fd goes. sock remains until the last skb is gone.

> I roughly know what you meant here, but still wondering if you can
> provide a specific call trace or case where the socket using socket
> timestamping feature leads to this NULL-pointer problem?

So assuming we delay the timestamp delivery and as soon as ptp4l
complains, we kill it. Then we see this on ptp4l's side:
| ptp4l[89.480]: port 1 (enp4s0): delay timeout
| ptp4l[89.490]: timed out while polling for tx timestamp
| ptp4l[89.490]: increasing tx_timestamp_timeout or increasing kworker priority may correct this issue, but a driver bug likely causes it
|CTRL-C

and delay delivery in igc:

diff --git a/drivers/net/ethernet/intel/igc/igc_main.c b/drivers/net/ethernet/intel/igc/igc_main.c
index 89a321a344d2..5bfd4a442b16 100644
--- a/drivers/net/ethernet/intel/igc/igc_main.c
+++ b/drivers/net/ethernet/intel/igc/igc_main.c
@@ -1546,6 +1546,7 @@ static bool igc_request_tx_tstamp(struct igc_adapter *adapter, struct sk_buff *s
 		if (tstamp->skb)
 			continue;
 
+		trace_printk("skb %px dest %pS\n", skb, skb->destructor);
 		tstamp->skb = skb_get(skb);
 		tstamp->start = jiffies;
 		*flags = tstamp->flags;
@@ -5563,6 +5564,21 @@ igc_features_check(struct sk_buff *skb, struct net_device *dev,
 	return features;
 }
 
+struct delayed_ptp {
+	struct delayed_work dw;
+	struct igc_adapter *adapter;
+};
+
+static void delayed_ts(struct work_struct *work)
+{
+	struct delayed_ptp *dtp = container_of(work, struct delayed_ptp, dw.work);
+
+	/* retrieve hardware timestamp */
+	trace_printk("\n");
+	igc_ptp_tx_tstamp_event(dtp->adapter);
+	kfree(dtp);
+}
+
 static void igc_tsync_interrupt(struct igc_adapter *adapter)
 {
 	struct igc_hw *hw = &adapter->hw;
@@ -5579,8 +5595,15 @@ static void igc_tsync_interrupt(struct igc_adapter *adapter)
 	}
 
 	if (tsicr & IGC_TSICR_TXTS) {
-		/* retrieve hardware timestamp */
-		igc_ptp_tx_tstamp_event(adapter);
+		struct delayed_ptp *dtp;
+
+		dtp = kmalloc(sizeof(*dtp), GFP_ATOMIC);
+		if (dtp) {
+			INIT_DELAYED_WORK(&dtp->dw, delayed_ts);
+			dtp->adapter = adapter;
+			trace_printk("\n");
+			schedule_delayed_work(&dtp->dw, HZ * 10);
+		}
 	}
 
 	if (tsicr & IGC_TSICR_TT0) {

and few more in net/, I end up with:

|          ptp4l-745     [005] .....    88.609323: packet_create: sock ffff8ba689e28680 sk ffff8ba6822f8800
|          ptp4l-745     [005] .....    88.609325: sock_alloc_file: sock ffff8ba689e28680 sk ffff8ba6822f8800 file ffff8ba689575d80
socket allocation

|         <idle>-0       [004] dNh1.    89.645405: igc_tsync_interrupt:
TX TS delayed

|          ptp4l-745     [005] .....    89.655483: __sock_release: packet_release+0x0/0x490 sock ffff8ba689e28680, file is NULL
|          ptp4l-745     [005] .....    89.655484: packet_release: sock ffff8ba689e28680 sk ffff8ba6822f8800
|          ptp4l-745     [005] .....    89.669111: packet_release: gone ffff8ba689e28680

ptp4l closed socket, sock->sk = sk->sk_socket = NULL.

|    kworker/0:1-11      [000] .....    99.769061: delayed_ts:
The worked kicked in

|    kworker/0:1-11      [000] d..1.    99.769068: __skb_tstamp_tx: sk ffff8ba6822f8800 socket 0000000000000000
the NULL socket.

|    kworker/0:1-11      [000] d..1.    99.769070: __skb_complete_tx_timestamp: added to ffff8ba6822f8800
|    ksoftirqd/0-14      [000] ..s..    99.769081: __sk_destruct: sk ffff8ba6822f8800
|    ksoftirqd/0-14      [000] ..s..    99.769082: packet_sock_destruct: sk ffff8ba6822f8800

This is the RCU free of sock after the skb was released.

Does this make sense?

> Thanks,
> Jason

Sebastian

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH net] net: Drop the lock in skb_may_tx_timestamp()
  2026-02-17  8:01 ` Eric Dumazet
@ 2026-02-17 14:45   ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 12+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-02-17 14:45 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, David S. Miller, Jakub Kicinski, Kurt Kanzenbach,
	Paolo Abeni, Simon Horman, Willem de Bruijn

On 2026-02-17 09:01:53 [+0100], Eric Dumazet wrote:
> I think this should work, but can you also add this in your next version:

Will do, thanks.

> Thanks !

Sebastian

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2026-02-17 14:45 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-02-14 23:24 [PATCH net] net: Drop the lock in skb_may_tx_timestamp() Sebastian Andrzej Siewior
2026-02-15  4:05 ` Willem de Bruijn
2026-02-15 18:46   ` Sebastian Andrzej Siewior
2026-02-16 16:47     ` Vadim Fedorenko
2026-02-16 18:08       ` Willem de Bruijn
2026-02-17 13:28         ` Sebastian Andrzej Siewior
2026-02-17 14:14           ` Willem de Bruijn
2026-02-17 12:13       ` Sebastian Andrzej Siewior
2026-02-17  7:48     ` Jason Xing
2026-02-17 14:44       ` Sebastian Andrzej Siewior
2026-02-17  8:01 ` Eric Dumazet
2026-02-17 14:45   ` Sebastian Andrzej Siewior

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox