Netdev List
 help / color / mirror / Atom feed
* [PATCH] net: skmsg: pin the delayed-work psock in sk_psock_backlog
@ 2026-05-15  5:04 Zhang Cen
  2026-05-15  6:09 ` Jiayuan Chen
  0 siblings, 1 reply; 5+ messages in thread
From: Zhang Cen @ 2026-05-15  5:04 UTC (permalink / raw)
  To: John Fastabend, Jakub Sitnicki, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman
  Cc: netdev, bpf, linux-kernel, zerocling0077, 2045gemini, Zhang Cen

sk_psock_backlog() recovers the psock it operates on from the delayed
work item, but it takes its lifetime reference with
sk_psock_get(psock->sk).
That reloads sk->sk_user_data and can therefore return a replacement
psock after the old psock was detached and a new one was attached to
the same socket.

In that case the worker locks and drains the old psock, but the
reference it acquired belongs to the replacement psock. The exit path
then puts the detached old psock, which can underflow its refcount
after the last unlink while the replacement psock keeps the leaked
reference.

Take the reference on the delayed-work psock directly with
refcount_inc_not_zero(). If that fails, the old psock is already being
dropped, so skip the detached backlog instead of processing or putting
it. This keeps the worker's get/put pair on the same psock whose
work_state, ingress queue and state bits it manipulates.

The buggy scenario involves two paths, with each column showing the
order within that path:

path A label: detach and reattach path   path B label: old backlog worker
1. The last unlink drops the old         1. Delayed work resumes from the
   psock into sk_psock_drop().              old psock embedded in work.
2. sk_psock_drop() clears               2. The worker still sees
   sk->sk_user_data before the old         SK_PSOCK_TX_ENABLED on that
   TX state is cleared.                    old psock.
3. A new attach publishes a             3. sk_psock_get(psock->sk)
   replacement psock on the same           reloads sk->sk_user_data and
   socket.                                 refs the replacement psock.
4. The old psock is still queued for    4. The worker locks, drains and
   delayed backlog work.                   finally puts the detached old
                                           psock.

Sanitizer validation reported:
Non-fatal target warning: refcount_t underflow/use-after-free warning from refcount_warn_saturate triggered by sk_psock_backlog putting the detached old psock after last_old_ref_before_put reached 0.
use-after-free

Signed-off-by: Zhang Cen <rollkingzzc@gmail.com>

---
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -684,12 +684,12 @@ static void sk_psock_backlog(struct work_struct *work)
 	if (!sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED))
 		return;
 
-	/* Increment the psock refcnt to synchronize with close(fd) path in
-	 * sock_map_close(), ensuring we wait for backlog thread completion
-	 * before sk_socket freed. If refcnt increment fails, it indicates
-	 * sock_map_close() completed with sk_socket potentially already freed.
+	/* Hold the delayed-work psock itself so teardown synchronizes with
+	 * the same object whose work_state, queues and state bits we touch.
+	 * If the refcnt is already zero, this psock is being dropped and its
+	 * detached backlog must no longer run.
 	 */
-	if (!sk_psock_get(psock->sk))
+	if (!refcount_inc_not_zero(&psock->refcnt))
 		return;
 	mutex_lock(&psock->work_mutex);
 	while ((skb = skb_peek(&psock->ingress_skb))) {

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] net: skmsg: pin the delayed-work psock in sk_psock_backlog
  2026-05-15  5:04 [PATCH] net: skmsg: pin the delayed-work psock in sk_psock_backlog Zhang Cen
@ 2026-05-15  6:09 ` Jiayuan Chen
  2026-05-15  8:12   ` Cen Zhang
  0 siblings, 1 reply; 5+ messages in thread
From: Jiayuan Chen @ 2026-05-15  6:09 UTC (permalink / raw)
  To: Zhang Cen, John Fastabend, Jakub Sitnicki, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman
  Cc: netdev, bpf, linux-kernel, zerocling0077, 2045gemini


On 5/15/26 1:04 PM, Zhang Cen wrote:
> sk_psock_backlog() recovers the psock it operates on from the delayed
> work item, but it takes its lifetime reference with
> sk_psock_get(psock->sk).
> That reloads sk->sk_user_data and can therefore return a replacement
> psock after the old psock was detached and a new one was attached to
> the same socket.
>
> In that case the worker locks and drains the old psock, but the
> reference it acquired belongs to the replacement psock. The exit path
> then puts the detached old psock, which can underflow its refcount
> after the last unlink while the replacement psock keeps the leaked
> reference.
>
> Take the reference on the delayed-work psock directly with
> refcount_inc_not_zero(). If that fails, the old psock is already being
> dropped, so skip the detached backlog instead of processing or putting
> it. This keeps the worker's get/put pair on the same psock whose
> work_state, ingress queue and state bits it manipulates.
>
> The buggy scenario involves two paths, with each column showing the
> order within that path:
>
> path A label: detach and reattach path   path B label: old backlog worker
> 1. The last unlink drops the old         1. Delayed work resumes from the
>     psock into sk_psock_drop().              old psock embedded in work.
> 2. sk_psock_drop() clears               2. The worker still sees
>     sk->sk_user_data before the old         SK_PSOCK_TX_ENABLED on that
>     TX state is cleared.                    old psock.
> 3. A new attach publishes a             3. sk_psock_get(psock->sk)
>     replacement psock on the same           reloads sk->sk_user_data and
>     socket.                                 refs the replacement psock.
> 4. The old psock is still queued for    4. The worker locks, drains and
>     delayed backlog work.                   finally puts the detached old
>                                             psock.
>
> Sanitizer validation reported:
> Non-fatal target warning: refcount_t underflow/use-after-free warning from refcount_warn_saturate triggered by sk_psock_backlog putting the detached old psock after last_old_ref_before_put reached 0.

Where is the 'last_old_ref_before_put' symbol from? I can't find it 
anywhere in the tree.

If you are using LLMs to dig into races like this, please also have them 
produce a reproducer, e.g. patch mdelay() into

the relevant windows to widen them, then trigger it from userspace.



^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] net: skmsg: pin the delayed-work psock in sk_psock_backlog
  2026-05-15  6:09 ` Jiayuan Chen
@ 2026-05-15  8:12   ` Cen Zhang
  2026-05-15  8:26     ` Jiayuan Chen
  0 siblings, 1 reply; 5+ messages in thread
From: Cen Zhang @ 2026-05-15  8:12 UTC (permalink / raw)
  To: Jiayuan Chen
  Cc: John Fastabend, Jakub Sitnicki, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, netdev, bpf,
	linux-kernel, zerocling0077, 2045gemini

Dear Jiayuan Chen

Jiayuan Chen <jiayuan.chen@linux.dev> 于2026年5月15日周五 14:10写道:
> Where is the 'last_old_ref_before_put' symbol from? I can't find it
> anywhere in the tree.
>
> If you are using LLMs to dig into races like this, please also have them
> produce a reproducer, e.g. patch mdelay() into
>
> the relevant windows to widen them, then trigger it from userspace.
>
>

Hi Jiayuan,

Thanks for checking this. You are right: last_old_ref_before_put is
not an in-tree kernel symbol. It was a temporary validation probe
label which recorded the old psock refcount immediately before the
backlog worker's final put, and it should not have appeared in the
commit message as if it were kernel output.

The in-tree path I was trying to describe is:

  sk_psock_backlog() starts at net/core/skmsg.c:670.
  get path: sk_psock_get(psock->sk), net/core/skmsg.c:692.
  put path: sk_psock_put(psock->sk, psock), net/core/skmsg.c:746.
  detach clears sk_user_data at net/core/skmsg.c:892.
  reattach publishes a replacement psock at net/core/skmsg.c:793.
  warning path: REFCOUNT_SUB_UAF at lib/refcount.c:28.

The trigger was based on the in-tree sockmap_redir BPF selftest
under tools/testing/selftests/bpf/prog_tests/.
The one-shot test used AF_UNIX SOCK_STREAM socket pairs, attached
the sk_skb verdict program to the input map, inserted one socket
into the input map and one destination socket into the sockmap at
key 0, then sent one byte through the input peer so the destination
psock backlog worker was queued.
For validation I used a temporary local instrumentation patch in
net/core/skmsg.c. It added a debugfs-controlled gate in
sk_psock_backlog() after the TX-enabled check and before the
existing sk_psock_get(psock->sk) call, plus counters and pr_info()
snapshots in sk_psock_backlog(), sk_psock_init() and
sk_psock_drop(). It also stored the pointer returned by
sk_psock_get(psock->sk) for logging. The worker still used the
existing get path and the existing sk_psock_put(psock->sk, psock)
exit path.
With the worker parked before sk_psock_get(psock->sk), the test
forked: the child deleted the destination sockmap entry, and the
parent retried BPF_NOEXIST update of the same key with the same
destination socket fd until reattach succeeded.
After the delete completed, the test released the old worker. At
that point sk->sk_user_data referred to the replacement psock, while
the delayed work still belonged to the old psock. The recorded state
before the warning had the sk_user_data psock and the psock returned
by sk_psock_get(psock->sk) equal to each other, but different from
the delayed-work container. The instrumentation was only used to make
that interleaving deterministic and observable. The warning below is
the kernel's normal refcount warning path.

The native kernel report from that run was:

  refcount_t: underflow; use-after-free.
  WARNING: lib/refcount.c:28 at refcount_warn_saturate+0xbf/0xf0
  Workqueue: events sk_psock_backlog
  RIP: 0010:refcount_warn_saturate+0xbf/0xf0
  Call trace:
    sk_psock_backlog() (net/core/skmsg.c:670)
    process_one_work() (kernel/workqueue.c:3200)

So the reproducer is instrumentation-assisted, not an
unmodified upstream selftest. The instrumentation can widen the
race window and record the participating psock pointers, but it
does not publish a replacement psock, clear sk->sk_user_data, or
add an extra put on the old psock. The final warning is reached
through the existing sk_psock_put(psock->sk, psock) path after
the test has forced delete-plus-reattach to happen before the
parked worker resumes.

I will send v2 as a new thread after the netdev 24-hour
interval, with the lab probe label removed from the commit text.
If useful, I can also share the small instrumentation/selftest
diff separately to show the exact widened window.

Thanks,
Zhang Cen

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] net: skmsg: pin the delayed-work psock in sk_psock_backlog
  2026-05-15  8:12   ` Cen Zhang
@ 2026-05-15  8:26     ` Jiayuan Chen
  2026-05-15  8:54       ` Jiayuan Chen
  0 siblings, 1 reply; 5+ messages in thread
From: Jiayuan Chen @ 2026-05-15  8:26 UTC (permalink / raw)
  To: Cen Zhang
  Cc: John Fastabend, Jakub Sitnicki, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, netdev, bpf,
	linux-kernel, zerocling0077, 2045gemini


On 5/15/26 4:12 PM, Cen Zhang wrote:
> Dear Jiayuan Chen
>
> Jiayuan Chen <jiayuan.chen@linux.dev> 于2026年5月15日周五 14:10写道:
>> Where is the 'last_old_ref_before_put' symbol from? I can't find it
>> anywhere in the tree.
>>
>> If you are using LLMs to dig into races like this, please also have them
>> produce a reproducer, e.g. patch mdelay() into
>>
>> the relevant windows to widen them, then trigger it from userspace.
>>
>>
> Hi Jiayuan,
>
> Thanks for checking this. You are right: last_old_ref_before_put is
> not an in-tree kernel symbol. It was a temporary validation probe
> label which recorded the old psock refcount immediately before the
> backlog worker's final put, and it should not have appeared in the
> commit message as if it were kernel output.
>
> The in-tree path I was trying to describe is:
>
>    sk_psock_backlog() starts at net/core/skmsg.c:670.
>    get path: sk_psock_get(psock->sk), net/core/skmsg.c:692.
>    put path: sk_psock_put(psock->sk, psock), net/core/skmsg.c:746.
>    detach clears sk_user_data at net/core/skmsg.c:892.
>    reattach publishes a replacement psock at net/core/skmsg.c:793.
>    warning path: REFCOUNT_SUB_UAF at lib/refcount.c:28.
>
> The trigger was based on the in-tree sockmap_redir BPF selftest
> under tools/testing/selftests/bpf/prog_tests/.
> The one-shot test used AF_UNIX SOCK_STREAM socket pairs, attached
> the sk_skb verdict program to the input map, inserted one socket
> into the input map and one destination socket into the sockmap at
> key 0, then sent one byte through the input peer so the destination
> psock backlog worker was queued.
> For validation I used a temporary local instrumentation patch in
> net/core/skmsg.c. It added a debugfs-controlled gate in
> sk_psock_backlog() after the TX-enabled check and before the
> existing sk_psock_get(psock->sk) call, plus counters and pr_info()
> snapshots in sk_psock_backlog(), sk_psock_init() and
> sk_psock_drop(). It also stored the pointer returned by
> sk_psock_get(psock->sk) for logging. The worker still used the
> existing get path and the existing sk_psock_put(psock->sk, psock)
> exit path.
> With the worker parked before sk_psock_get(psock->sk), the test
> forked: the child deleted the destination sockmap entry, and the
> parent retried BPF_NOEXIST update of the same key with the same
> destination socket fd until reattach succeeded.
> After the delete completed, the test released the old worker. At
> that point sk->sk_user_data referred to the replacement psock, while

So, should the fix swap the order of sk->sk_user_data = null and 
sk_psock_clear_state(psock, SK_PSOCK_TX_ENABLED)?


> the delayed work still belonged to the old psock. The recorded state
> before the warning had the sk_user_data psock and the psock returned
> by sk_psock_get(psock->sk) equal to each other, but different from
> the delayed-work container. The instrumentation was only used to make
> that interleaving deterministic and observable. The warning below is
> the kernel's normal refcount warning path.
>
> The native kernel report from that run was:
>
>    refcount_t: underflow; use-after-free.
>    WARNING: lib/refcount.c:28 at refcount_warn_saturate+0xbf/0xf0
>    Workqueue: events sk_psock_backlog
>    RIP: 0010:refcount_warn_saturate+0xbf/0xf0
>    Call trace:
>      sk_psock_backlog() (net/core/skmsg.c:670)
>      process_one_work() (kernel/workqueue.c:3200)
>
> So the reproducer is instrumentation-assisted, not an
> unmodified upstream selftest. The instrumentation can widen the
> race window and record the participating psock pointers, but it
> does not publish a replacement psock, clear sk->sk_user_data, or
> add an extra put on the old psock. The final warning is reached
> through the existing sk_psock_put(psock->sk, psock) path after
> the test has forced delete-plus-reattach to happen before the
> parked worker resumes.
>
> I will send v2 as a new thread after the netdev 24-hour
> interval, with the lab probe label removed from the commit text.
> If useful, I can also share the small instrumentation/selftest
> diff separately to show the exact widened window.

You can just put the kernel patch and userspace program patch in this 
thread (no need to send a new patch).

Also this patch should be targeted to bpf not net.

--
pw-bot: cr




^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] net: skmsg: pin the delayed-work psock in sk_psock_backlog
  2026-05-15  8:26     ` Jiayuan Chen
@ 2026-05-15  8:54       ` Jiayuan Chen
  0 siblings, 0 replies; 5+ messages in thread
From: Jiayuan Chen @ 2026-05-15  8:54 UTC (permalink / raw)
  To: Cen Zhang, bpf@vger.kernel.org
  Cc: John Fastabend, Jakub Sitnicki, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, netdev, bpf,
	linux-kernel, zerocling0077, 2045gemini


On 5/15/26 4:26 PM, Jiayuan Chen wrote:
>
> On 5/15/26 4:12 PM, Cen Zhang wrote:
>> Dear Jiayuan Chen
>>
>> Jiayuan Chen <jiayuan.chen@linux.dev> 于2026年5月15日周五 14:10写道:
>>> Where is the 'last_old_ref_before_put' symbol from? I can't find it
>>> anywhere in the tree.
>>>
>>> If you are using LLMs to dig into races like this, please also have 
>>> them
>>> produce a reproducer, e.g. patch mdelay() into
>>>
>>> the relevant windows to widen them, then trigger it from userspace.
>>>
>>>
>> Hi Jiayuan,
>>
>> Thanks for checking this. You are right: last_old_ref_before_put is
>> not an in-tree kernel symbol. It was a temporary validation probe
>> label which recorded the old psock refcount immediately before the
>> backlog worker's final put, and it should not have appeared in the
>> commit message as if it were kernel output.
>>
>> The in-tree path I was trying to describe is:
>>
>>    sk_psock_backlog() starts at net/core/skmsg.c:670.
>>    get path: sk_psock_get(psock->sk), net/core/skmsg.c:692.
>>    put path: sk_psock_put(psock->sk, psock), net/core/skmsg.c:746.
>>    detach clears sk_user_data at net/core/skmsg.c:892.
>>    reattach publishes a replacement psock at net/core/skmsg.c:793.
>>    warning path: REFCOUNT_SUB_UAF at lib/refcount.c:28.
>>
>> The trigger was based on the in-tree sockmap_redir BPF selftest
>> under tools/testing/selftests/bpf/prog_tests/.
>> The one-shot test used AF_UNIX SOCK_STREAM socket pairs, attached
>> the sk_skb verdict program to the input map, inserted one socket
>> into the input map and one destination socket into the sockmap at
>> key 0, then sent one byte through the input peer so the destination
>> psock backlog worker was queued.
>> For validation I used a temporary local instrumentation patch in
>> net/core/skmsg.c. It added a debugfs-controlled gate in
>> sk_psock_backlog() after the TX-enabled check and before the
>> existing sk_psock_get(psock->sk) call, plus counters and pr_info()
>> snapshots in sk_psock_backlog(), sk_psock_init() and
>> sk_psock_drop(). It also stored the pointer returned by
>> sk_psock_get(psock->sk) for logging. The worker still used the
>> existing get path and the existing sk_psock_put(psock->sk, psock)
>> exit path.
>> With the worker parked before sk_psock_get(psock->sk), the test
>> forked: the child deleted the destination sockmap entry, and the
>> parent retried BPF_NOEXIST update of the same key with the same
>> destination socket fd until reattach succeeded.
>> After the delete completed, the test released the old worker. At
>> that point sk->sk_user_data referred to the replacement psock, while
>
> So, should the fix swap the order of sk->sk_user_data = null and 
> sk_psock_clear_state(psock, SK_PSOCK_TX_ENABLED)?
>
>

Please also carry a Fixes tag.


[...]
>>
>> I will send v2 as a new thread after the netdev 24-hour
>> interval, with the lab probe label removed from the commit text.
>> If useful, I can also share the small instrumentation/selftest
>> diff separately to show the exact widened window.
>
> You can just put the kernel patch and userspace program patch in this 
> thread (no need to send a new patch).
>
> Also this patch should be targeted to bpf not net.
>
> -- 
> pw-bot: cr
>
>
>

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2026-05-15  8:54 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-15  5:04 [PATCH] net: skmsg: pin the delayed-work psock in sk_psock_backlog Zhang Cen
2026-05-15  6:09 ` Jiayuan Chen
2026-05-15  8:12   ` Cen Zhang
2026-05-15  8:26     ` Jiayuan Chen
2026-05-15  8:54       ` Jiayuan Chen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox