* [PATCH v2 net] af_unix: Update unix_sk(sk)->oob_skb under sk_receive_queue lock.
@ 2024-05-10 9:39 Kuniyuki Iwashima
2024-05-10 10:44 ` Paolo Abeni
2024-05-12 14:47 ` Michal Luczaj
0 siblings, 2 replies; 12+ messages in thread
From: Kuniyuki Iwashima @ 2024-05-10 9:39 UTC (permalink / raw)
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev,
Billy Jheng Bing-Jhong
Billy Jheng Bing-Jhong reported a race between __unix_gc() and
queue_oob().
__unix_gc() tries to garbage-collect close()d inflight sockets,
and if such a socket has a MSG_OOB skb in unix_sk(sk)->oob_skb,
GC drops the reference and sets it to NULL locklessly.
However, the peer socket can still send a MSG_OOB message, and
queue_oob() can update unix_sk(sk)->oob_skb concurrently, leading
to a NULL pointer dereference. [0]
To fix the issue, let's update unix_sk(sk)->oob_skb under the
sk_receive_queue's lock and take it everywhere we touch oob_skb.
Note that the same issue exists in the new GC, and the change
in queue_oob() can be applied as is.
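The pattern of the fix can be illustrated with a user-space analogy (a hypothetical Python sketch, not kernel code; OobSlot, gc_clear() and the string "skbN" values are made up for illustration): both the GC path and the sender path update the shared oob slot only while holding the same lock, so the check/free/update sequence can no longer interleave.

```python
import threading

class OobSlot:
    """Stand-in for a unix socket's oob_skb slot."""
    def __init__(self):
        self.lock = threading.Lock()   # analogy for sk_receive_queue.lock
        self.oob = None                # analogy for unix_sk(sk)->oob_skb

    def gc_clear(self):
        # __unix_gc() path: drop the reference and clear the slot,
        # now done under the queue lock instead of locklessly
        with self.lock:
            self.oob = None

    def queue_oob(self, skb):
        # queue_oob() path: replace any previous OOB entry under the
        # same lock, so it cannot race with gc_clear()
        with self.lock:
            old, self.oob = self.oob, skb
            return old

slot = OobSlot()
threads = [threading.Thread(target=slot.gc_clear) for _ in range(4)]
threads += [threading.Thread(target=slot.queue_oob, args=("skb%d" % i,))
            for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Whatever the interleaving, the slot holds either nothing or exactly
# one of the queued entries -- never a dangling/freed value.
assert slot.oob is None or slot.oob in {"skb0", "skb1", "skb2", "skb3"}
```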
[0]:
BUG: kernel NULL pointer dereference, address: 0000000000000008
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
PGD 8000000009f5e067 P4D 8000000009f5e067 PUD 9f5d067 PMD 0
Oops: 0002 [#1] PREEMPT SMP PTI
CPU: 3 PID: 50 Comm: kworker/3:1 Not tainted 6.9.0-rc5-00191-gd091e579b864 #110
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
Workqueue: events delayed_fput
RIP: 0010:skb_dequeue (./include/linux/skbuff.h:2386 ./include/linux/skbuff.h:2402 net/core/skbuff.c:3847)
Code: 39 e3 74 3e 8b 43 10 48 89 ef 83 e8 01 89 43 10 49 8b 44 24 08 49 c7 44 24 08 00 00 00 00 49 8b 14 24 49 c7 04 24 00 00 00 00 <48> 89 42 08 48 89 10 e8 e7 c5 42 00 4c 89 e0 5b 5d 41 5c c3 cc cc
RSP: 0018:ffffc900001bfd48 EFLAGS: 00000002
RAX: 0000000000000000 RBX: ffff8880088f5ae8 RCX: 00000000361289f9
RDX: 0000000000000000 RSI: 0000000000000206 RDI: ffff8880088f5b00
RBP: ffff8880088f5b00 R08: 0000000000080000 R09: 0000000000000001
R10: 0000000000000003 R11: 0000000000000001 R12: ffff8880056b6a00
R13: ffff8880088f5280 R14: 0000000000000001 R15: ffff8880088f5a80
FS: 0000000000000000(0000) GS:ffff88807dd80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000008 CR3: 0000000006314000 CR4: 00000000007506f0
PKRU: 55555554
Call Trace:
<TASK>
unix_release_sock (net/unix/af_unix.c:654)
unix_release (net/unix/af_unix.c:1050)
__sock_release (net/socket.c:660)
sock_close (net/socket.c:1423)
__fput (fs/file_table.c:423)
delayed_fput (fs/file_table.c:444 (discriminator 3))
process_one_work (kernel/workqueue.c:3259)
worker_thread (kernel/workqueue.c:3329 kernel/workqueue.c:3416)
kthread (kernel/kthread.c:388)
ret_from_fork (arch/x86/kernel/process.c:153)
ret_from_fork_asm (arch/x86/entry/entry_64.S:257)
</TASK>
Modules linked in:
CR2: 0000000000000008
Fixes: 1279f9d9dec2 ("af_unix: Call kfree_skb() for dead unix_(sk)->oob_skb in GC.")
Reported-by: Billy Jheng Bing-Jhong <billy@starlabs.sg>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
v2:
* Add recvq locking everywhere we touch oob_skb (Paolo)
v1: https://lore.kernel.org/netdev/20240507170018.83385-1-kuniyu@amazon.com/
---
net/unix/af_unix.c | 18 ++++++++++++++----
net/unix/garbage.c | 4 +++-
2 files changed, 17 insertions(+), 5 deletions(-)
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 9a6ad5974dff..c555464cf1fb 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -2217,13 +2217,15 @@ static int queue_oob(struct socket *sock, struct msghdr *msg, struct sock *other
maybe_add_creds(skb, sock, other);
skb_get(skb);
+ scm_stat_add(other, skb);
+
+ spin_lock(&other->sk_receive_queue.lock);
if (ousk->oob_skb)
consume_skb(ousk->oob_skb);
-
WRITE_ONCE(ousk->oob_skb, skb);
+ __skb_queue_tail(&other->sk_receive_queue, skb);
+ spin_unlock(&other->sk_receive_queue.lock);
- scm_stat_add(other, skb);
- skb_queue_tail(&other->sk_receive_queue, skb);
sk_send_sigurg(other);
unix_state_unlock(other);
other->sk_data_ready(other);
@@ -2614,8 +2616,10 @@ static int unix_stream_recv_urg(struct unix_stream_read_state *state)
mutex_lock(&u->iolock);
unix_state_lock(sk);
+ spin_lock(&sk->sk_receive_queue.lock);
if (sock_flag(sk, SOCK_URGINLINE) || !u->oob_skb) {
+ spin_unlock(&sk->sk_receive_queue.lock);
unix_state_unlock(sk);
mutex_unlock(&u->iolock);
return -EINVAL;
@@ -2627,6 +2631,8 @@ static int unix_stream_recv_urg(struct unix_stream_read_state *state)
WRITE_ONCE(u->oob_skb, NULL);
else
skb_get(oob_skb);
+
+ spin_unlock(&sk->sk_receive_queue.lock);
unix_state_unlock(sk);
chunk = state->recv_actor(oob_skb, 0, chunk, state);
@@ -2655,6 +2661,8 @@ static struct sk_buff *manage_oob(struct sk_buff *skb, struct sock *sk,
consume_skb(skb);
skb = NULL;
} else {
+ spin_lock(&sk->sk_receive_queue.lock);
+
if (skb == u->oob_skb) {
if (copied) {
skb = NULL;
@@ -2666,13 +2674,15 @@ static struct sk_buff *manage_oob(struct sk_buff *skb, struct sock *sk,
} else if (flags & MSG_PEEK) {
skb = NULL;
} else {
- skb_unlink(skb, &sk->sk_receive_queue);
+ __skb_unlink(skb, &sk->sk_receive_queue);
WRITE_ONCE(u->oob_skb, NULL);
if (!WARN_ON_ONCE(skb_unref(skb)))
kfree_skb(skb);
skb = skb_peek(&sk->sk_receive_queue);
}
}
+
+ spin_unlock(&sk->sk_receive_queue.lock);
}
return skb;
}
diff --git a/net/unix/garbage.c b/net/unix/garbage.c
index 0104be9d4704..b87e48e2b51b 100644
--- a/net/unix/garbage.c
+++ b/net/unix/garbage.c
@@ -342,10 +342,12 @@ static void __unix_gc(struct work_struct *work)
scan_children(&u->sk, inc_inflight, &hitlist);
#if IS_ENABLED(CONFIG_AF_UNIX_OOB)
+ spin_lock(&u->sk.sk_receive_queue.lock);
if (u->oob_skb) {
- kfree_skb(u->oob_skb);
+ WARN_ON_ONCE(skb_unref(u->oob_skb));
u->oob_skb = NULL;
}
+ spin_unlock(&u->sk.sk_receive_queue.lock);
#endif
}
--
2.30.2
^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH v2 net] af_unix: Update unix_sk(sk)->oob_skb under sk_receive_queue lock.
2024-05-10 9:39 [PATCH v2 net] af_unix: Update unix_sk(sk)->oob_skb under sk_receive_queue lock Kuniyuki Iwashima
@ 2024-05-10 10:44 ` Paolo Abeni
2024-05-10 10:54 ` Kuniyuki Iwashima
2024-05-12 14:47 ` Michal Luczaj
1 sibling, 1 reply; 12+ messages in thread
From: Paolo Abeni @ 2024-05-10 10:44 UTC (permalink / raw)
To: Kuniyuki Iwashima, David S. Miller, Eric Dumazet, Jakub Kicinski
Cc: Kuniyuki Iwashima, netdev, Billy Jheng Bing-Jhong
On Fri, 2024-05-10 at 18:39 +0900, Kuniyuki Iwashima wrote:
> diff --git a/net/unix/garbage.c b/net/unix/garbage.c
> index 0104be9d4704..b87e48e2b51b 100644
> --- a/net/unix/garbage.c
> +++ b/net/unix/garbage.c
> @@ -342,10 +342,12 @@ static void __unix_gc(struct work_struct *work)
> scan_children(&u->sk, inc_inflight, &hitlist);
>
> #if IS_ENABLED(CONFIG_AF_UNIX_OOB)
> + spin_lock(&u->sk.sk_receive_queue.lock);
> if (u->oob_skb) {
> - kfree_skb(u->oob_skb);
> + WARN_ON_ONCE(skb_unref(u->oob_skb));
Sorry for not asking this first, but it's not clear to me why the above
change (just the 'WARN_ON_ONCE' introduction) is needed and whether it's
related to the addressed issue?
Thanks!
Paolo
* Re: [PATCH v2 net] af_unix: Update unix_sk(sk)->oob_skb under sk_receive_queue lock.
2024-05-10 10:44 ` Paolo Abeni
@ 2024-05-10 10:54 ` Kuniyuki Iwashima
2024-05-10 11:11 ` Paolo Abeni
0 siblings, 1 reply; 12+ messages in thread
From: Kuniyuki Iwashima @ 2024-05-10 10:54 UTC (permalink / raw)
To: pabeni; +Cc: billy, davem, edumazet, kuba, kuni1840, kuniyu, netdev
From: Paolo Abeni <pabeni@redhat.com>
Date: Fri, 10 May 2024 12:44:58 +0200
> On Fri, 2024-05-10 at 18:39 +0900, Kuniyuki Iwashima wrote:
> > diff --git a/net/unix/garbage.c b/net/unix/garbage.c
> > index 0104be9d4704..b87e48e2b51b 100644
> > --- a/net/unix/garbage.c
> > +++ b/net/unix/garbage.c
> > @@ -342,10 +342,12 @@ static void __unix_gc(struct work_struct *work)
> > scan_children(&u->sk, inc_inflight, &hitlist);
> >
> > #if IS_ENABLED(CONFIG_AF_UNIX_OOB)
> > + spin_lock(&u->sk.sk_receive_queue.lock);
> > if (u->oob_skb) {
> > - kfree_skb(u->oob_skb);
> > + WARN_ON_ONCE(skb_unref(u->oob_skb));
>
> Sorry for not asking this first, but it's not clear to me why the above
> change (just the 'WARN_ON_ONCE' introduction) is needed and if it's
> related to the addressed issue???
I think I added it to make it clear that we don't actually free the skb
here, and to be consistent with manage_oob().
But I don't have a strong preference, as it will be removed soon.
* Re: [PATCH v2 net] af_unix: Update unix_sk(sk)->oob_skb under sk_receive_queue lock.
2024-05-10 10:54 ` Kuniyuki Iwashima
@ 2024-05-10 11:11 ` Paolo Abeni
0 siblings, 0 replies; 12+ messages in thread
From: Paolo Abeni @ 2024-05-10 11:11 UTC (permalink / raw)
To: Kuniyuki Iwashima; +Cc: billy, davem, edumazet, kuba, kuni1840, netdev
On Fri, 2024-05-10 at 19:54 +0900, Kuniyuki Iwashima wrote:
> From: Paolo Abeni <pabeni@redhat.com>
> Date: Fri, 10 May 2024 12:44:58 +0200
> > On Fri, 2024-05-10 at 18:39 +0900, Kuniyuki Iwashima wrote:
> > > diff --git a/net/unix/garbage.c b/net/unix/garbage.c
> > > index 0104be9d4704..b87e48e2b51b 100644
> > > --- a/net/unix/garbage.c
> > > +++ b/net/unix/garbage.c
> > > @@ -342,10 +342,12 @@ static void __unix_gc(struct work_struct *work)
> > > scan_children(&u->sk, inc_inflight, &hitlist);
> > >
> > > #if IS_ENABLED(CONFIG_AF_UNIX_OOB)
> > > + spin_lock(&u->sk.sk_receive_queue.lock);
> > > if (u->oob_skb) {
> > > - kfree_skb(u->oob_skb);
> > > + WARN_ON_ONCE(skb_unref(u->oob_skb));
> >
> > Sorry for not asking this first, but it's not clear to me why the above
> > change (just the 'WARN_ON_ONCE' introduction) is needed and if it's
> > related to the addressed issue???
>
> I think I added it to make it clear that here we don't actually free skb
> and consistent with manage_oob().
>
> But I don't have strong preference as it will be removed soon.
Ok, thanks for the explanation. I'm fine with the above.
Acked-by: Paolo Abeni <pabeni@redhat.com>
* Re: [PATCH v2 net] af_unix: Update unix_sk(sk)->oob_skb under sk_receive_queue lock.
2024-05-10 9:39 [PATCH v2 net] af_unix: Update unix_sk(sk)->oob_skb under sk_receive_queue lock Kuniyuki Iwashima
2024-05-10 10:44 ` Paolo Abeni
@ 2024-05-12 14:47 ` Michal Luczaj
2024-05-13 6:12 ` Kuniyuki Iwashima
1 sibling, 1 reply; 12+ messages in thread
From: Michal Luczaj @ 2024-05-12 14:47 UTC (permalink / raw)
To: Kuniyuki Iwashima, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: Kuniyuki Iwashima, netdev, Billy Jheng Bing-Jhong
On 5/10/24 11:39, Kuniyuki Iwashima wrote:
> @@ -2655,6 +2661,8 @@ static struct sk_buff *manage_oob(struct sk_buff *skb, struct sock *sk,
> consume_skb(skb);
> skb = NULL;
> } else {
> + spin_lock(&sk->sk_receive_queue.lock);
> +
> if (skb == u->oob_skb) {
> if (copied) {
> skb = NULL;
> @@ -2666,13 +2674,15 @@ static struct sk_buff *manage_oob(struct sk_buff *skb, struct sock *sk,
> } else if (flags & MSG_PEEK) {
> skb = NULL;
> } else {
> - skb_unlink(skb, &sk->sk_receive_queue);
> + __skb_unlink(skb, &sk->sk_receive_queue);
> WRITE_ONCE(u->oob_skb, NULL);
> if (!WARN_ON_ONCE(skb_unref(skb)))
> kfree_skb(skb);
> skb = skb_peek(&sk->sk_receive_queue);
> }
> }
> +
> + spin_unlock(&sk->sk_receive_queue.lock);
> }
> return skb;
> }
Now it is
spin_lock(&sk->sk_receive_queue.lock)
kfree_skb
unix_destruct_scm
unix_notinflight
spin_lock(&unix_gc_lock)
I.e. sk_receive_queue.lock -> unix_gc_lock, inversion of what unix_gc() does.
But that's benign, right?
thanks,
Michal
* Re: [PATCH v2 net] af_unix: Update unix_sk(sk)->oob_skb under sk_receive_queue lock.
2024-05-12 14:47 ` Michal Luczaj
@ 2024-05-13 6:12 ` Kuniyuki Iwashima
2024-05-13 6:40 ` Michal Luczaj
0 siblings, 1 reply; 12+ messages in thread
From: Kuniyuki Iwashima @ 2024-05-13 6:12 UTC (permalink / raw)
To: mhal; +Cc: billy, davem, edumazet, kuba, kuni1840, kuniyu, netdev, pabeni
From: Michal Luczaj <mhal@rbox.co>
Date: Sun, 12 May 2024 16:47:11 +0200
> On 5/10/24 11:39, Kuniyuki Iwashima wrote:
> > @@ -2655,6 +2661,8 @@ static struct sk_buff *manage_oob(struct sk_buff *skb, struct sock *sk,
> > consume_skb(skb);
> > skb = NULL;
> > } else {
> > + spin_lock(&sk->sk_receive_queue.lock);
> > +
> > if (skb == u->oob_skb) {
> > if (copied) {
> > skb = NULL;
> > @@ -2666,13 +2674,15 @@ static struct sk_buff *manage_oob(struct sk_buff *skb, struct sock *sk,
> > } else if (flags & MSG_PEEK) {
> > skb = NULL;
> > } else {
> > - skb_unlink(skb, &sk->sk_receive_queue);
> > + __skb_unlink(skb, &sk->sk_receive_queue);
> > WRITE_ONCE(u->oob_skb, NULL);
> > if (!WARN_ON_ONCE(skb_unref(skb)))
> > kfree_skb(skb);
> > skb = skb_peek(&sk->sk_receive_queue);
> > }
> > }
> > +
> > + spin_unlock(&sk->sk_receive_queue.lock);
> > }
> > return skb;
> > }
>
> Now it is
>
> spin_lock(&sk->sk_receive_queue.lock)
> kfree_skb
This does not actually free the skb; it just drops the refcount taken
by skb_get() in queue_oob().
> unix_destruct_scm
So, here we don't reach unix_destruct_scm().
That's why I changed kfree_skb() to skb_unref() in __unix_gc().
Thanks!
> unix_notinflight
> spin_lock(&unix_gc_lock)
>
> I.e. sk_receive_queue.lock -> unix_gc_lock, inversion of what unix_gc() does.
> But that's benign, right?
* Re: [PATCH v2 net] af_unix: Update unix_sk(sk)->oob_skb under sk_receive_queue lock.
2024-05-13 6:12 ` Kuniyuki Iwashima
@ 2024-05-13 6:40 ` Michal Luczaj
2024-05-13 7:44 ` Kuniyuki Iwashima
0 siblings, 1 reply; 12+ messages in thread
From: Michal Luczaj @ 2024-05-13 6:40 UTC (permalink / raw)
To: Kuniyuki Iwashima; +Cc: billy, davem, edumazet, kuba, kuni1840, netdev, pabeni
On 5/13/24 08:12, Kuniyuki Iwashima wrote:
> From: Michal Luczaj <mhal@rbox.co>
> Date: Sun, 12 May 2024 16:47:11 +0200
>> On 5/10/24 11:39, Kuniyuki Iwashima wrote:
>>> @@ -2655,6 +2661,8 @@ static struct sk_buff *manage_oob(struct sk_buff *skb, struct sock *sk,
>>> consume_skb(skb);
>>> skb = NULL;
>>> } else {
>>> + spin_lock(&sk->sk_receive_queue.lock);
>>> +
>>> if (skb == u->oob_skb) {
>>> if (copied) {
>>> skb = NULL;
>>> @@ -2666,13 +2674,15 @@ static struct sk_buff *manage_oob(struct sk_buff *skb, struct sock *sk,
>>> } else if (flags & MSG_PEEK) {
>>> skb = NULL;
>>> } else {
>>> - skb_unlink(skb, &sk->sk_receive_queue);
>>> + __skb_unlink(skb, &sk->sk_receive_queue);
>>> WRITE_ONCE(u->oob_skb, NULL);
>>> if (!WARN_ON_ONCE(skb_unref(skb)))
>>> kfree_skb(skb);
>>> skb = skb_peek(&sk->sk_receive_queue);
>>> }
>>> }
>>> +
>>> + spin_unlock(&sk->sk_receive_queue.lock);
>>> }
>>> return skb;
>>> }
>>
>> Now it is
>>
>> spin_lock(&sk->sk_receive_queue.lock)
>> kfree_skb
>
> This does not free skb actually and just drops a refcount by skb_get()
> in queue_oob().
I suspect you refer to the change in __unix_gc()
if (u->oob_skb) {
- kfree_skb(u->oob_skb);
+ WARN_ON_ONCE(skb_unref(u->oob_skb));
}
What I'm talking about is the (unchanged) part quoted above in manage_oob():
if (!WARN_ON_ONCE(skb_unref(skb)))
kfree_skb(skb);
I might be missing something, but
from array import array
from socket import *
a, b = socketpair(AF_UNIX, SOCK_STREAM)
scm = (SOL_SOCKET, SCM_RIGHTS, array("i", [b.fileno()]))
b.sendmsg([b'x'], [scm], MSG_OOB)
b.close()
a.recv(MSG_DONTWAIT)
[ 72.513125] ======================================================
[ 72.513148] WARNING: possible circular locking dependency detected
[ 72.513170] 6.9.0-rc7nokasan+ #25 Not tainted
[ 72.513193] ------------------------------------------------------
[ 72.513215] python/1054 is trying to acquire lock:
[ 72.513237] ffffffff83563898 (unix_gc_lock){+.+.}-{2:2}, at: unix_notinflight+0x23/0x100
[ 72.513266]
but task is already holding lock:
[ 72.513288] ffff88811eb10898 (rlock-AF_UNIX){+.+.}-{2:2}, at: unix_stream_read_generic+0x178/0xbc0
[ 72.513313]
which lock already depends on the new lock.
[ 72.513336]
the existing dependency chain (in reverse order) is:
[ 72.513358]
-> #1 (rlock-AF_UNIX){+.+.}-{2:2}:
[ 72.513381] _raw_spin_lock+0x2f/0x40
[ 72.513404] scan_inflight+0x36/0x1e0
[ 72.513428] __unix_gc+0x17c/0x4b0
[ 72.513450] process_one_work+0x217/0x700
[ 72.513474] worker_thread+0x1ca/0x3b0
[ 72.513497] kthread+0xdd/0x110
[ 72.513519] ret_from_fork+0x2d/0x50
[ 72.513543] ret_from_fork_asm+0x1a/0x30
[ 72.513565]
-> #0 (unix_gc_lock){+.+.}-{2:2}:
[ 72.513589] __lock_acquire+0x137b/0x20e0
[ 72.513612] lock_acquire+0xc5/0x2c0
[ 72.513635] _raw_spin_lock+0x2f/0x40
[ 72.513657] unix_notinflight+0x23/0x100
[ 72.513680] unix_destruct_scm+0x95/0xa0
[ 72.513702] skb_release_head_state+0x20/0x60
[ 72.513726] kfree_skb_reason+0x53/0x1e0
[ 72.513748] unix_stream_read_generic+0xb69/0xbc0
[ 72.513771] unix_stream_recvmsg+0x68/0x80
[ 72.513794] sock_recvmsg+0xb9/0xc0
[ 72.513817] __sys_recvfrom+0xa1/0x110
[ 72.513840] __x64_sys_recvfrom+0x20/0x30
[ 72.513862] do_syscall_64+0x93/0x190
[ 72.513886] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 72.513909]
other info that might help us debug this:
[ 72.513932] Possible unsafe locking scenario:
[ 72.513954] CPU0 CPU1
[ 72.513976] ---- ----
[ 72.513998] lock(rlock-AF_UNIX);
[ 72.514020] lock(unix_gc_lock);
[ 72.514043] lock(rlock-AF_UNIX);
[ 72.514066] lock(unix_gc_lock);
[ 72.514088]
*** DEADLOCK ***
[ 72.514110] 3 locks held by python/1054:
[ 72.514133] #0: ffff88811eb10cf0 (&u->iolock){+.+.}-{3:3}, at: unix_stream_read_generic+0xd4/0xbc0
[ 72.514158] #1: ffff88811eb10de0 (&u->lock){+.+.}-{2:2}, at: unix_stream_read_generic+0x110/0xbc0
[ 72.514184] #2: ffff88811eb10898 (rlock-AF_UNIX){+.+.}-{2:2}, at: unix_stream_read_generic+0x178/0xbc0
[ 72.514209]
stack backtrace:
[ 72.514231] CPU: 4 PID: 1054 Comm: python Not tainted 6.9.0-rc7nokasan+ #25
[ 72.514254] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
[ 72.514278] Call Trace:
[ 72.514300] <TASK>
[ 72.514321] dump_stack_lvl+0x73/0xb0
[ 72.514345] check_noncircular+0x108/0x120
[ 72.514369] __lock_acquire+0x137b/0x20e0
[ 72.514392] lock_acquire+0xc5/0x2c0
[ 72.514415] ? unix_notinflight+0x23/0x100
[ 72.514439] _raw_spin_lock+0x2f/0x40
[ 72.514461] ? unix_notinflight+0x23/0x100
[ 72.514484] unix_notinflight+0x23/0x100
[ 72.514507] unix_destruct_scm+0x95/0xa0
[ 72.514530] skb_release_head_state+0x20/0x60
[ 72.514553] kfree_skb_reason+0x53/0x1e0
[ 72.514575] unix_stream_read_generic+0xb69/0xbc0
[ 72.514600] unix_stream_recvmsg+0x68/0x80
[ 72.514623] ? __pfx_unix_stream_read_actor+0x10/0x10
[ 72.514646] sock_recvmsg+0xb9/0xc0
[ 72.514669] __sys_recvfrom+0xa1/0x110
[ 72.514692] ? lock_release+0x133/0x290
[ 72.514715] ? syscall_exit_to_user_mode+0x11/0x280
[ 72.514739] ? do_syscall_64+0xa0/0x190
[ 72.514762] __x64_sys_recvfrom+0x20/0x30
[ 72.514784] do_syscall_64+0x93/0x190
[ 72.514807] ? clear_bhb_loop+0x45/0xa0
[ 72.514829] ? clear_bhb_loop+0x45/0xa0
[ 72.514852] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 72.514875] RIP: 0033:0x7fc16bb3594d
[ 72.514899] Code: 02 02 00 00 00 5d c3 66 0f 1f 44 00 00 f3 0f 1e fa 80 3d 25 8a 0c 00 00 41 89 ca 74 20 45 31 c9 45 31 c0 b8 2d 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 6b c3 66 2e 0f 1f 84 00 00 00 00 00 55 48 89
[ 72.514925] RSP: 002b:00007fffae4ab2f8 EFLAGS: 00000246 ORIG_RAX: 000000000000002d
[ 72.514949] RAX: ffffffffffffffda RBX: 00007fffae4ab3c8 RCX: 00007fc16bb3594d
[ 72.514972] RDX: 0000000000000040 RSI: 00007fc15e298f30 RDI: 0000000000000003
[ 72.514994] RBP: 00007fffae4ab310 R08: 0000000000000000 R09: 0000000000000000
[ 72.515017] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fc15e34af90
[ 72.515040] R13: 0000000000000000 R14: ffffffffc4653600 R15: 0000000000000000
[ 72.515064] </TASK>
>> unix_destruct_scm
>
> So, here we don't reach unix_destruct_scm().
>
> That's why I changed kfree_skb() to skb_unref() in __unix_gc().
>
> Thanks!
>
>
>> unix_notinflight
>> spin_lock(&unix_gc_lock)
>>
>> I.e. sk_receive_queue.lock -> unix_gc_lock, inversion of what unix_gc() does.
>> But that's benign, right?
* Re: [PATCH v2 net] af_unix: Update unix_sk(sk)->oob_skb under sk_receive_queue lock.
2024-05-13 6:40 ` Michal Luczaj
@ 2024-05-13 7:44 ` Kuniyuki Iwashima
2024-05-13 9:14 ` Michal Luczaj
0 siblings, 1 reply; 12+ messages in thread
From: Kuniyuki Iwashima @ 2024-05-13 7:44 UTC (permalink / raw)
To: mhal; +Cc: billy, davem, edumazet, kuba, kuni1840, kuniyu, netdev, pabeni
From: Michal Luczaj <mhal@rbox.co>
Date: Mon, 13 May 2024 08:40:34 +0200
> On 5/13/24 08:12, Kuniyuki Iwashima wrote:
> > From: Michal Luczaj <mhal@rbox.co>
> > Date: Sun, 12 May 2024 16:47:11 +0200
> >> On 5/10/24 11:39, Kuniyuki Iwashima wrote:
> >>> @@ -2655,6 +2661,8 @@ static struct sk_buff *manage_oob(struct sk_buff *skb, struct sock *sk,
> >>> consume_skb(skb);
> >>> skb = NULL;
> >>> } else {
> >>> + spin_lock(&sk->sk_receive_queue.lock);
> >>> +
> >>> if (skb == u->oob_skb) {
> >>> if (copied) {
> >>> skb = NULL;
> >>> @@ -2666,13 +2674,15 @@ static struct sk_buff *manage_oob(struct sk_buff *skb, struct sock *sk,
> >>> } else if (flags & MSG_PEEK) {
> >>> skb = NULL;
> >>> } else {
> >>> - skb_unlink(skb, &sk->sk_receive_queue);
> >>> + __skb_unlink(skb, &sk->sk_receive_queue);
> >>> WRITE_ONCE(u->oob_skb, NULL);
> >>> if (!WARN_ON_ONCE(skb_unref(skb)))
> >>> kfree_skb(skb);
> >>> skb = skb_peek(&sk->sk_receive_queue);
> >>> }
> >>> }
> >>> +
> >>> + spin_unlock(&sk->sk_receive_queue.lock);
> >>> }
> >>> return skb;
> >>> }
> >>
> >> Now it is
> >>
> >> spin_lock(&sk->sk_receive_queue.lock)
> >> kfree_skb
> >
> > This does not free skb actually and just drops a refcount by skb_get()
> > in queue_oob().
>
> I suspect you refer to change in __unix_gc()
>
> if (u->oob_skb) {
> - kfree_skb(u->oob_skb);
> + WARN_ON_ONCE(skb_unref(u->oob_skb));
> }
>
> What I'm talking about is the quoted above (unchanged) part in manage_oob():
>
> if (!WARN_ON_ONCE(skb_unref(skb)))
> kfree_skb(skb);
Ah, I got your point, good catch!
Somehow I was thinking of the new GC, where an alive recvq is not
touched and lockdep would end up with a false positive.
We need to delay freeing oob_skb in that case, like below.
I'll respin v3 later, thanks!
---8<---
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index c555464cf1fb..35ca2be2c984 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -2661,6 +2661,8 @@ static struct sk_buff *manage_oob(struct sk_buff *skb, struct sock *sk,
consume_skb(skb);
skb = NULL;
} else {
+ struct sk_buff *unlinked_skb = NULL;
+
spin_lock(&sk->sk_receive_queue.lock);
if (skb == u->oob_skb) {
@@ -2676,13 +2678,18 @@ static struct sk_buff *manage_oob(struct sk_buff *skb, struct sock *sk,
} else {
__skb_unlink(skb, &sk->sk_receive_queue);
WRITE_ONCE(u->oob_skb, NULL);
- if (!WARN_ON_ONCE(skb_unref(skb)))
- kfree_skb(skb);
+ unlinked_skb = skb;
skb = skb_peek(&sk->sk_receive_queue);
}
}
spin_unlock(&sk->sk_receive_queue.lock);
+
+ if (unlinked_skb) {
+ WARN_ON_ONCE(skb_unref(unlinked_skb));
+ kfree_skb(unlinked_skb);
+ }
+
}
return skb;
}
---8<---
>
> I might be missing something, but
>
> from array import array
> from socket import *
> a, b = socketpair(AF_UNIX, SOCK_STREAM)
> scm = (SOL_SOCKET, SCM_RIGHTS, array("i", [b.fileno()]))
> b.sendmsg([b'x'], [scm], MSG_OOB)
> b.close()
> a.recv(MSG_DONTWAIT)
>
> [ 72.513125] ======================================================
> [ 72.513148] WARNING: possible circular locking dependency detected
> [ 72.513170] 6.9.0-rc7nokasan+ #25 Not tainted
> [ 72.513193] ------------------------------------------------------
> [ 72.513215] python/1054 is trying to acquire lock:
> [ 72.513237] ffffffff83563898 (unix_gc_lock){+.+.}-{2:2}, at: unix_notinflight+0x23/0x100
> [ 72.513266]
> but task is already holding lock:
> [ 72.513288] ffff88811eb10898 (rlock-AF_UNIX){+.+.}-{2:2}, at: unix_stream_read_generic+0x178/0xbc0
> [ 72.513313]
> which lock already depends on the new lock.
>
> [ 72.513336]
> the existing dependency chain (in reverse order) is:
> [ 72.513358]
> -> #1 (rlock-AF_UNIX){+.+.}-{2:2}:
> [ 72.513381] _raw_spin_lock+0x2f/0x40
> [ 72.513404] scan_inflight+0x36/0x1e0
> [ 72.513428] __unix_gc+0x17c/0x4b0
> [ 72.513450] process_one_work+0x217/0x700
> [ 72.513474] worker_thread+0x1ca/0x3b0
> [ 72.513497] kthread+0xdd/0x110
> [ 72.513519] ret_from_fork+0x2d/0x50
> [ 72.513543] ret_from_fork_asm+0x1a/0x30
> [ 72.513565]
> -> #0 (unix_gc_lock){+.+.}-{2:2}:
> [ 72.513589] __lock_acquire+0x137b/0x20e0
> [ 72.513612] lock_acquire+0xc5/0x2c0
> [ 72.513635] _raw_spin_lock+0x2f/0x40
> [ 72.513657] unix_notinflight+0x23/0x100
> [ 72.513680] unix_destruct_scm+0x95/0xa0
> [ 72.513702] skb_release_head_state+0x20/0x60
> [ 72.513726] kfree_skb_reason+0x53/0x1e0
> [ 72.513748] unix_stream_read_generic+0xb69/0xbc0
> [ 72.513771] unix_stream_recvmsg+0x68/0x80
> [ 72.513794] sock_recvmsg+0xb9/0xc0
> [ 72.513817] __sys_recvfrom+0xa1/0x110
> [ 72.513840] __x64_sys_recvfrom+0x20/0x30
> [ 72.513862] do_syscall_64+0x93/0x190
> [ 72.513886] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [ 72.513909]
> other info that might help us debug this:
>
> [ 72.513932] Possible unsafe locking scenario:
>
> [ 72.513954] CPU0 CPU1
> [ 72.513976] ---- ----
> [ 72.513998] lock(rlock-AF_UNIX);
> [ 72.514020] lock(unix_gc_lock);
> [ 72.514043] lock(rlock-AF_UNIX);
> [ 72.514066] lock(unix_gc_lock);
> [ 72.514088]
> *** DEADLOCK ***
>
> [ 72.514110] 3 locks held by python/1054:
> [ 72.514133] #0: ffff88811eb10cf0 (&u->iolock){+.+.}-{3:3}, at: unix_stream_read_generic+0xd4/0xbc0
> [ 72.514158] #1: ffff88811eb10de0 (&u->lock){+.+.}-{2:2}, at: unix_stream_read_generic+0x110/0xbc0
> [ 72.514184] #2: ffff88811eb10898 (rlock-AF_UNIX){+.+.}-{2:2}, at: unix_stream_read_generic+0x178/0xbc0
> [ 72.514209]
> stack backtrace:
> [ 72.514231] CPU: 4 PID: 1054 Comm: python Not tainted 6.9.0-rc7nokasan+ #25
> [ 72.514254] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
> [ 72.514278] Call Trace:
> [ 72.514300] <TASK>
> [ 72.514321] dump_stack_lvl+0x73/0xb0
> [ 72.514345] check_noncircular+0x108/0x120
> [ 72.514369] __lock_acquire+0x137b/0x20e0
> [ 72.514392] lock_acquire+0xc5/0x2c0
> [ 72.514415] ? unix_notinflight+0x23/0x100
> [ 72.514439] _raw_spin_lock+0x2f/0x40
> [ 72.514461] ? unix_notinflight+0x23/0x100
> [ 72.514484] unix_notinflight+0x23/0x100
> [ 72.514507] unix_destruct_scm+0x95/0xa0
> [ 72.514530] skb_release_head_state+0x20/0x60
> [ 72.514553] kfree_skb_reason+0x53/0x1e0
> [ 72.514575] unix_stream_read_generic+0xb69/0xbc0
> [ 72.514600] unix_stream_recvmsg+0x68/0x80
> [ 72.514623] ? __pfx_unix_stream_read_actor+0x10/0x10
> [ 72.514646] sock_recvmsg+0xb9/0xc0
> [ 72.514669] __sys_recvfrom+0xa1/0x110
> [ 72.514692] ? lock_release+0x133/0x290
> [ 72.514715] ? syscall_exit_to_user_mode+0x11/0x280
> [ 72.514739] ? do_syscall_64+0xa0/0x190
> [ 72.514762] __x64_sys_recvfrom+0x20/0x30
> [ 72.514784] do_syscall_64+0x93/0x190
> [ 72.514807] ? clear_bhb_loop+0x45/0xa0
> [ 72.514829] ? clear_bhb_loop+0x45/0xa0
> [ 72.514852] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [ 72.514875] RIP: 0033:0x7fc16bb3594d
> [ 72.514899] Code: 02 02 00 00 00 5d c3 66 0f 1f 44 00 00 f3 0f 1e fa 80 3d 25 8a 0c 00 00 41 89 ca 74 20 45 31 c9 45 31 c0 b8 2d 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 6b c3 66 2e 0f 1f 84 00 00 00 00 00 55 48 89
> [ 72.514925] RSP: 002b:00007fffae4ab2f8 EFLAGS: 00000246 ORIG_RAX: 000000000000002d
> [ 72.514949] RAX: ffffffffffffffda RBX: 00007fffae4ab3c8 RCX: 00007fc16bb3594d
> [ 72.514972] RDX: 0000000000000040 RSI: 00007fc15e298f30 RDI: 0000000000000003
> [ 72.514994] RBP: 00007fffae4ab310 R08: 0000000000000000 R09: 0000000000000000
> [ 72.515017] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fc15e34af90
> [ 72.515040] R13: 0000000000000000 R14: ffffffffc4653600 R15: 0000000000000000
> [ 72.515064] </TASK>
>
> >> unix_destruct_scm
> >
> > So, here we don't reach unix_destruct_scm().
> >
> > That's why I changed kfree_skb() to skb_unref() in __unix_gc().
> >
> > Thanks!
> >
> >
> >> unix_notinflight
> >> spin_lock(&unix_gc_lock)
> >>
> >> I.e. sk_receive_queue.lock -> unix_gc_lock, inversion of what unix_gc() does.
> >> But that's benign, right?
>
* Re: [PATCH v2 net] af_unix: Update unix_sk(sk)->oob_skb under sk_receive_queue lock.
2024-05-13 7:44 ` Kuniyuki Iwashima
@ 2024-05-13 9:14 ` Michal Luczaj
2024-05-13 9:24 ` Kuniyuki Iwashima
0 siblings, 1 reply; 12+ messages in thread
From: Michal Luczaj @ 2024-05-13 9:14 UTC (permalink / raw)
To: Kuniyuki Iwashima; +Cc: billy, davem, edumazet, kuba, kuni1840, netdev, pabeni
On 5/13/24 09:44, Kuniyuki Iwashima wrote:
> From: Michal Luczaj <mhal@rbox.co>
> Date: Mon, 13 May 2024 08:40:34 +0200
>> What I'm talking about is the quoted above (unchanged) part in manage_oob():
>>
>> if (!WARN_ON_ONCE(skb_unref(skb)))
>> kfree_skb(skb);
>
> Ah, I got your point, good catch!
>
> Somehow I was thinking of new GC where alive recvq is not touched
> and lockdep would end up with false-positive.
>
> We need to delay freeing oob_skb in that case like below.
> ...
So this is not a lockdep false positive after all?
Here's my understanding: the only way manage_oob() can lead to an
inverted locking order is when the receiver socket is _not_ in
gc_candidates. And when it's not there, there's no risk of deadlock.
What do you think?
* Re: [PATCH v2 net] af_unix: Update unix_sk(sk)->oob_skb under sk_receive_queue lock.
2024-05-13 9:14 ` Michal Luczaj
@ 2024-05-13 9:24 ` Kuniyuki Iwashima
2024-05-13 10:15 ` Michal Luczaj
0 siblings, 1 reply; 12+ messages in thread
From: Kuniyuki Iwashima @ 2024-05-13 9:24 UTC (permalink / raw)
To: mhal; +Cc: billy, davem, edumazet, kuba, kuni1840, kuniyu, netdev, pabeni
From: Michal Luczaj <mhal@rbox.co>
Date: Mon, 13 May 2024 11:14:39 +0200
> On 5/13/24 09:44, Kuniyuki Iwashima wrote:
> > From: Michal Luczaj <mhal@rbox.co>
> > Date: Mon, 13 May 2024 08:40:34 +0200
> >> What I'm talking about is the quoted above (unchanged) part in manage_oob():
> >>
> >> if (!WARN_ON_ONCE(skb_unref(skb)))
> >> kfree_skb(skb);
> >
> > Ah, I got your point, good catch!
> >
> > Somehow I was thinking of new GC where alive recvq is not touched
> > and lockdep would end up with false-positive.
> >
> > We need to delay freeing oob_skb in that case like below.
> > ...
>
> So this not a lockdep false positive after all?
>
> Here's my understanding: the only way manage_oob() can lead to an inverted locking
> order is when the receiver socket is _not_ in gc_candidates. And when it's not
> there, no risk of deadlock. What do you think?
For the new GC, it's a false positive, but for the old GC, it's not.
The old GC takes unix_gc_lock and can iterate over alive sockets as
long as they are linked to gc_inflight_list, and then their recvq is
locked.
The new GC only touches a recvq when the socket is detected as dead,
meaning there's no recv() call in progress for that socket.
This patch is for both GC implementations, so we need to free the skb
after unlocking the recvq in manage_oob().
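The deferred-free pattern can be sketched in user space (a hypothetical
Python analogy, not kernel code; gc_lock stands in for unix_gc_lock,
queue_lock for sk_receive_queue.lock, and destructor() for the
unix_notinflight() call reached from the skb destructor):

```python
import threading

gc_lock = threading.Lock()     # analogy for unix_gc_lock
queue_lock = threading.Lock()  # analogy for sk_receive_queue.lock
queue = ["oob_skb"]
freed = []

def destructor(obj):
    # Analogy for unix_notinflight() run from the skb destructor:
    # freeing an inflight skb needs gc_lock.
    with gc_lock:
        freed.append(obj)

def recv_oob():
    unlinked = None
    with queue_lock:
        if queue:
            unlinked = queue.pop(0)  # __skb_unlink() under the queue lock
    # Free only after queue_lock is dropped: calling the destructor
    # inside the critical section would take queue_lock -> gc_lock,
    # inverting the gc_lock -> queue_lock order used by the old GC.
    if unlinked is not None:
        destructor(unlinked)

recv_oob()
assert queue == [] and freed == ["oob_skb"]
```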
* Re: [PATCH v2 net] af_unix: Update unix_sk(sk)->oob_skb under sk_receive_queue lock.
2024-05-13 9:24 ` Kuniyuki Iwashima
@ 2024-05-13 10:15 ` Michal Luczaj
2024-05-13 12:44 ` Kuniyuki Iwashima
0 siblings, 1 reply; 12+ messages in thread
From: Michal Luczaj @ 2024-05-13 10:15 UTC (permalink / raw)
To: Kuniyuki Iwashima; +Cc: billy, davem, edumazet, kuba, kuni1840, netdev, pabeni
On 5/13/24 11:24, Kuniyuki Iwashima wrote:
> From: Michal Luczaj <mhal@rbox.co>
> Date: Mon, 13 May 2024 11:14:39 +0200
>> On 5/13/24 09:44, Kuniyuki Iwashima wrote:
>>> From: Michal Luczaj <mhal@rbox.co>
>>> Date: Mon, 13 May 2024 08:40:34 +0200
>>>> What I'm talking about is the quoted above (unchanged) part in manage_oob():
>>>>
>>>> if (!WARN_ON_ONCE(skb_unref(skb)))
>>>> kfree_skb(skb);
>>>
>>> Ah, I got your point, good catch!
>>>
>>> Somehow I was thinking of new GC where alive recvq is not touched
>>> and lockdep would end up with false-positive.
>>>
>>> We need to delay freeing oob_skb in that case like below.
>>> ...
>>
>> So this is not a lockdep false positive after all?
>>
>> Here's my understanding: the only way manage_oob() can lead to an inverted locking
>> order is when the receiver socket is _not_ in gc_candidates. And when it's not
>> there, no risk of deadlock. What do you think?
>
> For the new GC, it's false positive, but for the old GC, it's not.
>
> The old GC locks unix_gc_lock and could iterate alive sockets if
> they are linked to gc_inflight_list, and then recvq is locked.
> ...
The recvq is locked not for all sockets in gc_inflight_list, but only for its
subset, gc_candidates, i.e. sockets that fulfil the 'total_refs == u->inflight'
condition, right? So doesn't this imply that our receiver is not user-reachable
and manage_oob() cannot be called, let alone race with GC?
I wouldn't be surprised if I were missing something important, but it's not
like I didn't try to deadlock this code :)
* Re: [PATCH v2 net] af_unix: Update unix_sk(sk)->oob_skb under sk_receive_queue lock.
2024-05-13 10:15 ` Michal Luczaj
@ 2024-05-13 12:44 ` Kuniyuki Iwashima
0 siblings, 0 replies; 12+ messages in thread
From: Kuniyuki Iwashima @ 2024-05-13 12:44 UTC (permalink / raw)
To: mhal; +Cc: billy, davem, edumazet, kuba, kuni1840, kuniyu, netdev, pabeni
From: Michal Luczaj <mhal@rbox.co>
Date: Mon, 13 May 2024 12:15:57 +0200
> On 5/13/24 11:24, Kuniyuki Iwashima wrote:
> > From: Michal Luczaj <mhal@rbox.co>
> > Date: Mon, 13 May 2024 11:14:39 +0200
> >> On 5/13/24 09:44, Kuniyuki Iwashima wrote:
> >>> From: Michal Luczaj <mhal@rbox.co>
> >>> Date: Mon, 13 May 2024 08:40:34 +0200
> >>>> What I'm talking about is the quoted above (unchanged) part in manage_oob():
> >>>>
> >>>> if (!WARN_ON_ONCE(skb_unref(skb)))
> >>>> kfree_skb(skb);
> >>>
> >>> Ah, I got your point, good catch!
> >>>
> >>> Somehow I was thinking of new GC where alive recvq is not touched
> >>> and lockdep would end up with false-positive.
> >>>
> >>> We need to delay freeing oob_skb in that case like below.
> >>> ...
> >>
> >> So this is not a lockdep false positive after all?
> >>
> >> Here's my understanding: the only way manage_oob() can lead to an inverted locking
> >> order is when the receiver socket is _not_ in gc_candidates. And when it's not
> >> there, no risk of deadlock. What do you think?
> >
> > For the new GC, it's false positive, but for the old GC, it's not.
> >
> > The old GC locks unix_gc_lock and could iterate alive sockets if
> > they are linked to gc_inflight_list, and then recvq is locked.
> > ...
>
> The recvq is locked not for all sockets in gc_inflight_list, but only its
> subset, gc_candidates, i.e. sockets that fulfil the 'total_refs == u->inflight'
> condition, right? So doesn't this imply that our receiver is not user-reachable
> and manage_oob() cannot be called/raced?
Ah, yes, the splat was a false positive for the old GC too.
Instead of using a different lock class for the recvq in GC, it would
be better to unlock it earlier in manage_oob(), so I'll post a v3.
Thanks!
2024-05-10 9:39 [PATCH v2 net] af_unix: Update unix_sk(sk)->oob_skb under sk_receive_queue lock Kuniyuki Iwashima
2024-05-10 10:44 ` Paolo Abeni
2024-05-10 10:54 ` Kuniyuki Iwashima
2024-05-10 11:11 ` Paolo Abeni
2024-05-12 14:47 ` Michal Luczaj
2024-05-13 6:12 ` Kuniyuki Iwashima
2024-05-13 6:40 ` Michal Luczaj
2024-05-13 7:44 ` Kuniyuki Iwashima
2024-05-13 9:14 ` Michal Luczaj
2024-05-13 9:24 ` Kuniyuki Iwashima
2024-05-13 10:15 ` Michal Luczaj
2024-05-13 12:44 ` Kuniyuki Iwashima