* [PATCH] nvme: tcp: avoid race between lock and destroy of queue_lock
@ 2024-09-12 6:27 Shin'ichiro Kawasaki
2024-09-12 6:59 ` Sagi Grimberg
0 siblings, 1 reply; 7+ messages in thread
From: Shin'ichiro Kawasaki @ 2024-09-12 6:27 UTC (permalink / raw)
To: linux-nvme, Keith Busch
Cc: Akinobu Mita, Hannes Reinecke, Martin Belanger, Christoph Hellwig,
Sagi Grimberg, Shin'ichiro Kawasaki
Commit 76d54bf20cdc ("nvme-tcp: don't access released socket during
error recovery") added a mutex_lock() call for the queue->queue_lock
in nvme_tcp_get_address(). However, the mutex_lock() races with
mutex_destroy() in nvme_tcp_free_queue(), and causes the WARN below.
DEBUG_LOCKS_WARN_ON(lock->magic != lock)
WARNING: CPU: 3 PID: 34077 at kernel/locking/mutex.c:587 __mutex_lock+0xcf0/0x1220
Modules linked in: nvmet_tcp nvmet nvme_tcp nvme_fabrics iw_cm ib_cm ib_core pktcdvd nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables qrtr sunrpc ppdev 9pnet_virtio 9pnet pcspkr netfs parport_pc parport e1000 i2c_piix4 i2c_smbus loop fuse nfnetlink zram bochs drm_vram_helper drm_ttm_helper ttm drm_kms_helper xfs drm sym53c8xx floppy nvme scsi_transport_spi nvme_core nvme_auth serio_raw ata_generic pata_acpi dm_multipath qemu_fw_cfg [last unloaded: ib_uverbs]
CPU: 3 UID: 0 PID: 34077 Comm: udisksd Not tainted 6.11.0-rc7 #319
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-2.fc40 04/01/2014
RIP: 0010:__mutex_lock+0xcf0/0x1220
Code: 08 84 d2 0f 85 c8 04 00 00 8b 15 ef b6 c8 01 85 d2 0f 85 78 f4 ff ff 48 c7 c6 20 93 ee af 48 c7 c7 60 91 ee af e8 f0 a7 6d fd <0f> 0b e9 5e f4 ff ff 48 b8 00 00 00 00 00 fc ff df 4c 89 f2 48 c1
RSP: 0018:ffff88811305f760 EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffff88812c652058 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000000000000001
RBP: ffff88811305f8b0 R08: 0000000000000001 R09: ffffed1075c36341
R10: ffff8883ae1b1a0b R11: 0000000000010498 R12: 0000000000000000
R13: 0000000000000000 R14: dffffc0000000000 R15: ffff88812c652058
FS: 00007f9713ae4980(0000) GS:ffff8883ae180000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fcd78483c7c CR3: 0000000122c38000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
? __warn.cold+0x5b/0x1af
? __mutex_lock+0xcf0/0x1220
? report_bug+0x1ec/0x390
? handle_bug+0x3c/0x80
? exc_invalid_op+0x13/0x40
? asm_exc_invalid_op+0x16/0x20
? __mutex_lock+0xcf0/0x1220
? nvme_tcp_get_address+0xc2/0x1e0 [nvme_tcp]
? __pfx___mutex_lock+0x10/0x10
? __lock_acquire+0xd6a/0x59e0
? nvme_tcp_get_address+0xc2/0x1e0 [nvme_tcp]
nvme_tcp_get_address+0xc2/0x1e0 [nvme_tcp]
? __pfx_nvme_tcp_get_address+0x10/0x10 [nvme_tcp]
nvme_sysfs_show_address+0x81/0xc0 [nvme_core]
dev_attr_show+0x42/0x80
? __asan_memset+0x1f/0x40
sysfs_kf_seq_show+0x1f0/0x370
seq_read_iter+0x2cb/0x1130
? rw_verify_area+0x3b1/0x590
? __mutex_lock+0x433/0x1220
vfs_read+0x6a6/0xa20
? lockdep_hardirqs_on+0x78/0x100
? __pfx_vfs_read+0x10/0x10
ksys_read+0xf7/0x1d0
? __pfx_ksys_read+0x10/0x10
? __x64_sys_openat+0x105/0x1d0
do_syscall_64+0x93/0x180
? lockdep_hardirqs_on_prepare+0x16d/0x400
? do_syscall_64+0x9f/0x180
? lockdep_hardirqs_on+0x78/0x100
? do_syscall_64+0x9f/0x180
? __pfx_ksys_read+0x10/0x10
? lockdep_hardirqs_on_prepare+0x16d/0x400
? do_syscall_64+0x9f/0x180
? lockdep_hardirqs_on+0x78/0x100
? do_syscall_64+0x9f/0x180
? lockdep_hardirqs_on_prepare+0x16d/0x400
? do_syscall_64+0x9f/0x180
? lockdep_hardirqs_on+0x78/0x100
? do_syscall_64+0x9f/0x180
? lockdep_hardirqs_on_prepare+0x16d/0x400
? do_syscall_64+0x9f/0x180
? lockdep_hardirqs_on+0x78/0x100
? do_syscall_64+0x9f/0x180
? lockdep_hardirqs_on_prepare+0x16d/0x400
? do_syscall_64+0x9f/0x180
? lockdep_hardirqs_on+0x78/0x100
? do_syscall_64+0x9f/0x180
? do_syscall_64+0x9f/0x180
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f9713f55cfa
Code: 55 48 89 e5 48 83 ec 20 48 89 55 e8 48 89 75 f0 89 7d f8 e8 e8 74 f8 ff 48 8b 55 e8 48 8b 75 f0 41 89 c0 8b 7d f8 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 2e 44 89 c7 48 89 45 f8 e8 42 75 f8 ff 48 8b
RSP: 002b:00007ffd7f512e70 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
RAX: ffffffffffffffda RBX: 000055c38f316859 RCX: 00007f9713f55cfa
RDX: 0000000000000fff RSI: 00007ffd7f512eb0 RDI: 0000000000000011
RBP: 00007ffd7f512e90 R08: 0000000000000000 R09: 00000000ffffffff
R10: 0000000000000000 R11: 0000000000000246 R12: 000055c38f317148
R13: 0000000000000000 R14: 00007f96f4004f30 R15: 000055c3b6b623c0
</TASK>
The WARN is observed when the blktests test case nvme/014 is repeated
with tcp transport. It is rare, and 200 times repeat is required to
recreate in my test environment.
To avoid the WARN, serialize the two functions nvme_tcp_get_address()
and nvme_tcp_free_queue(). Add the flag NVME_TCP_Q_REFERRED and set it
when nvme_tcp_get_address() starts. Have nvme_tcp_free_queue() wait the
flag cleared so that it does not run until nvme_tcp_get_address() ends.
Also check the NVME_TCP_Q_ALLOCATED flag when nvme_tcp_get_address()
starts. This flag is cleared at the nvme_tcp_free_queue() start, then it
ensures that nvme_tcp_get_address() is not executed while
nvme_tcp_free_queue() runs.
Of note is that nvme_tcp_stop_queue() locks the queue->queue_lock in
same manner as nvme_tcp_get_address(). However, nvme_tcp_stop_queue() is
called in sequence before nvme_tcp_free_queue(), then it is safe to
assume that the queue->queue_lock is not yet destroyed at
nvme_tcp_stop_queue() call. Therefore this fix does not touch
nvme_tcp_stop_queue().
Fixes: 76d54bf20cdc ("nvme-tcp: don't access released socket during error recovery")
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
---
drivers/nvme/host/tcp.c | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index a2a47d3ab99f..1457ffda207b 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -127,6 +127,7 @@ enum nvme_tcp_queue_flags {
NVME_TCP_Q_ALLOCATED = 0,
NVME_TCP_Q_LIVE = 1,
NVME_TCP_Q_POLLING = 2,
+ NVME_TCP_Q_REFERRED = 3,
};
enum nvme_tcp_recv_state {
@@ -1365,6 +1366,9 @@ static void nvme_tcp_free_queue(struct nvme_ctrl *nctrl, int qid)
if (!test_and_clear_bit(NVME_TCP_Q_ALLOCATED, &queue->flags))
return;
+ wait_on_bit_io(&queue->flags, NVME_TCP_Q_REFERRED,
+ TASK_UNINTERRUPTIBLE);
+
if (queue->hdr_digest || queue->data_digest)
nvme_tcp_free_crypto(queue);
@@ -2613,7 +2617,14 @@ static int nvme_tcp_get_address(struct nvme_ctrl *ctrl, char *buf, int size)
{
struct nvme_tcp_queue *queue = &to_tcp_ctrl(ctrl)->queues[0];
struct sockaddr_storage src_addr;
- int ret, len;
+ int ret, len = 0;
+
+ /* Serialize this function and nvme_tcp_free_queue() */
+ if (test_and_set_bit(NVME_TCP_Q_REFERRED, &queue->flags))
+ return 0;
+
+ if (!test_bit(NVME_TCP_Q_ALLOCATED, &queue->flags))
+ goto out;
len = nvmf_get_address(ctrl, buf, size);
@@ -2631,6 +2642,9 @@ static int nvme_tcp_get_address(struct nvme_ctrl *ctrl, char *buf, int size)
done:
mutex_unlock(&queue->queue_lock);
+out:
+ clear_and_wake_up_bit(NVME_TCP_Q_REFERRED, &queue->flags);
+
return len;
}
--
2.45.2
^ permalink raw reply related [flat|nested] 7+ messages in thread* Re: [PATCH] nvme: tcp: avoid race between lock and destroy of queue_lock 2024-09-12 6:27 [PATCH] nvme: tcp: avoid race between lock and destroy of queue_lock Shin'ichiro Kawasaki @ 2024-09-12 6:59 ` Sagi Grimberg 2024-09-12 9:16 ` Hannes Reinecke 0 siblings, 1 reply; 7+ messages in thread From: Sagi Grimberg @ 2024-09-12 6:59 UTC (permalink / raw) To: Shin'ichiro Kawasaki, linux-nvme, Keith Busch Cc: Akinobu Mita, Hannes Reinecke, Martin Belanger, Christoph Hellwig On 12/09/2024 9:27, Shin'ichiro Kawasaki wrote: > Commit 76d54bf20cdc ("nvme-tcp: don't access released socket during > error recovery") added a mutex_lock() call for the queue->queue_lock > in nvme_tcp_get_address(). However, the mutex_lock() races with > mutex_destroy() in nvme_tcp_free_queue(), and causes the WARN below. > > DEBUG_LOCKS_WARN_ON(lock->magic != lock) > WARNING: CPU: 3 PID: 34077 at kernel/locking/mutex.c:587 __mutex_lock+0xcf0/0x1220 > Modules linked in: nvmet_tcp nvmet nvme_tcp nvme_fabrics iw_cm ib_cm ib_core pktcdvd nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables qrtr sunrpc ppdev 9pnet_virtio 9pnet pcspkr netfs parport_pc parport e1000 i2c_piix4 i2c_smbus loop fuse nfnetlink zram bochs drm_vram_helper drm_ttm_helper ttm drm_kms_helper xfs drm sym53c8xx floppy nvme scsi_transport_spi nvme_core nvme_auth serio_raw ata_generic pata_acpi dm_multipath qemu_fw_cfg [last unloaded: ib_uverbs] > CPU: 3 UID: 0 PID: 34077 Comm: udisksd Not tainted 6.11.0-rc7 #319 > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-2.fc40 04/01/2014 > RIP: 0010:__mutex_lock+0xcf0/0x1220 > Code: 08 84 d2 0f 85 c8 04 00 00 8b 15 ef b6 c8 01 85 d2 0f 85 78 f4 ff ff 48 c7 c6 20 93 ee af 48 c7 c7 60 91 ee af e8 f0 a7 6d fd <0f> 0b e9 5e f4 ff ff 48 b8 00 00 00 00 00 fc ff df 4c 89 f2 48 c1 > RSP: 0018:ffff88811305f760 EFLAGS: 00010286 > RAX: 0000000000000000 RBX: ffff88812c652058 RCX: 0000000000000000 > RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000000000000001 > RBP: ffff88811305f8b0 R08: 0000000000000001 R09: ffffed1075c36341 > R10: ffff8883ae1b1a0b R11: 0000000000010498 R12: 0000000000000000 > R13: 0000000000000000 R14: dffffc0000000000 R15: ffff88812c652058 > FS: 00007f9713ae4980(0000) GS:ffff8883ae180000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 00007fcd78483c7c CR3: 0000000122c38000 CR4: 00000000000006f0 > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > Call Trace: > <TASK> > ? __warn.cold+0x5b/0x1af > ? __mutex_lock+0xcf0/0x1220 > ? report_bug+0x1ec/0x390 > ? handle_bug+0x3c/0x80 > ? exc_invalid_op+0x13/0x40 > ? asm_exc_invalid_op+0x16/0x20 > ? __mutex_lock+0xcf0/0x1220 > ? nvme_tcp_get_address+0xc2/0x1e0 [nvme_tcp] > ? __pfx___mutex_lock+0x10/0x10 > ? __lock_acquire+0xd6a/0x59e0 > ? nvme_tcp_get_address+0xc2/0x1e0 [nvme_tcp] > nvme_tcp_get_address+0xc2/0x1e0 [nvme_tcp] > ? __pfx_nvme_tcp_get_address+0x10/0x10 [nvme_tcp] > nvme_sysfs_show_address+0x81/0xc0 [nvme_core] > dev_attr_show+0x42/0x80 > ? __asan_memset+0x1f/0x40 > sysfs_kf_seq_show+0x1f0/0x370 > seq_read_iter+0x2cb/0x1130 > ? rw_verify_area+0x3b1/0x590 > ? __mutex_lock+0x433/0x1220 > vfs_read+0x6a6/0xa20 > ? lockdep_hardirqs_on+0x78/0x100 > ? __pfx_vfs_read+0x10/0x10 > ksys_read+0xf7/0x1d0 > ? __pfx_ksys_read+0x10/0x10 > ? __x64_sys_openat+0x105/0x1d0 > do_syscall_64+0x93/0x180 > ? lockdep_hardirqs_on_prepare+0x16d/0x400 > ? do_syscall_64+0x9f/0x180 > ? lockdep_hardirqs_on+0x78/0x100 > ? do_syscall_64+0x9f/0x180 > ? __pfx_ksys_read+0x10/0x10 > ? lockdep_hardirqs_on_prepare+0x16d/0x400 > ? do_syscall_64+0x9f/0x180 > ? lockdep_hardirqs_on+0x78/0x100 > ? do_syscall_64+0x9f/0x180 > ? lockdep_hardirqs_on_prepare+0x16d/0x400 > ? do_syscall_64+0x9f/0x180 > ? lockdep_hardirqs_on+0x78/0x100 > ? do_syscall_64+0x9f/0x180 > ? lockdep_hardirqs_on_prepare+0x16d/0x400 > ? do_syscall_64+0x9f/0x180 > ? lockdep_hardirqs_on+0x78/0x100 > ? do_syscall_64+0x9f/0x180 > ? lockdep_hardirqs_on_prepare+0x16d/0x400 > ? do_syscall_64+0x9f/0x180 > ? lockdep_hardirqs_on+0x78/0x100 > ? do_syscall_64+0x9f/0x180 > ? do_syscall_64+0x9f/0x180 > entry_SYSCALL_64_after_hwframe+0x76/0x7e > RIP: 0033:0x7f9713f55cfa > Code: 55 48 89 e5 48 83 ec 20 48 89 55 e8 48 89 75 f0 89 7d f8 e8 e8 74 f8 ff 48 8b 55 e8 48 8b 75 f0 41 89 c0 8b 7d f8 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 2e 44 89 c7 48 89 45 f8 e8 42 75 f8 ff 48 8b > RSP: 002b:00007ffd7f512e70 EFLAGS: 00000246 ORIG_RAX: 0000000000000000 > RAX: ffffffffffffffda RBX: 000055c38f316859 RCX: 00007f9713f55cfa > RDX: 0000000000000fff RSI: 00007ffd7f512eb0 RDI: 0000000000000011 > RBP: 00007ffd7f512e90 R08: 0000000000000000 R09: 00000000ffffffff > R10: 0000000000000000 R11: 0000000000000246 R12: 000055c38f317148 > R13: 0000000000000000 R14: 00007f96f4004f30 R15: 000055c3b6b623c0 > </TASK> > > The WARN is observed when the blktests test case nvme/014 is repeated > with tcp transport. It is rare, and 200 times repeat is required to > recreate in my test environment. > > To avoid the WARN, serialize the two functions nvme_tcp_get_address() > and nvme_tcp_free_queue(). Add the flag NVME_TCP_Q_REFERRED and set it > when nvme_tcp_get_address() starts. Have nvme_tcp_free_queue() wait the > flag cleared so that it does not run until nvme_tcp_get_address() ends. > Also check the NVME_TCP_Q_ALLOCATED flag when nvme_tcp_get_address() > starts. This flag is cleared at the nvme_tcp_free_queue() start, then it > ensures that nvme_tcp_get_address() is not executed while > nvme_tcp_free_queue() runs. > > Of note is that nvme_tcp_stop_queue() locks the queue->queue_lock in > same manner as nvme_tcp_get_address(). However, nvme_tcp_stop_queue() is > called in sequence before nvme_tcp_free_queue(), then it is safe to > assume that the queue->queue_lock is not yet destroyed at > nvme_tcp_stop_queue() call. Therefore this fix does not touch > nvme_tcp_stop_queue(). > > Fixes: 76d54bf20cdc ("nvme-tcp: don't access released socket during error recovery") > Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> > --- > drivers/nvme/host/tcp.c | 16 +++++++++++++++- > 1 file changed, 15 insertions(+), 1 deletion(-) > > diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c > index a2a47d3ab99f..1457ffda207b 100644 > --- a/drivers/nvme/host/tcp.c > +++ b/drivers/nvme/host/tcp.c > @@ -127,6 +127,7 @@ enum nvme_tcp_queue_flags { > NVME_TCP_Q_ALLOCATED = 0, > NVME_TCP_Q_LIVE = 1, > NVME_TCP_Q_POLLING = 2, > + NVME_TCP_Q_REFERRED = 3, > }; > > enum nvme_tcp_recv_state { > @@ -1365,6 +1366,9 @@ static void nvme_tcp_free_queue(struct nvme_ctrl *nctrl, int qid) > if (!test_and_clear_bit(NVME_TCP_Q_ALLOCATED, &queue->flags)) > return; > > + wait_on_bit_io(&queue->flags, NVME_TCP_Q_REFERRED, > + TASK_UNINTERRUPTIBLE); > + > if (queue->hdr_digest || queue->data_digest) > nvme_tcp_free_crypto(queue); > > @@ -2613,7 +2617,14 @@ static int nvme_tcp_get_address(struct nvme_ctrl *ctrl, char *buf, int size) > { > struct nvme_tcp_queue *queue = &to_tcp_ctrl(ctrl)->queues[0]; > struct sockaddr_storage src_addr; > - int ret, len; > + int ret, len = 0; > + > + /* Serialize this function and nvme_tcp_free_queue() */ > + if (test_and_set_bit(NVME_TCP_Q_REFERRED, &queue->flags)) > + return 0; > + > + if (!test_bit(NVME_TCP_Q_ALLOCATED, &queue->flags)) > + goto out; > > len = nvmf_get_address(ctrl, buf, size); > > @@ -2631,6 +2642,9 @@ static int nvme_tcp_get_address(struct nvme_ctrl *ctrl, char *buf, int size) > done: > mutex_unlock(&queue->queue_lock); > > +out: > + clear_and_wake_up_bit(NVME_TCP_Q_REFERRED, &queue->flags); > + > return len; > } > This is really not needed.. You can already serialize with: -- diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c index a2a47d3ab99f..91c19d06cae9 100644 --- a/drivers/nvme/host/tcp.c +++ b/drivers/nvme/host/tcp.c @@ -1365,6 +1365,10 @@ static void nvme_tcp_free_queue(struct nvme_ctrl *nctrl, int qid) if (!test_and_clear_bit(NVME_TCP_Q_ALLOCATED, &queue->flags)) return; + /* make sure any concurrent .get_address completes */ + mutex_lock(&queue->queue_lock); + mutex_unlock(&queue->queue_lock); + if (queue->hdr_digest || queue->data_digest) nvme_tcp_free_crypto(queue); -- ^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [PATCH] nvme: tcp: avoid race between lock and destroy of queue_lock 2024-09-12 6:59 ` Sagi Grimberg @ 2024-09-12 9:16 ` Hannes Reinecke 2024-09-12 10:20 ` Sagi Grimberg 0 siblings, 1 reply; 7+ messages in thread From: Hannes Reinecke @ 2024-09-12 9:16 UTC (permalink / raw) To: Sagi Grimberg, Shin'ichiro Kawasaki, linux-nvme, Keith Busch Cc: Akinobu Mita, Martin Belanger, Christoph Hellwig On 9/12/24 08:59, Sagi Grimberg wrote: > > > > On 12/09/2024 9:27, Shin'ichiro Kawasaki wrote: >> Commit 76d54bf20cdc ("nvme-tcp: don't access released socket during >> error recovery") added a mutex_lock() call for the queue->queue_lock >> in nvme_tcp_get_address(). However, the mutex_lock() races with >> mutex_destroy() in nvme_tcp_free_queue(), and causes the WARN below. >> >> DEBUG_LOCKS_WARN_ON(lock->magic != lock) >> WARNING: CPU: 3 PID: 34077 at kernel/locking/mutex.c:587 >> __mutex_lock+0xcf0/0x1220 >> Modules linked in: nvmet_tcp nvmet nvme_tcp nvme_fabrics iw_cm ib_cm >> ib_core pktcdvd nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib >> nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct >> nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set >> nf_tables qrtr sunrpc ppdev 9pnet_virtio 9pnet pcspkr netfs parport_pc >> parport e1000 i2c_piix4 i2c_smbus loop fuse nfnetlink zram bochs >> drm_vram_helper drm_ttm_helper ttm drm_kms_helper xfs drm sym53c8xx >> floppy nvme scsi_transport_spi nvme_core nvme_auth serio_raw >> ata_generic pata_acpi dm_multipath qemu_fw_cfg [last unloaded: ib_uverbs] >> CPU: 3 UID: 0 PID: 34077 Comm: udisksd Not tainted 6.11.0-rc7 #319 >> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS >> 1.16.3-2.fc40 04/01/2014 >> RIP: 0010:__mutex_lock+0xcf0/0x1220 >> Code: 08 84 d2 0f 85 c8 04 00 00 8b 15 ef b6 c8 01 85 d2 0f 85 78 f4 >> ff ff 48 c7 c6 20 93 ee af 48 c7 c7 60 91 ee af e8 f0 a7 6d fd <0f> 0b >> e9 5e f4 ff ff 48 b8 00 00 00 00 00 fc ff df 4c 89 f2 48 c1 >> RSP: 0018:ffff88811305f760 EFLAGS: 00010286 >> RAX: 0000000000000000 RBX: ffff88812c652058 RCX: 0000000000000000 >> RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000000000000001 >> RBP: ffff88811305f8b0 R08: 0000000000000001 R09: ffffed1075c36341 >> R10: ffff8883ae1b1a0b R11: 0000000000010498 R12: 0000000000000000 >> R13: 0000000000000000 R14: dffffc0000000000 R15: ffff88812c652058 >> FS: 00007f9713ae4980(0000) GS:ffff8883ae180000(0000) >> knlGS:0000000000000000 >> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >> CR2: 00007fcd78483c7c CR3: 0000000122c38000 CR4: 00000000000006f0 >> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 >> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 >> Call Trace: >> <TASK> >> ? __warn.cold+0x5b/0x1af >> ? __mutex_lock+0xcf0/0x1220 >> ? report_bug+0x1ec/0x390 >> ? handle_bug+0x3c/0x80 >> ? exc_invalid_op+0x13/0x40 >> ? asm_exc_invalid_op+0x16/0x20 >> ? __mutex_lock+0xcf0/0x1220 >> ? nvme_tcp_get_address+0xc2/0x1e0 [nvme_tcp] >> ? __pfx___mutex_lock+0x10/0x10 >> ? __lock_acquire+0xd6a/0x59e0 >> ? nvme_tcp_get_address+0xc2/0x1e0 [nvme_tcp] >> nvme_tcp_get_address+0xc2/0x1e0 [nvme_tcp] >> ? __pfx_nvme_tcp_get_address+0x10/0x10 [nvme_tcp] >> nvme_sysfs_show_address+0x81/0xc0 [nvme_core] >> dev_attr_show+0x42/0x80 >> ? __asan_memset+0x1f/0x40 >> sysfs_kf_seq_show+0x1f0/0x370 >> seq_read_iter+0x2cb/0x1130 >> ? rw_verify_area+0x3b1/0x590 >> ? __mutex_lock+0x433/0x1220 >> vfs_read+0x6a6/0xa20 >> ? lockdep_hardirqs_on+0x78/0x100 >> ? __pfx_vfs_read+0x10/0x10 >> ksys_read+0xf7/0x1d0 >> ? __pfx_ksys_read+0x10/0x10 >> ? __x64_sys_openat+0x105/0x1d0 >> do_syscall_64+0x93/0x180 >> ? lockdep_hardirqs_on_prepare+0x16d/0x400 >> ? do_syscall_64+0x9f/0x180 >> ? lockdep_hardirqs_on+0x78/0x100 >> ? do_syscall_64+0x9f/0x180 >> ? __pfx_ksys_read+0x10/0x10 >> ? lockdep_hardirqs_on_prepare+0x16d/0x400 >> ? do_syscall_64+0x9f/0x180 >> ? lockdep_hardirqs_on+0x78/0x100 >> ? do_syscall_64+0x9f/0x180 >> ? lockdep_hardirqs_on_prepare+0x16d/0x400 >> ? do_syscall_64+0x9f/0x180 >> ? lockdep_hardirqs_on+0x78/0x100 >> ? do_syscall_64+0x9f/0x180 >> ? lockdep_hardirqs_on_prepare+0x16d/0x400 >> ? do_syscall_64+0x9f/0x180 >> ? lockdep_hardirqs_on+0x78/0x100 >> ? do_syscall_64+0x9f/0x180 >> ? lockdep_hardirqs_on_prepare+0x16d/0x400 >> ? do_syscall_64+0x9f/0x180 >> ? lockdep_hardirqs_on+0x78/0x100 >> ? do_syscall_64+0x9f/0x180 >> ? do_syscall_64+0x9f/0x180 >> entry_SYSCALL_64_after_hwframe+0x76/0x7e >> RIP: 0033:0x7f9713f55cfa >> Code: 55 48 89 e5 48 83 ec 20 48 89 55 e8 48 89 75 f0 89 7d f8 e8 e8 >> 74 f8 ff 48 8b 55 e8 48 8b 75 f0 41 89 c0 8b 7d f8 31 c0 0f 05 <48> 3d >> 00 f0 ff ff 77 2e 44 89 c7 48 89 45 f8 e8 42 75 f8 ff 48 8b >> RSP: 002b:00007ffd7f512e70 EFLAGS: 00000246 ORIG_RAX: 0000000000000000 >> RAX: ffffffffffffffda RBX: 000055c38f316859 RCX: 00007f9713f55cfa >> RDX: 0000000000000fff RSI: 00007ffd7f512eb0 RDI: 0000000000000011 >> RBP: 00007ffd7f512e90 R08: 0000000000000000 R09: 00000000ffffffff >> R10: 0000000000000000 R11: 0000000000000246 R12: 000055c38f317148 >> R13: 0000000000000000 R14: 00007f96f4004f30 R15: 000055c3b6b623c0 >> </TASK> >> >> The WARN is observed when the blktests test case nvme/014 is repeated >> with tcp transport. It is rare, and 200 times repeat is required to >> recreate in my test environment. >> >> To avoid the WARN, serialize the two functions nvme_tcp_get_address() >> and nvme_tcp_free_queue(). Add the flag NVME_TCP_Q_REFERRED and set it >> when nvme_tcp_get_address() starts. Have nvme_tcp_free_queue() wait the >> flag cleared so that it does not run until nvme_tcp_get_address() ends. >> Also check the NVME_TCP_Q_ALLOCATED flag when nvme_tcp_get_address() >> starts. This flag is cleared at the nvme_tcp_free_queue() start, then it >> ensures that nvme_tcp_get_address() is not executed while >> nvme_tcp_free_queue() runs. >> >> Of note is that nvme_tcp_stop_queue() locks the queue->queue_lock in >> same manner as nvme_tcp_get_address(). However, nvme_tcp_stop_queue() is >> called in sequence before nvme_tcp_free_queue(), then it is safe to >> assume that the queue->queue_lock is not yet destroyed at >> nvme_tcp_stop_queue() call. Therefore this fix does not touch >> nvme_tcp_stop_queue(). >> >> Fixes: 76d54bf20cdc ("nvme-tcp: don't access released socket during >> error recovery") >> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> >> --- >> drivers/nvme/host/tcp.c | 16 +++++++++++++++- >> 1 file changed, 15 insertions(+), 1 deletion(-) >> >> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c >> index a2a47d3ab99f..1457ffda207b 100644 >> --- a/drivers/nvme/host/tcp.c >> +++ b/drivers/nvme/host/tcp.c >> @@ -127,6 +127,7 @@ enum nvme_tcp_queue_flags { >> NVME_TCP_Q_ALLOCATED = 0, >> NVME_TCP_Q_LIVE = 1, >> NVME_TCP_Q_POLLING = 2, >> + NVME_TCP_Q_REFERRED = 3, >> }; >> enum nvme_tcp_recv_state { >> @@ -1365,6 +1366,9 @@ static void nvme_tcp_free_queue(struct nvme_ctrl >> *nctrl, int qid) >> if (!test_and_clear_bit(NVME_TCP_Q_ALLOCATED, &queue->flags)) >> return; >> + wait_on_bit_io(&queue->flags, NVME_TCP_Q_REFERRED, >> + TASK_UNINTERRUPTIBLE); >> + >> if (queue->hdr_digest || queue->data_digest) >> nvme_tcp_free_crypto(queue); >> @@ -2613,7 +2617,14 @@ static int nvme_tcp_get_address(struct >> nvme_ctrl *ctrl, char *buf, int size) >> { >> struct nvme_tcp_queue *queue = &to_tcp_ctrl(ctrl)->queues[0]; >> struct sockaddr_storage src_addr; >> - int ret, len; >> + int ret, len = 0; >> + >> + /* Serialize this function and nvme_tcp_free_queue() */ >> + if (test_and_set_bit(NVME_TCP_Q_REFERRED, &queue->flags)) >> + return 0; >> + >> + if (!test_bit(NVME_TCP_Q_ALLOCATED, &queue->flags)) >> + goto out; >> len = nvmf_get_address(ctrl, buf, size); >> @@ -2631,6 +2642,9 @@ static int nvme_tcp_get_address(struct nvme_ctrl >> *ctrl, char *buf, int size) >> done: >> mutex_unlock(&queue->queue_lock); >> +out: >> + clear_and_wake_up_bit(NVME_TCP_Q_REFERRED, &queue->flags); >> + >> return len; >> } > > This is really not needed.. You can already serialize with: > -- > diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c > index a2a47d3ab99f..91c19d06cae9 100644 > --- a/drivers/nvme/host/tcp.c > +++ b/drivers/nvme/host/tcp.c > @@ -1365,6 +1365,10 @@ static void nvme_tcp_free_queue(struct nvme_ctrl > *nctrl, int qid) > if (!test_and_clear_bit(NVME_TCP_Q_ALLOCATED, &queue->flags)) > return; > > + /* make sure any concurrent .get_address completes */ > + mutex_lock(&queue->queue_lock); > + mutex_unlock(&queue->queue_lock); > + > if (queue->hdr_digest || queue->data_digest) > nvme_tcp_free_crypto(queue); Wouldn't it be sufficient to check for NVME_TCP_Q_LIVE outside of the mutex? diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c index 76c8e01d8f08..0db3dde3a478 100644 --- a/drivers/nvme/host/tcp.c +++ b/drivers/nvme/host/tcp.c @@ -2688,10 +2688,11 @@ static int nvme_tcp_get_address(struct nvme_ctrl *ctrl, char *buf, int size) len = nvmf_get_address(ctrl, buf, size); + if (!test_bit(NVME_TCP_Q_LIVE, &queue->flags)) + return len; + mutex_lock(&queue->queue_lock); - if (!test_bit(NVME_TCP_Q_LIVE, &queue->flags)) - goto done; ret = kernel_getsockname(queue->sock, (struct sockaddr *)&src_addr); if (ret > 0) { if (len > 0) Q_LIVE is cleared _way_ earlier than the queue_mutex, so by the time nvme_tcp_free_queue() is called the bit is already cleared. Cheers, Hannes -- Dr. Hannes Reinecke Kernel Storage Architect hare@suse.de +49 911 74053 688 SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich ^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [PATCH] nvme: tcp: avoid race between lock and destroy of queue_lock 2024-09-12 9:16 ` Hannes Reinecke @ 2024-09-12 10:20 ` Sagi Grimberg 2024-09-12 13:28 ` Shinichiro Kawasaki 0 siblings, 1 reply; 7+ messages in thread From: Sagi Grimberg @ 2024-09-12 10:20 UTC (permalink / raw) To: Hannes Reinecke, Shin'ichiro Kawasaki, linux-nvme, Keith Busch Cc: Akinobu Mita, Martin Belanger, Christoph Hellwig On 12/09/2024 12:16, Hannes Reinecke wrote: > On 9/12/24 08:59, Sagi Grimberg wrote: >> >> >> >> On 12/09/2024 9:27, Shin'ichiro Kawasaki wrote: >>> Commit 76d54bf20cdc ("nvme-tcp: don't access released socket during >>> error recovery") added a mutex_lock() call for the queue->queue_lock >>> in nvme_tcp_get_address(). However, the mutex_lock() races with >>> mutex_destroy() in nvme_tcp_free_queue(), and causes the WARN below. >>> >>> DEBUG_LOCKS_WARN_ON(lock->magic != lock) >>> WARNING: CPU: 3 PID: 34077 at kernel/locking/mutex.c:587 >>> __mutex_lock+0xcf0/0x1220 >>> Modules linked in: nvmet_tcp nvmet nvme_tcp nvme_fabrics iw_cm ib_cm >>> ib_core pktcdvd nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib >>> nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct >>> nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 >>> ip_set nf_tables qrtr sunrpc ppdev 9pnet_virtio 9pnet pcspkr netfs >>> parport_pc parport e1000 i2c_piix4 i2c_smbus loop fuse nfnetlink >>> zram bochs drm_vram_helper drm_ttm_helper ttm drm_kms_helper xfs drm >>> sym53c8xx floppy nvme scsi_transport_spi nvme_core nvme_auth >>> serio_raw ata_generic pata_acpi dm_multipath qemu_fw_cfg [last >>> unloaded: ib_uverbs] >>> CPU: 3 UID: 0 PID: 34077 Comm: udisksd Not tainted 6.11.0-rc7 #319 >>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS >>> 1.16.3-2.fc40 04/01/2014 >>> RIP: 0010:__mutex_lock+0xcf0/0x1220 >>> Code: 08 84 d2 0f 85 c8 04 00 00 8b 15 ef b6 c8 01 85 d2 0f 85 78 f4 >>> ff ff 48 c7 c6 20 93 ee af 48 c7 c7 60 91 ee af e8 f0 a7 6d fd <0f> >>> 0b e9 5e f4 ff ff 48 b8 00 00 00 00 00 fc ff df 4c 89 f2 48 c1 >>> RSP: 0018:ffff88811305f760 EFLAGS: 00010286 >>> RAX: 0000000000000000 RBX: ffff88812c652058 RCX: 0000000000000000 >>> RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000000000000001 >>> RBP: ffff88811305f8b0 R08: 0000000000000001 R09: ffffed1075c36341 >>> R10: ffff8883ae1b1a0b R11: 0000000000010498 R12: 0000000000000000 >>> R13: 0000000000000000 R14: dffffc0000000000 R15: ffff88812c652058 >>> FS: 00007f9713ae4980(0000) GS:ffff8883ae180000(0000) >>> knlGS:0000000000000000 >>> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >>> CR2: 00007fcd78483c7c CR3: 0000000122c38000 CR4: 00000000000006f0 >>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 >>> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 >>> Call Trace: >>> <TASK> >>> ? __warn.cold+0x5b/0x1af >>> ? __mutex_lock+0xcf0/0x1220 >>> ? report_bug+0x1ec/0x390 >>> ? handle_bug+0x3c/0x80 >>> ? exc_invalid_op+0x13/0x40 >>> ? asm_exc_invalid_op+0x16/0x20 >>> ? __mutex_lock+0xcf0/0x1220 >>> ? nvme_tcp_get_address+0xc2/0x1e0 [nvme_tcp] >>> ? __pfx___mutex_lock+0x10/0x10 >>> ? __lock_acquire+0xd6a/0x59e0 >>> ? nvme_tcp_get_address+0xc2/0x1e0 [nvme_tcp] >>> nvme_tcp_get_address+0xc2/0x1e0 [nvme_tcp] >>> ? __pfx_nvme_tcp_get_address+0x10/0x10 [nvme_tcp] >>> nvme_sysfs_show_address+0x81/0xc0 [nvme_core] >>> dev_attr_show+0x42/0x80 >>> ? __asan_memset+0x1f/0x40 >>> sysfs_kf_seq_show+0x1f0/0x370 >>> seq_read_iter+0x2cb/0x1130 >>> ? rw_verify_area+0x3b1/0x590 >>> ? __mutex_lock+0x433/0x1220 >>> vfs_read+0x6a6/0xa20 >>> ? lockdep_hardirqs_on+0x78/0x100 >>> ? __pfx_vfs_read+0x10/0x10 >>> ksys_read+0xf7/0x1d0 >>> ? __pfx_ksys_read+0x10/0x10 >>> ? __x64_sys_openat+0x105/0x1d0 >>> do_syscall_64+0x93/0x180 >>> ? lockdep_hardirqs_on_prepare+0x16d/0x400 >>> ? do_syscall_64+0x9f/0x180 >>> ? lockdep_hardirqs_on+0x78/0x100 >>> ? do_syscall_64+0x9f/0x180 >>> ? __pfx_ksys_read+0x10/0x10 >>> ? lockdep_hardirqs_on_prepare+0x16d/0x400 >>> ? do_syscall_64+0x9f/0x180 >>> ? lockdep_hardirqs_on+0x78/0x100 >>> ? do_syscall_64+0x9f/0x180 >>> ? lockdep_hardirqs_on_prepare+0x16d/0x400 >>> ? do_syscall_64+0x9f/0x180 >>> ? lockdep_hardirqs_on+0x78/0x100 >>> ? do_syscall_64+0x9f/0x180 >>> ? lockdep_hardirqs_on_prepare+0x16d/0x400 >>> ? do_syscall_64+0x9f/0x180 >>> ? lockdep_hardirqs_on+0x78/0x100 >>> ? do_syscall_64+0x9f/0x180 >>> ? lockdep_hardirqs_on_prepare+0x16d/0x400 >>> ? do_syscall_64+0x9f/0x180 >>> ? lockdep_hardirqs_on+0x78/0x100 >>> ? do_syscall_64+0x9f/0x180 >>> ? do_syscall_64+0x9f/0x180 >>> entry_SYSCALL_64_after_hwframe+0x76/0x7e >>> RIP: 0033:0x7f9713f55cfa >>> Code: 55 48 89 e5 48 83 ec 20 48 89 55 e8 48 89 75 f0 89 7d f8 e8 e8 >>> 74 f8 ff 48 8b 55 e8 48 8b 75 f0 41 89 c0 8b 7d f8 31 c0 0f 05 <48> >>> 3d 00 f0 ff ff 77 2e 44 89 c7 48 89 45 f8 e8 42 75 f8 ff 48 8b >>> RSP: 002b:00007ffd7f512e70 EFLAGS: 00000246 ORIG_RAX: 0000000000000000 >>> RAX: ffffffffffffffda RBX: 000055c38f316859 RCX: 00007f9713f55cfa >>> RDX: 0000000000000fff RSI: 00007ffd7f512eb0 RDI: 0000000000000011 >>> RBP: 00007ffd7f512e90 R08: 0000000000000000 R09: 00000000ffffffff >>> R10: 0000000000000000 R11: 0000000000000246 R12: 000055c38f317148 >>> R13: 0000000000000000 R14: 00007f96f4004f30 R15: 000055c3b6b623c0 >>> </TASK> >>> >>> The WARN is observed when the blktests test case nvme/014 is repeated >>> with tcp transport. It is rare, and 200 times repeat is required to >>> recreate in my test environment. >>> >>> To avoid the WARN, serialize the two functions nvme_tcp_get_address() >>> and nvme_tcp_free_queue(). Add the flag NVME_TCP_Q_REFERRED and set it >>> when nvme_tcp_get_address() starts. Have nvme_tcp_free_queue() wait the >>> flag cleared so that it does not run until nvme_tcp_get_address() ends. >>> Also check the NVME_TCP_Q_ALLOCATED flag when nvme_tcp_get_address() >>> starts. This flag is cleared at the nvme_tcp_free_queue() start, >>> then it >>> ensures that nvme_tcp_get_address() is not executed while >>> nvme_tcp_free_queue() runs. >>> >>> Of note is that nvme_tcp_stop_queue() locks the queue->queue_lock in >>> same manner as nvme_tcp_get_address(). However, >>> nvme_tcp_stop_queue() is >>> called in sequence before nvme_tcp_free_queue(), then it is safe to >>> assume that the queue->queue_lock is not yet destroyed at >>> nvme_tcp_stop_queue() call. Therefore this fix does not touch >>> nvme_tcp_stop_queue(). >>> >>> Fixes: 76d54bf20cdc ("nvme-tcp: don't access released socket during >>> error recovery") >>> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> >>> --- >>> drivers/nvme/host/tcp.c | 16 +++++++++++++++- >>> 1 file changed, 15 insertions(+), 1 deletion(-) >>> >>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c >>> index a2a47d3ab99f..1457ffda207b 100644 >>> --- a/drivers/nvme/host/tcp.c >>> +++ b/drivers/nvme/host/tcp.c >>> @@ -127,6 +127,7 @@ enum nvme_tcp_queue_flags { >>> NVME_TCP_Q_ALLOCATED = 0, >>> NVME_TCP_Q_LIVE = 1, >>> NVME_TCP_Q_POLLING = 2, >>> + NVME_TCP_Q_REFERRED = 3, >>> }; >>> enum nvme_tcp_recv_state { >>> @@ -1365,6 +1366,9 @@ static void nvme_tcp_free_queue(struct >>> nvme_ctrl *nctrl, int qid) >>> if (!test_and_clear_bit(NVME_TCP_Q_ALLOCATED, &queue->flags)) >>> return; >>> + wait_on_bit_io(&queue->flags, NVME_TCP_Q_REFERRED, >>> + TASK_UNINTERRUPTIBLE); >>> + >>> if (queue->hdr_digest || queue->data_digest) >>> nvme_tcp_free_crypto(queue); >>> @@ -2613,7 +2617,14 @@ static int nvme_tcp_get_address(struct >>> nvme_ctrl *ctrl, char *buf, int size) >>> { >>> struct nvme_tcp_queue *queue = &to_tcp_ctrl(ctrl)->queues[0]; >>> struct sockaddr_storage src_addr; >>> - int ret, len; >>> + int ret, len = 0; >>> + >>> + /* Serialize this function and nvme_tcp_free_queue() */ >>> + if (test_and_set_bit(NVME_TCP_Q_REFERRED, &queue->flags)) >>> + return 0; >>> + >>> + if (!test_bit(NVME_TCP_Q_ALLOCATED, &queue->flags)) >>> + goto out; >>> len = nvmf_get_address(ctrl, buf, size); >>> @@ -2631,6 +2642,9 @@ static int nvme_tcp_get_address(struct >>> nvme_ctrl *ctrl, char *buf, int size) >>> done: >>> mutex_unlock(&queue->queue_lock); >>> +out: >>> + clear_and_wake_up_bit(NVME_TCP_Q_REFERRED, &queue->flags); >>> + >>> return len; >>> } >> >> This is really not needed.. You can already serialize with: >> -- >> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c >> index a2a47d3ab99f..91c19d06cae9 100644 >> --- a/drivers/nvme/host/tcp.c >> +++ b/drivers/nvme/host/tcp.c >> @@ -1365,6 +1365,10 @@ static void nvme_tcp_free_queue(struct >> nvme_ctrl *nctrl, int qid) >> if (!test_and_clear_bit(NVME_TCP_Q_ALLOCATED, &queue->flags)) >> return; >> >> + /* make sure any concurrent .get_address completes */ >> + mutex_lock(&queue->queue_lock); >> + mutex_unlock(&queue->queue_lock); >> + >> if (queue->hdr_digest || queue->data_digest) >> nvme_tcp_free_crypto(queue); > > Wouldn't it be sufficient to check for NVME_TCP_Q_LIVE outside of the > mutex? I think its intentional, so this serializes against nvme_tcp_stop_queue() ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] nvme: tcp: avoid race between lock and destroy of queue_lock 2024-09-12 10:20 ` Sagi Grimberg @ 2024-09-12 13:28 ` Shinichiro Kawasaki 2024-10-01 9:38 ` Shinichiro Kawasaki 0 siblings, 1 reply; 7+ messages in thread From: Shinichiro Kawasaki @ 2024-09-12 13:28 UTC (permalink / raw) To: Sagi Grimberg Cc: Hannes Reinecke, linux-nvme@lists.infradead.org, Keith Busch, Akinobu Mita, Martin Belanger, hch@infradead.org On Sep 12, 2024 / 13:20, Sagi Grimberg wrote: > > > > On 12/09/2024 12:16, Hannes Reinecke wrote: > > On 9/12/24 08:59, Sagi Grimberg wrote: > > > > > > > > > > > > On 12/09/2024 9:27, Shin'ichiro Kawasaki wrote: [...] > > > > > > This is really not needed.. You can already serialize with: > > > -- > > > diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c > > > index a2a47d3ab99f..91c19d06cae9 100644 > > > --- a/drivers/nvme/host/tcp.c > > > +++ b/drivers/nvme/host/tcp.c > > > @@ -1365,6 +1365,10 @@ static void nvme_tcp_free_queue(struct > > > nvme_ctrl *nctrl, int qid) > > > if (!test_and_clear_bit(NVME_TCP_Q_ALLOCATED, &queue->flags)) > > > return; > > > > > > + /* make sure any concurrent .get_address completes */ > > > + mutex_lock(&queue->queue_lock); > > > + mutex_unlock(&queue->queue_lock); > > > + > > > if (queue->hdr_digest || queue->data_digest) > > > nvme_tcp_free_crypto(queue); I applied this change above only on top of the kernel v6.11-rc7. Unfortunately, the WARN was still observed. > > > > Wouldn't it be sufficient to check for NVME_TCP_Q_LIVE outside of the > > mutex? > > I think its intentional, so this serializes against nvme_tcp_stop_queue() IIUC, NVME_TCP_Q_LIVE in the mutex region is required in nvme_tcp_free_queue(), and we can not simply move it out of the region. As a similar idea, I can think of doing the NVME_TCP_Q_LIVE check both in and out of the mutex region. The change below does it by adding the check out of the region: diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c index a2a47d3ab99f..ba91f98e90f7 100644 --- a/drivers/nvme/host/tcp.c +++ b/drivers/nvme/host/tcp.c @@ -2617,6 +2617,9 @@ static int nvme_tcp_get_address(struct nvme_ctrl *ctrl, char *buf, int size) len = nvmf_get_address(ctrl, buf, size); + if (!test_bit(NVME_TCP_Q_LIVE, &queue->flags)) + return len; + mutex_lock(&queue->queue_lock); if (!test_bit(NVME_TCP_Q_LIVE, &queue->flags)) With this change, I repeated the test case 400 times and the WARN was not observed. Good, this fix might be enough. Actually, once I had tried the similar fix using the NMVE_TCP_Q_ALLOCATED flag instead of nvme_TCP_Q_LIVE. In that case, the WARN had not been observed during the 400 times repeat. Both showed good result, but as Hannes noted, I agree that the check by NVME_TCP_Q_LIVE is the better because NVME_TCP_Q_LIVE is cleared earlier than NVME_TCP_Q_ALLOCATED, and there will be less chance the queue_lock gets destroyed before the mutex_lock(). Having said that, I still think there is a very tiny possibility that the queue_lock gets destroyed between the NVME_TCP_Q_LIVE flag check and the mutex_lock(). That is why I suggested the new flag NVME_TCP_Q_REFERRED. Am I overthinking? ^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [PATCH] nvme: tcp: avoid race between lock and destroy of queue_lock 2024-09-12 13:28 ` Shinichiro Kawasaki @ 2024-10-01 9:38 ` Shinichiro Kawasaki 2024-10-01 13:30 ` Hannes Reinecke 0 siblings, 1 reply; 7+ messages in thread From: Shinichiro Kawasaki @ 2024-10-01 9:38 UTC (permalink / raw) To: Hannes Reinecke Cc: Sagi Grimberg, linux-nvme@lists.infradead.org, Keith Busch, Akinobu Mita, Martin Belanger, hch@infradead.org On Sep 12, 2024 / 13:28, Shinichiro Kawasaki wrote: [...] > Having said that, I still think there is a very tiny possibility that the > queue_lock gets destroyed between the NVME_TCP_Q_LIVE flag check and the > mutex_lock(). That is why I suggested the new flag NVME_TCP_Q_REFERRED. > Am I overthinking? I thought about this again, now I think the patch I suggested is too much. The fix Hannes suggested is much simpler, and I confirmed it avoids the failure that I observed. Now I hope this fix to be applied. Hannes, May I send the fix as a formal patch on behalf of you? The fix will be as follows, and I will put your authorship to it. diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c index 89c44413c593..3e416af2659f 100644 --- a/drivers/nvme/host/tcp.c +++ b/drivers/nvme/host/tcp.c @@ -2644,10 +2644,11 @@ static int nvme_tcp_get_address(struct nvme_ctrl *ctrl, char *buf, int size) len = nvmf_get_address(ctrl, buf, size); + if (!test_bit(NVME_TCP_Q_LIVE, &queue->flags)) + return len; + mutex_lock(&queue->queue_lock); - if (!test_bit(NVME_TCP_Q_LIVE, &queue->flags)) - goto done; ret = kernel_getsockname(queue->sock, (struct sockaddr *)&src_addr); if (ret > 0) { if (len > 0) @@ -2655,7 +2656,7 @@ static int nvme_tcp_get_address(struct nvme_ctrl *ctrl, char *buf, int size) len += scnprintf(buf + len, size - len, "%ssrc_addr=%pISc\n", (len) ? "," : "", &src_addr); } -done: + mutex_unlock(&queue->queue_lock); return len; ^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [PATCH] nvme: tcp: avoid race between lock and destroy of queue_lock 2024-10-01 9:38 ` Shinichiro Kawasaki @ 2024-10-01 13:30 ` Hannes Reinecke 0 siblings, 0 replies; 7+ messages in thread From: Hannes Reinecke @ 2024-10-01 13:30 UTC (permalink / raw) To: Shinichiro Kawasaki Cc: Sagi Grimberg, linux-nvme@lists.infradead.org, Keith Busch, Akinobu Mita, Martin Belanger, hch@infradead.org On 10/1/24 11:38, Shinichiro Kawasaki wrote: > On Sep 12, 2024 / 13:28, Shinichiro Kawasaki wrote: > [...] >> Having said that, I still think there is a very tiny possibility that the >> queue_lock gets destroyed between the NVME_TCP_Q_LIVE flag check and the >> mutex_lock(). That is why I suggested the new flag NVME_TCP_Q_REFERRED. >> Am I overthinking? > > I thought about this again, now I think the patch I suggested is too much. The > fix Hannes suggested is much simpler, and I confirmed it avoids the failure that > I observed. Now I hope this fix to be applied. > > Hannes, > > May I send the fix as a formal patch on behalf of you? The fix will be as > follows, and I will put your authorship to it. > > diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c > index 89c44413c593..3e416af2659f 100644 > --- a/drivers/nvme/host/tcp.c > +++ b/drivers/nvme/host/tcp.c > @@ -2644,10 +2644,11 @@ static int nvme_tcp_get_address(struct nvme_ctrl *ctrl, char *buf, int size) > > len = nvmf_get_address(ctrl, buf, size); > > + if (!test_bit(NVME_TCP_Q_LIVE, &queue->flags)) > + return len; > + > mutex_lock(&queue->queue_lock); > > - if (!test_bit(NVME_TCP_Q_LIVE, &queue->flags)) > - goto done; > ret = kernel_getsockname(queue->sock, (struct sockaddr *)&src_addr); > if (ret > 0) { > if (len > 0) > @@ -2655,7 +2656,7 @@ static int nvme_tcp_get_address(struct nvme_ctrl *ctrl, char *buf, int size) > len += scnprintf(buf + len, size - len, "%ssrc_addr=%pISc\n", > (len) ? "," : "", &src_addr); > } > -done: > + > mutex_unlock(&queue->queue_lock); > > return len; Sure, you're welcome to send it :-) Cheers, Hannes -- Dr. Hannes Reinecke Kernel Storage Architect hare@suse.de +49 911 74053 688 SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich ^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2024-10-01 14:32 UTC | newest] Thread overview: 7+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2024-09-12 6:27 [PATCH] nvme: tcp: avoid race between lock and destroy of queue_lock Shin'ichiro Kawasaki 2024-09-12 6:59 ` Sagi Grimberg 2024-09-12 9:16 ` Hannes Reinecke 2024-09-12 10:20 ` Sagi Grimberg 2024-09-12 13:28 ` Shinichiro Kawasaki 2024-10-01 9:38 ` Shinichiro Kawasaki 2024-10-01 13:30 ` Hannes Reinecke
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox