[PATCH] nvme: tcp: avoid race between lock and destroy of queue

Linux-NVME Archive on lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH] nvme: tcp: avoid race between lock and destroy of queue_lock
@ 2024-09-12  6:27 Shin'ichiro Kawasaki
  2024-09-12  6:59 ` Sagi Grimberg
  0 siblings, 1 reply; 7+ messages in thread
From: Shin'ichiro Kawasaki @ 2024-09-12  6:27 UTC (permalink / raw)
  To: linux-nvme, Keith Busch
  Cc: Akinobu Mita, Hannes Reinecke, Martin Belanger, Christoph Hellwig,
	Sagi Grimberg, Shin'ichiro Kawasaki

Commit 76d54bf20cdc ("nvme-tcp: don't access released socket during
error recovery") added a mutex_lock() call for the queue->queue_lock
in nvme_tcp_get_address(). However, the mutex_lock() races with
mutex_destroy() in nvme_tcp_free_queue(), and causes the WARN below.

DEBUG_LOCKS_WARN_ON(lock->magic != lock)
WARNING: CPU: 3 PID: 34077 at kernel/locking/mutex.c:587 __mutex_lock+0xcf0/0x1220
Modules linked in: nvmet_tcp nvmet nvme_tcp nvme_fabrics iw_cm ib_cm ib_core pktcdvd nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables qrtr sunrpc ppdev 9pnet_virtio 9pnet pcspkr netfs parport_pc parport e1000 i2c_piix4 i2c_smbus loop fuse nfnetlink zram bochs drm_vram_helper drm_ttm_helper ttm drm_kms_helper xfs drm sym53c8xx floppy nvme scsi_transport_spi nvme_core nvme_auth serio_raw ata_generic pata_acpi dm_multipath qemu_fw_cfg [last unloaded: ib_uverbs]
CPU: 3 UID: 0 PID: 34077 Comm: udisksd Not tainted 6.11.0-rc7 #319
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-2.fc40 04/01/2014
RIP: 0010:__mutex_lock+0xcf0/0x1220
Code: 08 84 d2 0f 85 c8 04 00 00 8b 15 ef b6 c8 01 85 d2 0f 85 78 f4 ff ff 48 c7 c6 20 93 ee af 48 c7 c7 60 91 ee af e8 f0 a7 6d fd <0f> 0b e9 5e f4 ff ff 48 b8 00 00 00 00 00 fc ff df 4c 89 f2 48 c1
RSP: 0018:ffff88811305f760 EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffff88812c652058 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000000000000001
RBP: ffff88811305f8b0 R08: 0000000000000001 R09: ffffed1075c36341
R10: ffff8883ae1b1a0b R11: 0000000000010498 R12: 0000000000000000
R13: 0000000000000000 R14: dffffc0000000000 R15: ffff88812c652058
FS:  00007f9713ae4980(0000) GS:ffff8883ae180000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fcd78483c7c CR3: 0000000122c38000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 ? __warn.cold+0x5b/0x1af
 ? __mutex_lock+0xcf0/0x1220
 ? report_bug+0x1ec/0x390
 ? handle_bug+0x3c/0x80
 ? exc_invalid_op+0x13/0x40
 ? asm_exc_invalid_op+0x16/0x20
 ? __mutex_lock+0xcf0/0x1220
 ? nvme_tcp_get_address+0xc2/0x1e0 [nvme_tcp]
 ? __pfx___mutex_lock+0x10/0x10
 ? __lock_acquire+0xd6a/0x59e0
 ? nvme_tcp_get_address+0xc2/0x1e0 [nvme_tcp]
 nvme_tcp_get_address+0xc2/0x1e0 [nvme_tcp]
 ? __pfx_nvme_tcp_get_address+0x10/0x10 [nvme_tcp]
 nvme_sysfs_show_address+0x81/0xc0 [nvme_core]
 dev_attr_show+0x42/0x80
 ? __asan_memset+0x1f/0x40
 sysfs_kf_seq_show+0x1f0/0x370
 seq_read_iter+0x2cb/0x1130
 ? rw_verify_area+0x3b1/0x590
 ? __mutex_lock+0x433/0x1220
 vfs_read+0x6a6/0xa20
 ? lockdep_hardirqs_on+0x78/0x100
 ? __pfx_vfs_read+0x10/0x10
 ksys_read+0xf7/0x1d0
 ? __pfx_ksys_read+0x10/0x10
 ? __x64_sys_openat+0x105/0x1d0
 do_syscall_64+0x93/0x180
 ? lockdep_hardirqs_on_prepare+0x16d/0x400
 ? do_syscall_64+0x9f/0x180
 ? lockdep_hardirqs_on+0x78/0x100
 ? do_syscall_64+0x9f/0x180
 ? __pfx_ksys_read+0x10/0x10
 ? lockdep_hardirqs_on_prepare+0x16d/0x400
 ? do_syscall_64+0x9f/0x180
 ? lockdep_hardirqs_on+0x78/0x100
 ? do_syscall_64+0x9f/0x180
 ? lockdep_hardirqs_on_prepare+0x16d/0x400
 ? do_syscall_64+0x9f/0x180
 ? lockdep_hardirqs_on+0x78/0x100
 ? do_syscall_64+0x9f/0x180
 ? lockdep_hardirqs_on_prepare+0x16d/0x400
 ? do_syscall_64+0x9f/0x180
 ? lockdep_hardirqs_on+0x78/0x100
 ? do_syscall_64+0x9f/0x180
 ? lockdep_hardirqs_on_prepare+0x16d/0x400
 ? do_syscall_64+0x9f/0x180
 ? lockdep_hardirqs_on+0x78/0x100
 ? do_syscall_64+0x9f/0x180
 ? do_syscall_64+0x9f/0x180
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f9713f55cfa
Code: 55 48 89 e5 48 83 ec 20 48 89 55 e8 48 89 75 f0 89 7d f8 e8 e8 74 f8 ff 48 8b 55 e8 48 8b 75 f0 41 89 c0 8b 7d f8 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 2e 44 89 c7 48 89 45 f8 e8 42 75 f8 ff 48 8b
RSP: 002b:00007ffd7f512e70 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
RAX: ffffffffffffffda RBX: 000055c38f316859 RCX: 00007f9713f55cfa
RDX: 0000000000000fff RSI: 00007ffd7f512eb0 RDI: 0000000000000011
RBP: 00007ffd7f512e90 R08: 0000000000000000 R09: 00000000ffffffff
R10: 0000000000000000 R11: 0000000000000246 R12: 000055c38f317148
R13: 0000000000000000 R14: 00007f96f4004f30 R15: 000055c3b6b623c0
 </TASK>

The WARN is observed when the blktests test case nvme/014 is repeated
with tcp transport. It is rare, and 200 times repeat is required to
recreate in my test environment.

To avoid the WARN, serialize the two functions nvme_tcp_get_address()
and nvme_tcp_free_queue(). Add the flag NVME_TCP_Q_REFERRED and set it
when nvme_tcp_get_address() starts. Have nvme_tcp_free_queue() wait the
flag cleared so that it does not run until nvme_tcp_get_address() ends.
Also check the NVME_TCP_Q_ALLOCATED flag when nvme_tcp_get_address()
starts. This flag is cleared at the nvme_tcp_free_queue() start, then it
ensures that nvme_tcp_get_address() is not executed while
nvme_tcp_free_queue() runs.

Of note is that nvme_tcp_stop_queue() locks the queue->queue_lock in
same manner as nvme_tcp_get_address(). However, nvme_tcp_stop_queue() is
called in sequence before nvme_tcp_free_queue(), then it is safe to
assume that the queue->queue_lock is not yet destroyed at
nvme_tcp_stop_queue() call. Therefore this fix does not touch
nvme_tcp_stop_queue().

Fixes: 76d54bf20cdc ("nvme-tcp: don't access released socket during error recovery")
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
---
 drivers/nvme/host/tcp.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index a2a47d3ab99f..1457ffda207b 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -127,6 +127,7 @@ enum nvme_tcp_queue_flags {
 	NVME_TCP_Q_ALLOCATED	= 0,
 	NVME_TCP_Q_LIVE		= 1,
 	NVME_TCP_Q_POLLING	= 2,
+	NVME_TCP_Q_REFERRED	= 3,
 };
 
 enum nvme_tcp_recv_state {
@@ -1365,6 +1366,9 @@ static void nvme_tcp_free_queue(struct nvme_ctrl *nctrl, int qid)
 	if (!test_and_clear_bit(NVME_TCP_Q_ALLOCATED, &queue->flags))
 		return;
 
+	wait_on_bit_io(&queue->flags, NVME_TCP_Q_REFERRED,
+		       TASK_UNINTERRUPTIBLE);
+
 	if (queue->hdr_digest || queue->data_digest)
 		nvme_tcp_free_crypto(queue);
 
@@ -2613,7 +2617,14 @@ static int nvme_tcp_get_address(struct nvme_ctrl *ctrl, char *buf, int size)
 {
 	struct nvme_tcp_queue *queue = &to_tcp_ctrl(ctrl)->queues[0];
 	struct sockaddr_storage src_addr;
-	int ret, len;
+	int ret, len = 0;
+
+	/* Serialize this function and nvme_tcp_free_queue() */
+	if (test_and_set_bit(NVME_TCP_Q_REFERRED, &queue->flags))
+		return 0;
+
+	if (!test_bit(NVME_TCP_Q_ALLOCATED, &queue->flags))
+		goto out;
 
 	len = nvmf_get_address(ctrl, buf, size);
 
@@ -2631,6 +2642,9 @@ static int nvme_tcp_get_address(struct nvme_ctrl *ctrl, char *buf, int size)
 done:
 	mutex_unlock(&queue->queue_lock);
 
+out:
+	clear_and_wake_up_bit(NVME_TCP_Q_REFERRED, &queue->flags);
+
 	return len;
 }
 
-- 
2.45.2



^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH] nvme: tcp: avoid race between lock and destroy of queue_lock
  2024-09-12  6:27 [PATCH] nvme: tcp: avoid race between lock and destroy of queue_lock Shin'ichiro Kawasaki
@ 2024-09-12  6:59 ` Sagi Grimberg
  2024-09-12  9:16   ` Hannes Reinecke
  0 siblings, 1 reply; 7+ messages in thread
From: Sagi Grimberg @ 2024-09-12  6:59 UTC (permalink / raw)
  To: Shin'ichiro Kawasaki, linux-nvme, Keith Busch
  Cc: Akinobu Mita, Hannes Reinecke, Martin Belanger, Christoph Hellwig




On 12/09/2024 9:27, Shin'ichiro Kawasaki wrote:
> Commit 76d54bf20cdc ("nvme-tcp: don't access released socket during
> error recovery") added a mutex_lock() call for the queue->queue_lock
> in nvme_tcp_get_address(). However, the mutex_lock() races with
> mutex_destroy() in nvme_tcp_free_queue(), and causes the WARN below.
>
> DEBUG_LOCKS_WARN_ON(lock->magic != lock)
> WARNING: CPU: 3 PID: 34077 at kernel/locking/mutex.c:587 __mutex_lock+0xcf0/0x1220
> Modules linked in: nvmet_tcp nvmet nvme_tcp nvme_fabrics iw_cm ib_cm ib_core pktcdvd nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables qrtr sunrpc ppdev 9pnet_virtio 9pnet pcspkr netfs parport_pc parport e1000 i2c_piix4 i2c_smbus loop fuse nfnetlink zram bochs drm_vram_helper drm_ttm_helper ttm drm_kms_helper xfs drm sym53c8xx floppy nvme scsi_transport_spi nvme_core nvme_auth serio_raw ata_generic pata_acpi dm_multipath qemu_fw_cfg [last unloaded: ib_uverbs]
> CPU: 3 UID: 0 PID: 34077 Comm: udisksd Not tainted 6.11.0-rc7 #319
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-2.fc40 04/01/2014
> RIP: 0010:__mutex_lock+0xcf0/0x1220
> Code: 08 84 d2 0f 85 c8 04 00 00 8b 15 ef b6 c8 01 85 d2 0f 85 78 f4 ff ff 48 c7 c6 20 93 ee af 48 c7 c7 60 91 ee af e8 f0 a7 6d fd <0f> 0b e9 5e f4 ff ff 48 b8 00 00 00 00 00 fc ff df 4c 89 f2 48 c1
> RSP: 0018:ffff88811305f760 EFLAGS: 00010286
> RAX: 0000000000000000 RBX: ffff88812c652058 RCX: 0000000000000000
> RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000000000000001
> RBP: ffff88811305f8b0 R08: 0000000000000001 R09: ffffed1075c36341
> R10: ffff8883ae1b1a0b R11: 0000000000010498 R12: 0000000000000000
> R13: 0000000000000000 R14: dffffc0000000000 R15: ffff88812c652058
> FS:  00007f9713ae4980(0000) GS:ffff8883ae180000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007fcd78483c7c CR3: 0000000122c38000 CR4: 00000000000006f0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Call Trace:
>   <TASK>
>   ? __warn.cold+0x5b/0x1af
>   ? __mutex_lock+0xcf0/0x1220
>   ? report_bug+0x1ec/0x390
>   ? handle_bug+0x3c/0x80
>   ? exc_invalid_op+0x13/0x40
>   ? asm_exc_invalid_op+0x16/0x20
>   ? __mutex_lock+0xcf0/0x1220
>   ? nvme_tcp_get_address+0xc2/0x1e0 [nvme_tcp]
>   ? __pfx___mutex_lock+0x10/0x10
>   ? __lock_acquire+0xd6a/0x59e0
>   ? nvme_tcp_get_address+0xc2/0x1e0 [nvme_tcp]
>   nvme_tcp_get_address+0xc2/0x1e0 [nvme_tcp]
>   ? __pfx_nvme_tcp_get_address+0x10/0x10 [nvme_tcp]
>   nvme_sysfs_show_address+0x81/0xc0 [nvme_core]
>   dev_attr_show+0x42/0x80
>   ? __asan_memset+0x1f/0x40
>   sysfs_kf_seq_show+0x1f0/0x370
>   seq_read_iter+0x2cb/0x1130
>   ? rw_verify_area+0x3b1/0x590
>   ? __mutex_lock+0x433/0x1220
>   vfs_read+0x6a6/0xa20
>   ? lockdep_hardirqs_on+0x78/0x100
>   ? __pfx_vfs_read+0x10/0x10
>   ksys_read+0xf7/0x1d0
>   ? __pfx_ksys_read+0x10/0x10
>   ? __x64_sys_openat+0x105/0x1d0
>   do_syscall_64+0x93/0x180
>   ? lockdep_hardirqs_on_prepare+0x16d/0x400
>   ? do_syscall_64+0x9f/0x180
>   ? lockdep_hardirqs_on+0x78/0x100
>   ? do_syscall_64+0x9f/0x180
>   ? __pfx_ksys_read+0x10/0x10
>   ? lockdep_hardirqs_on_prepare+0x16d/0x400
>   ? do_syscall_64+0x9f/0x180
>   ? lockdep_hardirqs_on+0x78/0x100
>   ? do_syscall_64+0x9f/0x180
>   ? lockdep_hardirqs_on_prepare+0x16d/0x400
>   ? do_syscall_64+0x9f/0x180
>   ? lockdep_hardirqs_on+0x78/0x100
>   ? do_syscall_64+0x9f/0x180
>   ? lockdep_hardirqs_on_prepare+0x16d/0x400
>   ? do_syscall_64+0x9f/0x180
>   ? lockdep_hardirqs_on+0x78/0x100
>   ? do_syscall_64+0x9f/0x180
>   ? lockdep_hardirqs_on_prepare+0x16d/0x400
>   ? do_syscall_64+0x9f/0x180
>   ? lockdep_hardirqs_on+0x78/0x100
>   ? do_syscall_64+0x9f/0x180
>   ? do_syscall_64+0x9f/0x180
>   entry_SYSCALL_64_after_hwframe+0x76/0x7e
> RIP: 0033:0x7f9713f55cfa
> Code: 55 48 89 e5 48 83 ec 20 48 89 55 e8 48 89 75 f0 89 7d f8 e8 e8 74 f8 ff 48 8b 55 e8 48 8b 75 f0 41 89 c0 8b 7d f8 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 2e 44 89 c7 48 89 45 f8 e8 42 75 f8 ff 48 8b
> RSP: 002b:00007ffd7f512e70 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
> RAX: ffffffffffffffda RBX: 000055c38f316859 RCX: 00007f9713f55cfa
> RDX: 0000000000000fff RSI: 00007ffd7f512eb0 RDI: 0000000000000011
> RBP: 00007ffd7f512e90 R08: 0000000000000000 R09: 00000000ffffffff
> R10: 0000000000000000 R11: 0000000000000246 R12: 000055c38f317148
> R13: 0000000000000000 R14: 00007f96f4004f30 R15: 000055c3b6b623c0
>   </TASK>
>
> The WARN is observed when the blktests test case nvme/014 is repeated
> with tcp transport. It is rare, and 200 times repeat is required to
> recreate in my test environment.
>
> To avoid the WARN, serialize the two functions nvme_tcp_get_address()
> and nvme_tcp_free_queue(). Add the flag NVME_TCP_Q_REFERRED and set it
> when nvme_tcp_get_address() starts. Have nvme_tcp_free_queue() wait the
> flag cleared so that it does not run until nvme_tcp_get_address() ends.
> Also check the NVME_TCP_Q_ALLOCATED flag when nvme_tcp_get_address()
> starts. This flag is cleared at the nvme_tcp_free_queue() start, then it
> ensures that nvme_tcp_get_address() is not executed while
> nvme_tcp_free_queue() runs.
>
> Of note is that nvme_tcp_stop_queue() locks the queue->queue_lock in
> same manner as nvme_tcp_get_address(). However, nvme_tcp_stop_queue() is
> called in sequence before nvme_tcp_free_queue(), then it is safe to
> assume that the queue->queue_lock is not yet destroyed at
> nvme_tcp_stop_queue() call. Therefore this fix does not touch
> nvme_tcp_stop_queue().
>
> Fixes: 76d54bf20cdc ("nvme-tcp: don't access released socket during error recovery")
> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> ---
>   drivers/nvme/host/tcp.c | 16 +++++++++++++++-
>   1 file changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index a2a47d3ab99f..1457ffda207b 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -127,6 +127,7 @@ enum nvme_tcp_queue_flags {
>   	NVME_TCP_Q_ALLOCATED	= 0,
>   	NVME_TCP_Q_LIVE		= 1,
>   	NVME_TCP_Q_POLLING	= 2,
> +	NVME_TCP_Q_REFERRED	= 3,
>   };
>   
>   enum nvme_tcp_recv_state {
> @@ -1365,6 +1366,9 @@ static void nvme_tcp_free_queue(struct nvme_ctrl *nctrl, int qid)
>   	if (!test_and_clear_bit(NVME_TCP_Q_ALLOCATED, &queue->flags))
>   		return;
>   
> +	wait_on_bit_io(&queue->flags, NVME_TCP_Q_REFERRED,
> +		       TASK_UNINTERRUPTIBLE);
> +
>   	if (queue->hdr_digest || queue->data_digest)
>   		nvme_tcp_free_crypto(queue);
>   
> @@ -2613,7 +2617,14 @@ static int nvme_tcp_get_address(struct nvme_ctrl *ctrl, char *buf, int size)
>   {
>   	struct nvme_tcp_queue *queue = &to_tcp_ctrl(ctrl)->queues[0];
>   	struct sockaddr_storage src_addr;
> -	int ret, len;
> +	int ret, len = 0;
> +
> +	/* Serialize this function and nvme_tcp_free_queue() */
> +	if (test_and_set_bit(NVME_TCP_Q_REFERRED, &queue->flags))
> +		return 0;
> +
> +	if (!test_bit(NVME_TCP_Q_ALLOCATED, &queue->flags))
> +		goto out;
>   
>   	len = nvmf_get_address(ctrl, buf, size);
>   
> @@ -2631,6 +2642,9 @@ static int nvme_tcp_get_address(struct nvme_ctrl *ctrl, char *buf, int size)
>   done:
>   	mutex_unlock(&queue->queue_lock);
>   
> +out:
> +	clear_and_wake_up_bit(NVME_TCP_Q_REFERRED, &queue->flags);
> +
>   	return len;
>   }
>   

This is really not needed.. You can already serialize with:
--
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index a2a47d3ab99f..91c19d06cae9 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1365,6 +1365,10 @@ static void nvme_tcp_free_queue(struct nvme_ctrl 
*nctrl, int qid)
         if (!test_and_clear_bit(NVME_TCP_Q_ALLOCATED, &queue->flags))
                 return;

+       /* make sure any concurrent .get_address completes */
+       mutex_lock(&queue->queue_lock);
+       mutex_unlock(&queue->queue_lock);
+
         if (queue->hdr_digest || queue->data_digest)
                 nvme_tcp_free_crypto(queue);
-- 


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH] nvme: tcp: avoid race between lock and destroy of queue_lock
  2024-09-12  6:59 ` Sagi Grimberg
@ 2024-09-12  9:16   ` Hannes Reinecke
  2024-09-12 10:20     ` Sagi Grimberg
  0 siblings, 1 reply; 7+ messages in thread
From: Hannes Reinecke @ 2024-09-12  9:16 UTC (permalink / raw)
  To: Sagi Grimberg, Shin'ichiro Kawasaki, linux-nvme, Keith Busch
  Cc: Akinobu Mita, Martin Belanger, Christoph Hellwig

On 9/12/24 08:59, Sagi Grimberg wrote:
> 
> 
> 
> On 12/09/2024 9:27, Shin'ichiro Kawasaki wrote:
>> Commit 76d54bf20cdc ("nvme-tcp: don't access released socket during
>> error recovery") added a mutex_lock() call for the queue->queue_lock
>> in nvme_tcp_get_address(). However, the mutex_lock() races with
>> mutex_destroy() in nvme_tcp_free_queue(), and causes the WARN below.
>>
>> DEBUG_LOCKS_WARN_ON(lock->magic != lock)
>> WARNING: CPU: 3 PID: 34077 at kernel/locking/mutex.c:587 
>> __mutex_lock+0xcf0/0x1220
>> Modules linked in: nvmet_tcp nvmet nvme_tcp nvme_fabrics iw_cm ib_cm 
>> ib_core pktcdvd nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib 
>> nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct 
>> nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set 
>> nf_tables qrtr sunrpc ppdev 9pnet_virtio 9pnet pcspkr netfs parport_pc 
>> parport e1000 i2c_piix4 i2c_smbus loop fuse nfnetlink zram bochs 
>> drm_vram_helper drm_ttm_helper ttm drm_kms_helper xfs drm sym53c8xx 
>> floppy nvme scsi_transport_spi nvme_core nvme_auth serio_raw 
>> ata_generic pata_acpi dm_multipath qemu_fw_cfg [last unloaded: ib_uverbs]
>> CPU: 3 UID: 0 PID: 34077 Comm: udisksd Not tainted 6.11.0-rc7 #319
>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
>> 1.16.3-2.fc40 04/01/2014
>> RIP: 0010:__mutex_lock+0xcf0/0x1220
>> Code: 08 84 d2 0f 85 c8 04 00 00 8b 15 ef b6 c8 01 85 d2 0f 85 78 f4 
>> ff ff 48 c7 c6 20 93 ee af 48 c7 c7 60 91 ee af e8 f0 a7 6d fd <0f> 0b 
>> e9 5e f4 ff ff 48 b8 00 00 00 00 00 fc ff df 4c 89 f2 48 c1
>> RSP: 0018:ffff88811305f760 EFLAGS: 00010286
>> RAX: 0000000000000000 RBX: ffff88812c652058 RCX: 0000000000000000
>> RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000000000000001
>> RBP: ffff88811305f8b0 R08: 0000000000000001 R09: ffffed1075c36341
>> R10: ffff8883ae1b1a0b R11: 0000000000010498 R12: 0000000000000000
>> R13: 0000000000000000 R14: dffffc0000000000 R15: ffff88812c652058
>> FS:  00007f9713ae4980(0000) GS:ffff8883ae180000(0000) 
>> knlGS:0000000000000000
>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> CR2: 00007fcd78483c7c CR3: 0000000122c38000 CR4: 00000000000006f0
>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> Call Trace:
>>   <TASK>
>>   ? __warn.cold+0x5b/0x1af
>>   ? __mutex_lock+0xcf0/0x1220
>>   ? report_bug+0x1ec/0x390
>>   ? handle_bug+0x3c/0x80
>>   ? exc_invalid_op+0x13/0x40
>>   ? asm_exc_invalid_op+0x16/0x20
>>   ? __mutex_lock+0xcf0/0x1220
>>   ? nvme_tcp_get_address+0xc2/0x1e0 [nvme_tcp]
>>   ? __pfx___mutex_lock+0x10/0x10
>>   ? __lock_acquire+0xd6a/0x59e0
>>   ? nvme_tcp_get_address+0xc2/0x1e0 [nvme_tcp]
>>   nvme_tcp_get_address+0xc2/0x1e0 [nvme_tcp]
>>   ? __pfx_nvme_tcp_get_address+0x10/0x10 [nvme_tcp]
>>   nvme_sysfs_show_address+0x81/0xc0 [nvme_core]
>>   dev_attr_show+0x42/0x80
>>   ? __asan_memset+0x1f/0x40
>>   sysfs_kf_seq_show+0x1f0/0x370
>>   seq_read_iter+0x2cb/0x1130
>>   ? rw_verify_area+0x3b1/0x590
>>   ? __mutex_lock+0x433/0x1220
>>   vfs_read+0x6a6/0xa20
>>   ? lockdep_hardirqs_on+0x78/0x100
>>   ? __pfx_vfs_read+0x10/0x10
>>   ksys_read+0xf7/0x1d0
>>   ? __pfx_ksys_read+0x10/0x10
>>   ? __x64_sys_openat+0x105/0x1d0
>>   do_syscall_64+0x93/0x180
>>   ? lockdep_hardirqs_on_prepare+0x16d/0x400
>>   ? do_syscall_64+0x9f/0x180
>>   ? lockdep_hardirqs_on+0x78/0x100
>>   ? do_syscall_64+0x9f/0x180
>>   ? __pfx_ksys_read+0x10/0x10
>>   ? lockdep_hardirqs_on_prepare+0x16d/0x400
>>   ? do_syscall_64+0x9f/0x180
>>   ? lockdep_hardirqs_on+0x78/0x100
>>   ? do_syscall_64+0x9f/0x180
>>   ? lockdep_hardirqs_on_prepare+0x16d/0x400
>>   ? do_syscall_64+0x9f/0x180
>>   ? lockdep_hardirqs_on+0x78/0x100
>>   ? do_syscall_64+0x9f/0x180
>>   ? lockdep_hardirqs_on_prepare+0x16d/0x400
>>   ? do_syscall_64+0x9f/0x180
>>   ? lockdep_hardirqs_on+0x78/0x100
>>   ? do_syscall_64+0x9f/0x180
>>   ? lockdep_hardirqs_on_prepare+0x16d/0x400
>>   ? do_syscall_64+0x9f/0x180
>>   ? lockdep_hardirqs_on+0x78/0x100
>>   ? do_syscall_64+0x9f/0x180
>>   ? do_syscall_64+0x9f/0x180
>>   entry_SYSCALL_64_after_hwframe+0x76/0x7e
>> RIP: 0033:0x7f9713f55cfa
>> Code: 55 48 89 e5 48 83 ec 20 48 89 55 e8 48 89 75 f0 89 7d f8 e8 e8 
>> 74 f8 ff 48 8b 55 e8 48 8b 75 f0 41 89 c0 8b 7d f8 31 c0 0f 05 <48> 3d 
>> 00 f0 ff ff 77 2e 44 89 c7 48 89 45 f8 e8 42 75 f8 ff 48 8b
>> RSP: 002b:00007ffd7f512e70 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
>> RAX: ffffffffffffffda RBX: 000055c38f316859 RCX: 00007f9713f55cfa
>> RDX: 0000000000000fff RSI: 00007ffd7f512eb0 RDI: 0000000000000011
>> RBP: 00007ffd7f512e90 R08: 0000000000000000 R09: 00000000ffffffff
>> R10: 0000000000000000 R11: 0000000000000246 R12: 000055c38f317148
>> R13: 0000000000000000 R14: 00007f96f4004f30 R15: 000055c3b6b623c0
>>   </TASK>
>>
>> The WARN is observed when the blktests test case nvme/014 is repeated
>> with tcp transport. It is rare, and 200 times repeat is required to
>> recreate in my test environment.
>>
>> To avoid the WARN, serialize the two functions nvme_tcp_get_address()
>> and nvme_tcp_free_queue(). Add the flag NVME_TCP_Q_REFERRED and set it
>> when nvme_tcp_get_address() starts. Have nvme_tcp_free_queue() wait the
>> flag cleared so that it does not run until nvme_tcp_get_address() ends.
>> Also check the NVME_TCP_Q_ALLOCATED flag when nvme_tcp_get_address()
>> starts. This flag is cleared at the nvme_tcp_free_queue() start, then it
>> ensures that nvme_tcp_get_address() is not executed while
>> nvme_tcp_free_queue() runs.
>>
>> Of note is that nvme_tcp_stop_queue() locks the queue->queue_lock in
>> same manner as nvme_tcp_get_address(). However, nvme_tcp_stop_queue() is
>> called in sequence before nvme_tcp_free_queue(), then it is safe to
>> assume that the queue->queue_lock is not yet destroyed at
>> nvme_tcp_stop_queue() call. Therefore this fix does not touch
>> nvme_tcp_stop_queue().
>>
>> Fixes: 76d54bf20cdc ("nvme-tcp: don't access released socket during 
>> error recovery")
>> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
>> ---
>>   drivers/nvme/host/tcp.c | 16 +++++++++++++++-
>>   1 file changed, 15 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>> index a2a47d3ab99f..1457ffda207b 100644
>> --- a/drivers/nvme/host/tcp.c
>> +++ b/drivers/nvme/host/tcp.c
>> @@ -127,6 +127,7 @@ enum nvme_tcp_queue_flags {
>>       NVME_TCP_Q_ALLOCATED    = 0,
>>       NVME_TCP_Q_LIVE        = 1,
>>       NVME_TCP_Q_POLLING    = 2,
>> +    NVME_TCP_Q_REFERRED    = 3,
>>   };
>>   enum nvme_tcp_recv_state {
>> @@ -1365,6 +1366,9 @@ static void nvme_tcp_free_queue(struct nvme_ctrl 
>> *nctrl, int qid)
>>       if (!test_and_clear_bit(NVME_TCP_Q_ALLOCATED, &queue->flags))
>>           return;
>> +    wait_on_bit_io(&queue->flags, NVME_TCP_Q_REFERRED,
>> +               TASK_UNINTERRUPTIBLE);
>> +
>>       if (queue->hdr_digest || queue->data_digest)
>>           nvme_tcp_free_crypto(queue);
>> @@ -2613,7 +2617,14 @@ static int nvme_tcp_get_address(struct 
>> nvme_ctrl *ctrl, char *buf, int size)
>>   {
>>       struct nvme_tcp_queue *queue = &to_tcp_ctrl(ctrl)->queues[0];
>>       struct sockaddr_storage src_addr;
>> -    int ret, len;
>> +    int ret, len = 0;
>> +
>> +    /* Serialize this function and nvme_tcp_free_queue() */
>> +    if (test_and_set_bit(NVME_TCP_Q_REFERRED, &queue->flags))
>> +        return 0;
>> +
>> +    if (!test_bit(NVME_TCP_Q_ALLOCATED, &queue->flags))
>> +        goto out;
>>       len = nvmf_get_address(ctrl, buf, size);
>> @@ -2631,6 +2642,9 @@ static int nvme_tcp_get_address(struct nvme_ctrl 
>> *ctrl, char *buf, int size)
>>   done:
>>       mutex_unlock(&queue->queue_lock);
>> +out:
>> +    clear_and_wake_up_bit(NVME_TCP_Q_REFERRED, &queue->flags);
>> +
>>       return len;
>>   }
> 
> This is really not needed.. You can already serialize with:
> -- 
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index a2a47d3ab99f..91c19d06cae9 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -1365,6 +1365,10 @@ static void nvme_tcp_free_queue(struct nvme_ctrl 
> *nctrl, int qid)
>          if (!test_and_clear_bit(NVME_TCP_Q_ALLOCATED, &queue->flags))
>                  return;
> 
> +       /* make sure any concurrent .get_address completes */
> +       mutex_lock(&queue->queue_lock);
> +       mutex_unlock(&queue->queue_lock);
> +
>          if (queue->hdr_digest || queue->data_digest)
>                  nvme_tcp_free_crypto(queue);

Wouldn't it be sufficient to check for NVME_TCP_Q_LIVE outside of the 
mutex?

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 76c8e01d8f08..0db3dde3a478 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -2688,10 +2688,11 @@ static int nvme_tcp_get_address(struct nvme_ctrl 
*ctrl, char *buf, int size)

         len = nvmf_get_address(ctrl, buf, size);

+       if (!test_bit(NVME_TCP_Q_LIVE, &queue->flags))
+               return len;
+
         mutex_lock(&queue->queue_lock);

-       if (!test_bit(NVME_TCP_Q_LIVE, &queue->flags))
-               goto done;
         ret = kernel_getsockname(queue->sock, (struct sockaddr 
*)&src_addr);
         if (ret > 0) {
                 if (len > 0)


Q_LIVE is cleared _way_ earlier than the queue_mutex, so by the time 
nvme_tcp_free_queue() is called the bit is already cleared.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich



^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH] nvme: tcp: avoid race between lock and destroy of queue_lock
  2024-09-12  9:16   ` Hannes Reinecke
@ 2024-09-12 10:20     ` Sagi Grimberg
  2024-09-12 13:28       ` Shinichiro Kawasaki
  0 siblings, 1 reply; 7+ messages in thread
From: Sagi Grimberg @ 2024-09-12 10:20 UTC (permalink / raw)
  To: Hannes Reinecke, Shin'ichiro Kawasaki, linux-nvme,
	Keith Busch
  Cc: Akinobu Mita, Martin Belanger, Christoph Hellwig




On 12/09/2024 12:16, Hannes Reinecke wrote:
> On 9/12/24 08:59, Sagi Grimberg wrote:
>>
>>
>>
>> On 12/09/2024 9:27, Shin'ichiro Kawasaki wrote:
>>> Commit 76d54bf20cdc ("nvme-tcp: don't access released socket during
>>> error recovery") added a mutex_lock() call for the queue->queue_lock
>>> in nvme_tcp_get_address(). However, the mutex_lock() races with
>>> mutex_destroy() in nvme_tcp_free_queue(), and causes the WARN below.
>>>
>>> DEBUG_LOCKS_WARN_ON(lock->magic != lock)
>>> WARNING: CPU: 3 PID: 34077 at kernel/locking/mutex.c:587 
>>> __mutex_lock+0xcf0/0x1220
>>> Modules linked in: nvmet_tcp nvmet nvme_tcp nvme_fabrics iw_cm ib_cm 
>>> ib_core pktcdvd nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib 
>>> nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct 
>>> nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 
>>> ip_set nf_tables qrtr sunrpc ppdev 9pnet_virtio 9pnet pcspkr netfs 
>>> parport_pc parport e1000 i2c_piix4 i2c_smbus loop fuse nfnetlink 
>>> zram bochs drm_vram_helper drm_ttm_helper ttm drm_kms_helper xfs drm 
>>> sym53c8xx floppy nvme scsi_transport_spi nvme_core nvme_auth 
>>> serio_raw ata_generic pata_acpi dm_multipath qemu_fw_cfg [last 
>>> unloaded: ib_uverbs]
>>> CPU: 3 UID: 0 PID: 34077 Comm: udisksd Not tainted 6.11.0-rc7 #319
>>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
>>> 1.16.3-2.fc40 04/01/2014
>>> RIP: 0010:__mutex_lock+0xcf0/0x1220
>>> Code: 08 84 d2 0f 85 c8 04 00 00 8b 15 ef b6 c8 01 85 d2 0f 85 78 f4 
>>> ff ff 48 c7 c6 20 93 ee af 48 c7 c7 60 91 ee af e8 f0 a7 6d fd <0f> 
>>> 0b e9 5e f4 ff ff 48 b8 00 00 00 00 00 fc ff df 4c 89 f2 48 c1
>>> RSP: 0018:ffff88811305f760 EFLAGS: 00010286
>>> RAX: 0000000000000000 RBX: ffff88812c652058 RCX: 0000000000000000
>>> RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000000000000001
>>> RBP: ffff88811305f8b0 R08: 0000000000000001 R09: ffffed1075c36341
>>> R10: ffff8883ae1b1a0b R11: 0000000000010498 R12: 0000000000000000
>>> R13: 0000000000000000 R14: dffffc0000000000 R15: ffff88812c652058
>>> FS:  00007f9713ae4980(0000) GS:ffff8883ae180000(0000) 
>>> knlGS:0000000000000000
>>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> CR2: 00007fcd78483c7c CR3: 0000000122c38000 CR4: 00000000000006f0
>>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>> Call Trace:
>>>   <TASK>
>>>   ? __warn.cold+0x5b/0x1af
>>>   ? __mutex_lock+0xcf0/0x1220
>>>   ? report_bug+0x1ec/0x390
>>>   ? handle_bug+0x3c/0x80
>>>   ? exc_invalid_op+0x13/0x40
>>>   ? asm_exc_invalid_op+0x16/0x20
>>>   ? __mutex_lock+0xcf0/0x1220
>>>   ? nvme_tcp_get_address+0xc2/0x1e0 [nvme_tcp]
>>>   ? __pfx___mutex_lock+0x10/0x10
>>>   ? __lock_acquire+0xd6a/0x59e0
>>>   ? nvme_tcp_get_address+0xc2/0x1e0 [nvme_tcp]
>>>   nvme_tcp_get_address+0xc2/0x1e0 [nvme_tcp]
>>>   ? __pfx_nvme_tcp_get_address+0x10/0x10 [nvme_tcp]
>>>   nvme_sysfs_show_address+0x81/0xc0 [nvme_core]
>>>   dev_attr_show+0x42/0x80
>>>   ? __asan_memset+0x1f/0x40
>>>   sysfs_kf_seq_show+0x1f0/0x370
>>>   seq_read_iter+0x2cb/0x1130
>>>   ? rw_verify_area+0x3b1/0x590
>>>   ? __mutex_lock+0x433/0x1220
>>>   vfs_read+0x6a6/0xa20
>>>   ? lockdep_hardirqs_on+0x78/0x100
>>>   ? __pfx_vfs_read+0x10/0x10
>>>   ksys_read+0xf7/0x1d0
>>>   ? __pfx_ksys_read+0x10/0x10
>>>   ? __x64_sys_openat+0x105/0x1d0
>>>   do_syscall_64+0x93/0x180
>>>   ? lockdep_hardirqs_on_prepare+0x16d/0x400
>>>   ? do_syscall_64+0x9f/0x180
>>>   ? lockdep_hardirqs_on+0x78/0x100
>>>   ? do_syscall_64+0x9f/0x180
>>>   ? __pfx_ksys_read+0x10/0x10
>>>   ? lockdep_hardirqs_on_prepare+0x16d/0x400
>>>   ? do_syscall_64+0x9f/0x180
>>>   ? lockdep_hardirqs_on+0x78/0x100
>>>   ? do_syscall_64+0x9f/0x180
>>>   ? lockdep_hardirqs_on_prepare+0x16d/0x400
>>>   ? do_syscall_64+0x9f/0x180
>>>   ? lockdep_hardirqs_on+0x78/0x100
>>>   ? do_syscall_64+0x9f/0x180
>>>   ? lockdep_hardirqs_on_prepare+0x16d/0x400
>>>   ? do_syscall_64+0x9f/0x180
>>>   ? lockdep_hardirqs_on+0x78/0x100
>>>   ? do_syscall_64+0x9f/0x180
>>>   ? lockdep_hardirqs_on_prepare+0x16d/0x400
>>>   ? do_syscall_64+0x9f/0x180
>>>   ? lockdep_hardirqs_on+0x78/0x100
>>>   ? do_syscall_64+0x9f/0x180
>>>   ? do_syscall_64+0x9f/0x180
>>>   entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>> RIP: 0033:0x7f9713f55cfa
>>> Code: 55 48 89 e5 48 83 ec 20 48 89 55 e8 48 89 75 f0 89 7d f8 e8 e8 
>>> 74 f8 ff 48 8b 55 e8 48 8b 75 f0 41 89 c0 8b 7d f8 31 c0 0f 05 <48> 
>>> 3d 00 f0 ff ff 77 2e 44 89 c7 48 89 45 f8 e8 42 75 f8 ff 48 8b
>>> RSP: 002b:00007ffd7f512e70 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
>>> RAX: ffffffffffffffda RBX: 000055c38f316859 RCX: 00007f9713f55cfa
>>> RDX: 0000000000000fff RSI: 00007ffd7f512eb0 RDI: 0000000000000011
>>> RBP: 00007ffd7f512e90 R08: 0000000000000000 R09: 00000000ffffffff
>>> R10: 0000000000000000 R11: 0000000000000246 R12: 000055c38f317148
>>> R13: 0000000000000000 R14: 00007f96f4004f30 R15: 000055c3b6b623c0
>>>   </TASK>
>>>
>>> The WARN is observed when the blktests test case nvme/014 is repeated
>>> with tcp transport. It is rare, and 200 times repeat is required to
>>> recreate in my test environment.
>>>
>>> To avoid the WARN, serialize the two functions nvme_tcp_get_address()
>>> and nvme_tcp_free_queue(). Add the flag NVME_TCP_Q_REFERRED and set it
>>> when nvme_tcp_get_address() starts. Have nvme_tcp_free_queue() wait the
>>> flag cleared so that it does not run until nvme_tcp_get_address() ends.
>>> Also check the NVME_TCP_Q_ALLOCATED flag when nvme_tcp_get_address()
>>> starts. This flag is cleared at the nvme_tcp_free_queue() start, 
>>> then it
>>> ensures that nvme_tcp_get_address() is not executed while
>>> nvme_tcp_free_queue() runs.
>>>
>>> Of note is that nvme_tcp_stop_queue() locks the queue->queue_lock in
>>> same manner as nvme_tcp_get_address(). However, 
>>> nvme_tcp_stop_queue() is
>>> called in sequence before nvme_tcp_free_queue(), then it is safe to
>>> assume that the queue->queue_lock is not yet destroyed at
>>> nvme_tcp_stop_queue() call. Therefore this fix does not touch
>>> nvme_tcp_stop_queue().
>>>
>>> Fixes: 76d54bf20cdc ("nvme-tcp: don't access released socket during 
>>> error recovery")
>>> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
>>> ---
>>>   drivers/nvme/host/tcp.c | 16 +++++++++++++++-
>>>   1 file changed, 15 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>>> index a2a47d3ab99f..1457ffda207b 100644
>>> --- a/drivers/nvme/host/tcp.c
>>> +++ b/drivers/nvme/host/tcp.c
>>> @@ -127,6 +127,7 @@ enum nvme_tcp_queue_flags {
>>>       NVME_TCP_Q_ALLOCATED    = 0,
>>>       NVME_TCP_Q_LIVE        = 1,
>>>       NVME_TCP_Q_POLLING    = 2,
>>> +    NVME_TCP_Q_REFERRED    = 3,
>>>   };
>>>   enum nvme_tcp_recv_state {
>>> @@ -1365,6 +1366,9 @@ static void nvme_tcp_free_queue(struct 
>>> nvme_ctrl *nctrl, int qid)
>>>       if (!test_and_clear_bit(NVME_TCP_Q_ALLOCATED, &queue->flags))
>>>           return;
>>> +    wait_on_bit_io(&queue->flags, NVME_TCP_Q_REFERRED,
>>> +               TASK_UNINTERRUPTIBLE);
>>> +
>>>       if (queue->hdr_digest || queue->data_digest)
>>>           nvme_tcp_free_crypto(queue);
>>> @@ -2613,7 +2617,14 @@ static int nvme_tcp_get_address(struct 
>>> nvme_ctrl *ctrl, char *buf, int size)
>>>   {
>>>       struct nvme_tcp_queue *queue = &to_tcp_ctrl(ctrl)->queues[0];
>>>       struct sockaddr_storage src_addr;
>>> -    int ret, len;
>>> +    int ret, len = 0;
>>> +
>>> +    /* Serialize this function and nvme_tcp_free_queue() */
>>> +    if (test_and_set_bit(NVME_TCP_Q_REFERRED, &queue->flags))
>>> +        return 0;
>>> +
>>> +    if (!test_bit(NVME_TCP_Q_ALLOCATED, &queue->flags))
>>> +        goto out;
>>>       len = nvmf_get_address(ctrl, buf, size);
>>> @@ -2631,6 +2642,9 @@ static int nvme_tcp_get_address(struct 
>>> nvme_ctrl *ctrl, char *buf, int size)
>>>   done:
>>>       mutex_unlock(&queue->queue_lock);
>>> +out:
>>> +    clear_and_wake_up_bit(NVME_TCP_Q_REFERRED, &queue->flags);
>>> +
>>>       return len;
>>>   }
>>
>> This is really not needed.. You can already serialize with:
>> -- 
>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>> index a2a47d3ab99f..91c19d06cae9 100644
>> --- a/drivers/nvme/host/tcp.c
>> +++ b/drivers/nvme/host/tcp.c
>> @@ -1365,6 +1365,10 @@ static void nvme_tcp_free_queue(struct 
>> nvme_ctrl *nctrl, int qid)
>>          if (!test_and_clear_bit(NVME_TCP_Q_ALLOCATED, &queue->flags))
>>                  return;
>>
>> +       /* make sure any concurrent .get_address completes */
>> +       mutex_lock(&queue->queue_lock);
>> +       mutex_unlock(&queue->queue_lock);
>> +
>>          if (queue->hdr_digest || queue->data_digest)
>>                  nvme_tcp_free_crypto(queue);
>
> Wouldn't it be sufficient to check for NVME_TCP_Q_LIVE outside of the 
> mutex?

I think its intentional, so this serializes against nvme_tcp_stop_queue()


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] nvme: tcp: avoid race between lock and destroy of queue_lock
  2024-09-12 10:20     ` Sagi Grimberg
@ 2024-09-12 13:28       ` Shinichiro Kawasaki
  2024-10-01  9:38         ` Shinichiro Kawasaki
  0 siblings, 1 reply; 7+ messages in thread
From: Shinichiro Kawasaki @ 2024-09-12 13:28 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Hannes Reinecke, linux-nvme@lists.infradead.org, Keith Busch,
	Akinobu Mita, Martin Belanger, hch@infradead.org

On Sep 12, 2024 / 13:20, Sagi Grimberg wrote:
> 
> 
> 
> On 12/09/2024 12:16, Hannes Reinecke wrote:
> > On 9/12/24 08:59, Sagi Grimberg wrote:
> > > 
> > > 
> > > 
> > > On 12/09/2024 9:27, Shin'ichiro Kawasaki wrote:
[...]
> > > 
> > > This is really not needed.. You can already serialize with:
> > > -- 
> > > diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> > > index a2a47d3ab99f..91c19d06cae9 100644
> > > --- a/drivers/nvme/host/tcp.c
> > > +++ b/drivers/nvme/host/tcp.c
> > > @@ -1365,6 +1365,10 @@ static void nvme_tcp_free_queue(struct
> > > nvme_ctrl *nctrl, int qid)
> > >          if (!test_and_clear_bit(NVME_TCP_Q_ALLOCATED, &queue->flags))
> > >                  return;
> > > 
> > > +       /* make sure any concurrent .get_address completes */
> > > +       mutex_lock(&queue->queue_lock);
> > > +       mutex_unlock(&queue->queue_lock);
> > > +
> > >          if (queue->hdr_digest || queue->data_digest)
> > >                  nvme_tcp_free_crypto(queue);

I applied this change above only on top of the kernel v6.11-rc7. Unfortunately,
the WARN was still observed.

> > 
> > Wouldn't it be sufficient to check for NVME_TCP_Q_LIVE outside of the
> > mutex?
> 
> I think its intentional, so this serializes against nvme_tcp_stop_queue()

IIUC, NVME_TCP_Q_LIVE in the mutex region is required in nvme_tcp_free_queue(),
and we can not simply move it out of the region.

As a similar idea, I can think of doing the NVME_TCP_Q_LIVE check both in and
out of the mutex region. The change below does it by adding the check out of the
region:

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index a2a47d3ab99f..ba91f98e90f7 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -2617,6 +2617,9 @@ static int nvme_tcp_get_address(struct nvme_ctrl *ctrl, char *buf, int size)

 	len = nvmf_get_address(ctrl, buf, size);

+	if (!test_bit(NVME_TCP_Q_LIVE, &queue->flags))
+		return len;
+
 	mutex_lock(&queue->queue_lock);

 	if (!test_bit(NVME_TCP_Q_LIVE, &queue->flags))

With this change, I repeated the test case 400 times and the WARN was not
observed. Good, this fix might be enough.

Actually, once I had tried the similar fix using the NMVE_TCP_Q_ALLOCATED
flag instead of nvme_TCP_Q_LIVE. In that case, the WARN had not been observed
during the 400 times repeat. Both showed good result, but as Hannes noted, I
agree that the check by NVME_TCP_Q_LIVE is the better because NVME_TCP_Q_LIVE
is cleared earlier than NVME_TCP_Q_ALLOCATED, and there will be less chance the
queue_lock gets destroyed before the mutex_lock().

Having said that, I still think there is a very tiny possibility that the
queue_lock gets destroyed between the NVME_TCP_Q_LIVE flag check and the
mutex_lock(). That is why I suggested the new flag NVME_TCP_Q_REFERRED.
Am I overthinking?

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH] nvme: tcp: avoid race between lock and destroy of queue_lock
  2024-09-12 13:28       ` Shinichiro Kawasaki
@ 2024-10-01  9:38         ` Shinichiro Kawasaki
  2024-10-01 13:30           ` Hannes Reinecke
  0 siblings, 1 reply; 7+ messages in thread
From: Shinichiro Kawasaki @ 2024-10-01  9:38 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Sagi Grimberg, linux-nvme@lists.infradead.org, Keith Busch,
	Akinobu Mita, Martin Belanger, hch@infradead.org

On Sep 12, 2024 / 13:28, Shinichiro Kawasaki wrote:
[...]
> Having said that, I still think there is a very tiny possibility that the
> queue_lock gets destroyed between the NVME_TCP_Q_LIVE flag check and the
> mutex_lock(). That is why I suggested the new flag NVME_TCP_Q_REFERRED.
> Am I overthinking?

I thought about this again, now I think the patch I suggested is too much. The
fix Hannes suggested is much simpler, and I confirmed it avoids the failure that
I observed. Now I hope this fix to be applied.

Hannes,

May I send the fix as a formal patch on behalf of you? The fix will be as
follows, and I will put your authorship to it.

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 89c44413c593..3e416af2659f 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -2644,10 +2644,11 @@ static int nvme_tcp_get_address(struct nvme_ctrl *ctrl, char *buf, int size)
 
 	len = nvmf_get_address(ctrl, buf, size);
 
+	if (!test_bit(NVME_TCP_Q_LIVE, &queue->flags))
+		return len;
+
 	mutex_lock(&queue->queue_lock);
 
-	if (!test_bit(NVME_TCP_Q_LIVE, &queue->flags))
-		goto done;
 	ret = kernel_getsockname(queue->sock, (struct sockaddr *)&src_addr);
 	if (ret > 0) {
 		if (len > 0)
@@ -2655,7 +2656,7 @@ static int nvme_tcp_get_address(struct nvme_ctrl *ctrl, char *buf, int size)
 		len += scnprintf(buf + len, size - len, "%ssrc_addr=%pISc\n",
 				(len) ? "," : "", &src_addr);
 	}
-done:
+
 	mutex_unlock(&queue->queue_lock);
 
 	return len;


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH] nvme: tcp: avoid race between lock and destroy of queue_lock
  2024-10-01  9:38         ` Shinichiro Kawasaki
@ 2024-10-01 13:30           ` Hannes Reinecke
  0 siblings, 0 replies; 7+ messages in thread
From: Hannes Reinecke @ 2024-10-01 13:30 UTC (permalink / raw)
  To: Shinichiro Kawasaki
  Cc: Sagi Grimberg, linux-nvme@lists.infradead.org, Keith Busch,
	Akinobu Mita, Martin Belanger, hch@infradead.org

On 10/1/24 11:38, Shinichiro Kawasaki wrote:
> On Sep 12, 2024 / 13:28, Shinichiro Kawasaki wrote:
> [...]
>> Having said that, I still think there is a very tiny possibility that the
>> queue_lock gets destroyed between the NVME_TCP_Q_LIVE flag check and the
>> mutex_lock(). That is why I suggested the new flag NVME_TCP_Q_REFERRED.
>> Am I overthinking?
> 
> I thought about this again, now I think the patch I suggested is too much. The
> fix Hannes suggested is much simpler, and I confirmed it avoids the failure that
> I observed. Now I hope this fix to be applied.
> 
> Hannes,
> 
> May I send the fix as a formal patch on behalf of you? The fix will be as
> follows, and I will put your authorship to it.
> 
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 89c44413c593..3e416af2659f 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -2644,10 +2644,11 @@ static int nvme_tcp_get_address(struct nvme_ctrl *ctrl, char *buf, int size)
>   
>   	len = nvmf_get_address(ctrl, buf, size);
>   
> +	if (!test_bit(NVME_TCP_Q_LIVE, &queue->flags))
> +		return len;
> +
>   	mutex_lock(&queue->queue_lock);
>   
> -	if (!test_bit(NVME_TCP_Q_LIVE, &queue->flags))
> -		goto done;
>   	ret = kernel_getsockname(queue->sock, (struct sockaddr *)&src_addr);
>   	if (ret > 0) {
>   		if (len > 0)
> @@ -2655,7 +2656,7 @@ static int nvme_tcp_get_address(struct nvme_ctrl *ctrl, char *buf, int size)
>   		len += scnprintf(buf + len, size - len, "%ssrc_addr=%pISc\n",
>   				(len) ? "," : "", &src_addr);
>   	}
> -done:
> +
>   	mutex_unlock(&queue->queue_lock);
>   
>   	return len;

Sure, you're welcome to send it :-)

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich



^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2024-10-01 14:32 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-09-12  6:27 [PATCH] nvme: tcp: avoid race between lock and destroy of queue_lock Shin'ichiro Kawasaki
2024-09-12  6:59 ` Sagi Grimberg
2024-09-12  9:16   ` Hannes Reinecke
2024-09-12 10:20     ` Sagi Grimberg
2024-09-12 13:28       ` Shinichiro Kawasaki
2024-10-01  9:38         ` Shinichiro Kawasaki
2024-10-01 13:30           ` Hannes Reinecke

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox