From: Chaitanya Kulkarni <ckulkarnilinux@gmail.com>
To: kbusch@kernel.org, hch@lst.de, hare@suse.de, sagi@grimberg.me,
axboe@kernel.dk, dlemoal@kernel.org, wagi@kernel.org,
mpatocka@redhat.com, yukuai3@huawei.com, xni@redhat.com,
linan122@huawei.com, bmarzins@redhat.com,
john.g.garry@oracle.com, edumazet@google.com,
ncardwell@google.com, kuniyu@google.com, davem@davemloft.net,
dsahern@kernel.org, kuba@kernel.org, pabeni@redhat.com,
horms@kernel.org
Cc: netdev@vger.kernel.org, linux-nvme@lists.infradead.org,
linux-block@vger.kernel.org,
Chaitanya Kulkarni <ckulkarnilinux@gmail.com>
Subject: [RFC blktests fix PATCH] tcp: use GFP_ATOMIC in tcp_disconnect
Date: Mon, 24 Nov 2025 22:11:42 -0800
Message-ID: <20251125061142.18094-1-ckulkarnilinux@gmail.com>

tcp_disconnect() calls tcp_send_active_reset() with gfp_any(), which
returns GFP_KERNEL in process context. This can trigger a circular
locking dependency when called during block device teardown that
involves network-backed storage.
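For context, gfp_any() decides the allocation mode purely from softirq
state; its definition in include/net/sock.h is roughly:

	static inline gfp_t gfp_any(void)
	{
		/* Process context yields GFP_KERNEL, even with sk_lock
		 * held, so the allocation may enter direct reclaim and
		 * sleep.
		 */
		return in_softirq() ? GFP_ATOMIC : GFP_KERNEL;
	}

In the teardown path below, tcp_disconnect() runs in process context, so
the reset skb is allocated with GFP_KERNEL and can recurse into
fs_reclaim.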
The deadlock occurs with stacked storage configurations such as MD RAID
on top of NVMe over TCP when tearing down the block device:
CPU0 (mdadm --stop /dev/mdX):       CPU1 (NVMe I/O submission):
=============================       ===========================
del_gendisk()
  blk_unregister_queue()
    elevator_set_none()
      elevator_switch()
        __synchronize_srcu()
        [holds set->srcu]
        [waits for operations]
                                    nvme_tcp_queue_rq()
                                      nvme_tcp_try_send()
                                        tcp_sendmsg()
                                          lock_sock_nested()
                                          [holds sk_lock-AF_INET-NVME]
                                          [can wait for set->srcu]
                                    [cleanup triggers NVMe disconnect]
                                    nvme_tcp_teardown_io_queues()
                                      nvme_tcp_free_queue()
                                        sock_release()
                                          __sock_release()
                                            tcp_close()
                                              lock_sock_nested()
                                              [holds sk_lock-AF_INET-NVME]
                                              __tcp_close()
                                                tcp_disconnect()
                                                  tcp_send_active_reset()
                                                    alloc_skb(gfp_any())
                                                    [GFP_KERNEL in process context]
                                                    kmem_cache_alloc_node()
                                                      fs_reclaim_acquire()
                                                      [can trigger writeback]
                                                      [needs block layer]
                                                      [waits for set->srcu]

                  *** DEADLOCK ***

blktests ./check md/001:
[ 95.764798] run blktests md/001 at 2025-11-24 21:13:10
[ 96.020965] brd: module loaded
[ 96.098934] Key type psk registered
[ 96.237974] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
[ 96.244988] nvmet_tcp: enabling port 0 (127.0.0.1:4420)
[ 96.286775] nvmet: Created nvm controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349.
[ 96.290980] nvme nvme0: creating 48 I/O queues.
[ 96.304554] nvme nvme0: mapped 48/0/0 default/read/poll queues.
[ 96.322530] nvme nvme0: new ctrl: NQN "blktests-subsystem-1", addr 127.0.0.1:4420, hostnqn: nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349
[ 96.414331] md: async del_gendisk mode will be removed in future, please upgrade to mdadm-4.5+
[ 96.414427] block device autoloading is deprecated and will be removed.
[ 96.473347] md/raid1:md127: active with 1 out of 2 mirrors
[ 96.474602] md127: detected capacity change from 0 to 2093056
[ 96.665424] md127: detected capacity change from 2093056 to 0
[ 96.665433] md: md127 stopped.
[ 96.694365] nvme nvme0: Removing ctrl: NQN "blktests-subsystem-1"
[ 96.708310] block nvme0n1: no available path - failing I/O
[ 96.708379] block nvme0n1: no available path - failing I/O
[ 96.708414] block nvme0n1: no available path - failing I/O
[ 96.708734] block nvme0n1: no available path - failing I/O
[ 96.708745] block nvme0n1: no available path - failing I/O
[ 96.708761] block nvme0n1: no available path - failing I/O
[ 96.812432] ======================================================
[ 96.816828] WARNING: possible circular locking dependency detected
[ 96.821054] 6.18.0-rc6lblk-fnext+ #7 Tainted: G N
[ 96.825312] ------------------------------------------------------
[ 96.830181] nvme/2595 is trying to acquire lock:
[ 96.833374] ffffffff82e487e0 (fs_reclaim){+.+.}-{0:0}, at: kmem_cache_alloc_node_noprof+0x5a/0x770
[ 96.839640]
but task is already holding lock:
[ 96.843657] ffff88810c503358 (sk_lock-AF_INET-NVME){+.+.}-{0:0}, at: tcp_close+0x15/0x80
[ 96.849247]
which lock already depends on the new lock.
[ 96.854869]
the existing dependency chain (in reverse order) is:
[ 96.860473]
-> #4 (sk_lock-AF_INET-NVME){+.+.}-{0:0}:
[ 96.865028] lock_sock_nested+0x2e/0x70
[ 96.868084] tcp_sendmsg+0x1a/0x40
[ 96.870833] sock_sendmsg+0xed/0x110
[ 96.873677] nvme_tcp_try_send_cmd_pdu+0x13e/0x260 [nvme_tcp]
[ 96.878007] nvme_tcp_try_send+0xb3/0x330 [nvme_tcp]
[ 96.881344] nvme_tcp_queue_rq+0x342/0x3d0 [nvme_tcp]
[ 96.884399] blk_mq_dispatch_rq_list+0x29a/0x800
[ 96.887237] __blk_mq_sched_dispatch_requests+0x3de/0x5f0
[ 96.891116] blk_mq_sched_dispatch_requests+0x29/0x70
[ 96.894166] blk_mq_run_work_fn+0x76/0x1b0
[ 96.896710] process_one_work+0x211/0x630
[ 96.899162] worker_thread+0x184/0x330
[ 96.901503] kthread+0x10d/0x250
[ 96.903570] ret_from_fork+0x29a/0x300
[ 96.905888] ret_from_fork_asm+0x1a/0x30
[ 96.908186]
-> #3 (set->srcu){.+.+}-{0:0}:
[ 96.910188] __synchronize_srcu+0x49/0x170
[ 96.911882] elevator_switch+0xc9/0x330
[ 96.913459] elevator_change+0x133/0x1b0
[ 96.915079] elevator_set_none+0x3b/0x80
[ 96.916714] blk_unregister_queue+0xb0/0x120
[ 96.918450] __del_gendisk+0x14e/0x3c0
[ 96.920700] del_gendisk+0x75/0xa0
[ 96.922098] nvme_ns_remove+0xf2/0x230 [nvme_core]
[ 96.924044] nvme_remove_namespaces+0xf2/0x150 [nvme_core]
[ 96.926220] nvme_do_delete_ctrl+0x71/0x90 [nvme_core]
[ 96.928310] nvme_delete_ctrl_sync+0x3b/0x50 [nvme_core]
[ 96.930429] nvme_sysfs_delete+0x34/0x40 [nvme_core]
[ 96.932450] kernfs_fop_write_iter+0x16d/0x220
[ 96.934271] vfs_write+0x37b/0x520
[ 96.935746] ksys_write+0x67/0xe0
[ 96.937141] do_syscall_64+0x76/0xa60
[ 96.938645] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 96.940628]
-> #2 (&q->elevator_lock){+.+.}-{4:4}:
[ 96.942903] __mutex_lock+0xa2/0x1150
[ 96.944434] elevator_change+0x9b/0x1b0
[ 96.946046] elv_iosched_store+0x116/0x190
[ 96.947746] kernfs_fop_write_iter+0x16d/0x220
[ 96.949524] vfs_write+0x37b/0x520
[ 96.951506] ksys_write+0x67/0xe0
[ 96.952934] do_syscall_64+0x76/0xa60
[ 96.954457] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 96.956489]
-> #1 (&q->q_usage_counter(io)){++++}-{0:0}:
[ 96.959011] blk_alloc_queue+0x30e/0x350
[ 96.960664] blk_mq_alloc_queue+0x61/0xd0
[ 96.962293] scsi_alloc_sdev+0x2a0/0x3e0
[ 96.963954] scsi_probe_and_add_lun+0x1bd/0x430
[ 96.965782] __scsi_add_device+0x109/0x120
[ 96.967461] ata_scsi_scan_host+0x97/0x1c0
[ 96.969198] async_run_entry_fn+0x30/0x130
[ 96.970903] process_one_work+0x211/0x630
[ 96.972577] worker_thread+0x184/0x330
[ 96.974097] kthread+0x10d/0x250
[ 96.975448] ret_from_fork+0x29a/0x300
[ 96.977050] ret_from_fork_asm+0x1a/0x30
[ 96.978705]
-> #0 (fs_reclaim){+.+.}-{0:0}:
[ 96.981265] __lock_acquire+0x1468/0x2210
[ 96.982950] lock_acquire+0xd3/0x2f0
[ 96.984445] fs_reclaim_acquire+0x99/0xd0
[ 96.986141] kmem_cache_alloc_node_noprof+0x5a/0x770
[ 96.988171] __alloc_skb+0x15f/0x190
[ 96.989681] tcp_send_active_reset+0x3f/0x1e0
[ 96.991248] tcp_disconnect+0x551/0x770
[ 96.992851] __tcp_close+0x2c7/0x520
[ 96.994327] tcp_close+0x20/0x80
[ 96.995727] inet_release+0x34/0x60
[ 96.997168] __sock_release+0x3d/0xc0
[ 96.998688] sock_close+0x14/0x20
[ 97.000058] __fput+0xf1/0x2c0
[ 97.001388] task_work_run+0x58/0x90
[ 97.002922] exit_to_user_mode_loop+0x12c/0x150
[ 97.004720] do_syscall_64+0x2a0/0xa60
[ 97.006256] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 97.008279]
other info that might help us debug this:
[ 97.011827] Chain exists of:
fs_reclaim --> set->srcu --> sk_lock-AF_INET-NVME
[ 97.015506] Possible unsafe locking scenario:
[ 97.017718] CPU0 CPU1
[ 97.019363] ---- ----
[ 97.020984] lock(sk_lock-AF_INET-NVME);
[ 97.022399] lock(set->srcu);
[ 97.024415] lock(sk_lock-AF_INET-NVME);
[ 97.026798] lock(fs_reclaim);
[ 97.027927]
*** DEADLOCK ***
[ 97.030010] 2 locks held by nvme/2595:
[ 97.031353] #0: ffff88810047b388 (&sb->s_type->i_mutex_key#10){+.+.}-{4:4}, at: __sock_release+0x30/0xc0
[ 97.034820] #1: ffff88810c503358 (sk_lock-AF_INET-NVME){+.+.}-{0:0}, at: tcp_close+0x15/0x80
[ 97.037806]
stack backtrace:
[ 97.039367] CPU: 2 UID: 0 PID: 2595 Comm: nvme Tainted: G N 6.18.0-rc6lblk-fnext+ #7 PREEMPT(voluntary)
[ 97.039370] Tainted: [N]=TEST
[ 97.039371] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[ 97.039372] Call Trace:
[ 97.039374] <TASK>
[ 97.039375] dump_stack_lvl+0x75/0xb0
[ 97.039379] print_circular_bug+0x26a/0x330
[ 97.039381] check_noncircular+0x12f/0x150
[ 97.039385] __lock_acquire+0x1468/0x2210
[ 97.039388] lock_acquire+0xd3/0x2f0
[ 97.039390] ? kmem_cache_alloc_node_noprof+0x5a/0x770
[ 97.039393] fs_reclaim_acquire+0x99/0xd0
[ 97.039395] ? kmem_cache_alloc_node_noprof+0x5a/0x770
[ 97.039396] kmem_cache_alloc_node_noprof+0x5a/0x770
[ 97.039397] ? __alloc_skb+0x15f/0x190
[ 97.039400] ? __alloc_skb+0x15f/0x190
[ 97.039401] __alloc_skb+0x15f/0x190
[ 97.039403] tcp_send_active_reset+0x3f/0x1e0
[ 97.039405] tcp_disconnect+0x551/0x770
[ 97.039407] __tcp_close+0x2c7/0x520
[ 97.039408] tcp_close+0x20/0x80
[ 97.039410] inet_release+0x34/0x60
[ 97.039412] __sock_release+0x3d/0xc0
[ 97.039413] sock_close+0x14/0x20
[ 97.039414] __fput+0xf1/0x2c0
[ 97.039416] task_work_run+0x58/0x90
[ 97.039418] exit_to_user_mode_loop+0x12c/0x150
[ 97.039420] do_syscall_64+0x2a0/0xa60
[ 97.039422] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 97.039423] RIP: 0033:0x7f869032e317
[ 97.039425] Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[ 97.039430] RSP: 002b:00007fff7ceb31c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 97.039432] RAX: 0000000000000001 RBX: 00007fff7ceb44bd RCX: 00007f869032e317
[ 97.039433] RDX: 0000000000000001 RSI: 00007f869044c719 RDI: 0000000000000003
[ 97.039433] RBP: 0000000000000003 R08: 0000000017c8a850 R09: 00007f86903c44e0
[ 97.039434] R10: 00007f8690252130 R11: 0000000000000246 R12: 00007f869044c719
[ 97.039435] R13: 0000000017c8a4c0 R14: 0000000017c8a4c0 R15: 0000000017c8b680
[ 97.039438] </TASK>
[ 97.263257] brd: module unloaded

Fix this by using GFP_ATOMIC instead of gfp_any() in tcp_disconnect().
This matches the existing pattern in __tcp_close(), which already uses
GFP_ATOMIC when calling tcp_send_active_reset() (tcp.c:3246).
gfp_any() only distinguishes softirq context from process context; it
does not account for lock context where sleeping is unsafe.
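For reference, a paraphrased sketch of that precedent in __tcp_close()
(net/ipv4/tcp.c; the exact reason code and surrounding logic may differ
between kernel versions):

	if (tcp_check_oom(sk, 0)) {
		tcp_set_state(sk, TCP_CLOSE);
		/* sk_lock is held here, so the allocation must not sleep */
		tcp_send_active_reset(sk, GFP_ATOMIC,
				      SK_RST_REASON_TCP_ABORT_ON_MEMORY);
	}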
The issue was discovered with blktests md/001, which creates an MD RAID1
array with an internal bitmap on top of NVMe/TCP and then stops the
array. Stopping the array triggers the block device removal -> elevator
cleanup -> network teardown path that exposes the circular dependency.
Signed-off-by: Chaitanya Kulkarni <ckulkarnilinux@gmail.com>
---
Hi,
Full disclosure: I'm not an expert in this area; if there is a better
solution, I'll be happy to try that.
-ck
---
net/ipv4/tcp.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 8a18aeca7ab0..9fd01a8b90b5 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -3363,14 +3363,15 @@ int tcp_disconnect(struct sock *sk, int flags)
} else if (unlikely(tp->repair)) {
WRITE_ONCE(sk->sk_err, ECONNABORTED);
} else if (tcp_need_reset(old_state)) {
- tcp_send_active_reset(sk, gfp_any(), SK_RST_REASON_TCP_STATE);
+ /* Use GFP_ATOMIC since we're holding sk_lock */
+ tcp_send_active_reset(sk, GFP_ATOMIC, SK_RST_REASON_TCP_STATE);
WRITE_ONCE(sk->sk_err, ECONNRESET);
} else if (tp->snd_nxt != tp->write_seq &&
(1 << old_state) & (TCPF_CLOSING | TCPF_LAST_ACK)) {
/* The last check adjusts for discrepancy of Linux wrt. RFC
* states
*/
- tcp_send_active_reset(sk, gfp_any(),
+ tcp_send_active_reset(sk, GFP_ATOMIC,
SK_RST_REASON_TCP_DISCONNECT_WITH_DATA);
WRITE_ONCE(sk->sk_err, ECONNRESET);
} else if (old_state == TCP_SYN_SENT)
--
2.40.0