From: Chaitanya Kulkarni <ckulkarnilinux@gmail.com>
To: kbusch@kernel.org, hch@lst.de, hare@suse.de, sagi@grimberg.me,
axboe@kernel.dk, dlemoal@kernel.org, wagi@kernel.org,
mpatocka@redhat.com, yukuai3@huawei.com, xni@redhat.com,
linan122@huawei.com, bmarzins@redhat.com,
john.g.garry@oracle.com, edumazet@google.com,
ncardwell@google.com, kuniyu@google.com, davem@davemloft.net,
dsahern@kernel.org, kuba@kernel.org, pabeni@redhat.com,
horms@kernel.org
Cc: netdev@vger.kernel.org, linux-nvme@lists.infradead.org,
linux-block@vger.kernel.org,
Chaitanya Kulkarni <ckulkarnilinux@gmail.com>
Subject: [RFC blktests fix PATCH] tcp: use GFP_ATOMIC in tcp_disconnect
Date: Mon, 24 Nov 2025 22:11:42 -0800
Message-ID: <20251125061142.18094-1-ckulkarnilinux@gmail.com>

tcp_disconnect() calls tcp_send_active_reset() with gfp_any(), which
returns GFP_KERNEL in process context. This can trigger a circular
locking dependency when called during block device teardown that
involves network-backed storage.
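For context, gfp_any() decides the allocation mode purely from softirq
state; its definition in include/net/sock.h is roughly:

	static inline gfp_t gfp_any(void)
	{
		/* Process context yields GFP_KERNEL, even with sk_lock
		 * held, so the allocation may enter direct reclaim and
		 * sleep.
		 */
		return in_softirq() ? GFP_ATOMIC : GFP_KERNEL;
	}

In the teardown path below, tcp_disconnect() runs in process context, so
the reset skb is allocated with GFP_KERNEL and can recurse into
fs_reclaim.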
The deadlock occurs with stacked storage configurations such as MD RAID
on top of NVMe over TCP when tearing down the block device:
CPU0 (mdadm --stop /dev/mdX):       CPU1 (NVMe I/O submission):
=============================       ===========================
del_gendisk()
  blk_unregister_queue()
    elevator_set_none()
      elevator_switch()
        __synchronize_srcu()
        [holds set->srcu]
        [waits for operations]
                                    nvme_tcp_queue_rq()
                                      nvme_tcp_try_send()
                                        tcp_sendmsg()
                                          lock_sock_nested()
                                          [holds sk_lock-AF_INET-NVME]
                                          [can wait for set->srcu]
                                    [cleanup triggers NVMe disconnect]
                                    nvme_tcp_teardown_io_queues()
                                      nvme_tcp_free_queue()
                                        sock_release()
                                          __sock_release()
                                            tcp_close()
                                              lock_sock_nested()
                                              [holds sk_lock-AF_INET-NVME]
                                              __tcp_close()
                                                tcp_disconnect()
                                                  tcp_send_active_reset()
                                                    alloc_skb(gfp_any())
                                                    [GFP_KERNEL in process context]
                                                    kmem_cache_alloc_node()
                                                      fs_reclaim_acquire()
                                                      [can trigger writeback]
                                                      [needs block layer]
                                                      [waits for set->srcu]

                  *** DEADLOCK ***

blktests ./check md/001:
[ 95.764798] run blktests md/001 at 2025-11-24 21:13:10
[ 96.020965] brd: module loaded
[ 96.098934] Key type psk registered
[ 96.237974] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
[ 96.244988] nvmet_tcp: enabling port 0 (127.0.0.1:4420)
[ 96.286775] nvmet: Created nvm controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349.
[ 96.290980] nvme nvme0: creating 48 I/O queues.
[ 96.304554] nvme nvme0: mapped 48/0/0 default/read/poll queues.
[ 96.322530] nvme nvme0: new ctrl: NQN "blktests-subsystem-1", addr 127.0.0.1:4420, hostnqn: nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349
[ 96.414331] md: async del_gendisk mode will be removed in future, please upgrade to mdadm-4.5+
[ 96.414427] block device autoloading is deprecated and will be removed.
[ 96.473347] md/raid1:md127: active with 1 out of 2 mirrors
[ 96.474602] md127: detected capacity change from 0 to 2093056
[ 96.665424] md127: detected capacity change from 2093056 to 0
[ 96.665433] md: md127 stopped.
[ 96.694365] nvme nvme0: Removing ctrl: NQN "blktests-subsystem-1"
[ 96.708310] block nvme0n1: no available path - failing I/O
[ 96.708379] block nvme0n1: no available path - failing I/O
[ 96.708414] block nvme0n1: no available path - failing I/O
[ 96.708734] block nvme0n1: no available path - failing I/O
[ 96.708745] block nvme0n1: no available path - failing I/O
[ 96.708761] block nvme0n1: no available path - failing I/O
[ 96.812432] ======================================================
[ 96.816828] WARNING: possible circular locking dependency detected
[ 96.821054] 6.18.0-rc6lblk-fnext+ #7 Tainted: G N
[ 96.825312] ------------------------------------------------------
[ 96.830181] nvme/2595 is trying to acquire lock:
[ 96.833374] ffffffff82e487e0 (fs_reclaim){+.+.}-{0:0}, at: kmem_cache_alloc_node_noprof+0x5a/0x770
[ 96.839640]
but task is already holding lock:
[ 96.843657] ffff88810c503358 (sk_lock-AF_INET-NVME){+.+.}-{0:0}, at: tcp_close+0x15/0x80
[ 96.849247]
which lock already depends on the new lock.
[ 96.854869]
the existing dependency chain (in reverse order) is:
[ 96.860473]
-> #4 (sk_lock-AF_INET-NVME){+.+.}-{0:0}:
[ 96.865028] lock_sock_nested+0x2e/0x70
[ 96.868084] tcp_sendmsg+0x1a/0x40
[ 96.870833] sock_sendmsg+0xed/0x110
[ 96.873677] nvme_tcp_try_send_cmd_pdu+0x13e/0x260 [nvme_tcp]
[ 96.878007] nvme_tcp_try_send+0xb3/0x330 [nvme_tcp]
[ 96.881344] nvme_tcp_queue_rq+0x342/0x3d0 [nvme_tcp]
[ 96.884399] blk_mq_dispatch_rq_list+0x29a/0x800
[ 96.887237] __blk_mq_sched_dispatch_requests+0x3de/0x5f0
[ 96.891116] blk_mq_sched_dispatch_requests+0x29/0x70
[ 96.894166] blk_mq_run_work_fn+0x76/0x1b0
[ 96.896710] process_one_work+0x211/0x630
[ 96.899162] worker_thread+0x184/0x330
[ 96.901503] kthread+0x10d/0x250
[ 96.903570] ret_from_fork+0x29a/0x300
[ 96.905888] ret_from_fork_asm+0x1a/0x30
[ 96.908186]
-> #3 (set->srcu){.+.+}-{0:0}:
[ 96.910188] __synchronize_srcu+0x49/0x170
[ 96.911882] elevator_switch+0xc9/0x330
[ 96.913459] elevator_change+0x133/0x1b0
[ 96.915079] elevator_set_none+0x3b/0x80
[ 96.916714] blk_unregister_queue+0xb0/0x120
[ 96.918450] __del_gendisk+0x14e/0x3c0
[ 96.920700] del_gendisk+0x75/0xa0
[ 96.922098] nvme_ns_remove+0xf2/0x230 [nvme_core]
[ 96.924044] nvme_remove_namespaces+0xf2/0x150 [nvme_core]
[ 96.926220] nvme_do_delete_ctrl+0x71/0x90 [nvme_core]
[ 96.928310] nvme_delete_ctrl_sync+0x3b/0x50 [nvme_core]
[ 96.930429] nvme_sysfs_delete+0x34/0x40 [nvme_core]
[ 96.932450] kernfs_fop_write_iter+0x16d/0x220
[ 96.934271] vfs_write+0x37b/0x520
[ 96.935746] ksys_write+0x67/0xe0
[ 96.937141] do_syscall_64+0x76/0xa60
[ 96.938645] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 96.940628]
-> #2 (&q->elevator_lock){+.+.}-{4:4}:
[ 96.942903] __mutex_lock+0xa2/0x1150
[ 96.944434] elevator_change+0x9b/0x1b0
[ 96.946046] elv_iosched_store+0x116/0x190
[ 96.947746] kernfs_fop_write_iter+0x16d/0x220
[ 96.949524] vfs_write+0x37b/0x520
[ 96.951506] ksys_write+0x67/0xe0
[ 96.952934] do_syscall_64+0x76/0xa60
[ 96.954457] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 96.956489]
-> #1 (&q->q_usage_counter(io)){++++}-{0:0}:
[ 96.959011] blk_alloc_queue+0x30e/0x350
[ 96.960664] blk_mq_alloc_queue+0x61/0xd0
[ 96.962293] scsi_alloc_sdev+0x2a0/0x3e0
[ 96.963954] scsi_probe_and_add_lun+0x1bd/0x430
[ 96.965782] __scsi_add_device+0x109/0x120
[ 96.967461] ata_scsi_scan_host+0x97/0x1c0
[ 96.969198] async_run_entry_fn+0x30/0x130
[ 96.970903] process_one_work+0x211/0x630
[ 96.972577] worker_thread+0x184/0x330
[ 96.974097] kthread+0x10d/0x250
[ 96.975448] ret_from_fork+0x29a/0x300
[ 96.977050] ret_from_fork_asm+0x1a/0x30
[ 96.978705]
-> #0 (fs_reclaim){+.+.}-{0:0}:
[ 96.981265] __lock_acquire+0x1468/0x2210
[ 96.982950] lock_acquire+0xd3/0x2f0
[ 96.984445] fs_reclaim_acquire+0x99/0xd0
[ 96.986141] kmem_cache_alloc_node_noprof+0x5a/0x770
[ 96.988171] __alloc_skb+0x15f/0x190
[ 96.989681] tcp_send_active_reset+0x3f/0x1e0
[ 96.991248] tcp_disconnect+0x551/0x770
[ 96.992851] __tcp_close+0x2c7/0x520
[ 96.994327] tcp_close+0x20/0x80
[ 96.995727] inet_release+0x34/0x60
[ 96.997168] __sock_release+0x3d/0xc0
[ 96.998688] sock_close+0x14/0x20
[ 97.000058] __fput+0xf1/0x2c0
[ 97.001388] task_work_run+0x58/0x90
[ 97.002922] exit_to_user_mode_loop+0x12c/0x150
[ 97.004720] do_syscall_64+0x2a0/0xa60
[ 97.006256] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 97.008279]
other info that might help us debug this:
[ 97.011827] Chain exists of:
fs_reclaim --> set->srcu --> sk_lock-AF_INET-NVME
[ 97.015506] Possible unsafe locking scenario:
[ 97.017718] CPU0 CPU1
[ 97.019363] ---- ----
[ 97.020984] lock(sk_lock-AF_INET-NVME);
[ 97.022399] lock(set->srcu);
[ 97.024415] lock(sk_lock-AF_INET-NVME);
[ 97.026798] lock(fs_reclaim);
[ 97.027927]
*** DEADLOCK ***
[ 97.030010] 2 locks held by nvme/2595:
[ 97.031353] #0: ffff88810047b388 (&sb->s_type->i_mutex_key#10){+.+.}-{4:4}, at: __sock_release+0x30/0xc0
[ 97.034820] #1: ffff88810c503358 (sk_lock-AF_INET-NVME){+.+.}-{0:0}, at: tcp_close+0x15/0x80
[ 97.037806]
stack backtrace:
[ 97.039367] CPU: 2 UID: 0 PID: 2595 Comm: nvme Tainted: G N 6.18.0-rc6lblk-fnext+ #7 PREEMPT(voluntary)
[ 97.039370] Tainted: [N]=TEST
[ 97.039371] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[ 97.039372] Call Trace:
[ 97.039374] <TASK>
[ 97.039375] dump_stack_lvl+0x75/0xb0
[ 97.039379] print_circular_bug+0x26a/0x330
[ 97.039381] check_noncircular+0x12f/0x150
[ 97.039385] __lock_acquire+0x1468/0x2210
[ 97.039388] lock_acquire+0xd3/0x2f0
[ 97.039390] ? kmem_cache_alloc_node_noprof+0x5a/0x770
[ 97.039393] fs_reclaim_acquire+0x99/0xd0
[ 97.039395] ? kmem_cache_alloc_node_noprof+0x5a/0x770
[ 97.039396] kmem_cache_alloc_node_noprof+0x5a/0x770
[ 97.039397] ? __alloc_skb+0x15f/0x190
[ 97.039400] ? __alloc_skb+0x15f/0x190
[ 97.039401] __alloc_skb+0x15f/0x190
[ 97.039403] tcp_send_active_reset+0x3f/0x1e0
[ 97.039405] tcp_disconnect+0x551/0x770
[ 97.039407] __tcp_close+0x2c7/0x520
[ 97.039408] tcp_close+0x20/0x80
[ 97.039410] inet_release+0x34/0x60
[ 97.039412] __sock_release+0x3d/0xc0
[ 97.039413] sock_close+0x14/0x20
[ 97.039414] __fput+0xf1/0x2c0
[ 97.039416] task_work_run+0x58/0x90
[ 97.039418] exit_to_user_mode_loop+0x12c/0x150
[ 97.039420] do_syscall_64+0x2a0/0xa60
[ 97.039422] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 97.039423] RIP: 0033:0x7f869032e317
[ 97.039425] Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[ 97.039430] RSP: 002b:00007fff7ceb31c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 97.039432] RAX: 0000000000000001 RBX: 00007fff7ceb44bd RCX: 00007f869032e317
[ 97.039433] RDX: 0000000000000001 RSI: 00007f869044c719 RDI: 0000000000000003
[ 97.039433] RBP: 0000000000000003 R08: 0000000017c8a850 R09: 00007f86903c44e0
[ 97.039434] R10: 00007f8690252130 R11: 0000000000000246 R12: 00007f869044c719
[ 97.039435] R13: 0000000017c8a4c0 R14: 0000000017c8a4c0 R15: 0000000017c8b680
[ 97.039438] </TASK>
[ 97.263257] brd: module unloaded

Fix this by using GFP_ATOMIC instead of gfp_any() in tcp_disconnect().
This matches the existing pattern in __tcp_close(), which already uses
GFP_ATOMIC when calling tcp_send_active_reset() (tcp.c:3246).
gfp_any() only distinguishes softirq context from process context; it
does not account for lock context where sleeping is unsafe.
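For reference, a paraphrased sketch of that precedent in __tcp_close()
(net/ipv4/tcp.c; the exact reason code and surrounding logic may differ
between kernel versions):

	if (tcp_check_oom(sk, 0)) {
		tcp_set_state(sk, TCP_CLOSE);
		/* sk_lock is held here, so the allocation must not sleep */
		tcp_send_active_reset(sk, GFP_ATOMIC,
				      SK_RST_REASON_TCP_ABORT_ON_MEMORY);
	}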
The issue was discovered with blktests md/001, which creates an MD RAID1
array with an internal bitmap on top of NVMe/TCP and then stops the
array. Stopping the array triggers the block device removal -> elevator
cleanup -> network teardown path that exposes the circular dependency.
Signed-off-by: Chaitanya Kulkarni <ckulkarnilinux@gmail.com>
---
Hi,
Full disclosure: I'm not an expert in this area; if there is a better
solution, I'll be happy to try that.
-ck
---
net/ipv4/tcp.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 8a18aeca7ab0..9fd01a8b90b5 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -3363,14 +3363,15 @@ int tcp_disconnect(struct sock *sk, int flags)
} else if (unlikely(tp->repair)) {
WRITE_ONCE(sk->sk_err, ECONNABORTED);
} else if (tcp_need_reset(old_state)) {
- tcp_send_active_reset(sk, gfp_any(), SK_RST_REASON_TCP_STATE);
+ /* Use GFP_ATOMIC since we're holding sk_lock */
+ tcp_send_active_reset(sk, GFP_ATOMIC, SK_RST_REASON_TCP_STATE);
WRITE_ONCE(sk->sk_err, ECONNRESET);
} else if (tp->snd_nxt != tp->write_seq &&
(1 << old_state) & (TCPF_CLOSING | TCPF_LAST_ACK)) {
/* The last check adjusts for discrepancy of Linux wrt. RFC
* states
*/
- tcp_send_active_reset(sk, gfp_any(),
+ tcp_send_active_reset(sk, GFP_ATOMIC,
SK_RST_REASON_TCP_DISCONNECT_WITH_DATA);
WRITE_ONCE(sk->sk_err, ECONNRESET);
} else if (old_state == TCP_SYN_SENT)
--
2.40.0