* [RFC blktests fix PATCH] tcp: use GFP_ATOMIC in tcp_disconnect
From: Chaitanya Kulkarni @ 2025-11-25 6:11 UTC
To: kbusch, hch, hare, sagi, axboe, dlemoal, wagi, mpatocka, yukuai3,
xni, linan122, bmarzins, john.g.garry, edumazet, ncardwell,
kuniyu, davem, dsahern, kuba, pabeni, horms
Cc: netdev, linux-nvme, linux-block, Chaitanya Kulkarni
tcp_disconnect() calls tcp_send_active_reset() with gfp_any(), which
returns GFP_KERNEL in process context. This can trigger a circular
locking dependency when called during block device teardown that
involves network-backed storage.
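For reference, gfp_any() selects the allocation mode purely from
interrupt context (include/net/sock.h, roughly):

	static inline gfp_t gfp_any(void)
	{
		return in_softirq() ? GFP_ATOMIC : GFP_KERNEL;
	}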
The deadlock scenario occurs with storage configurations such as MD RAID
over NVMe-TCP when tearing down the block device:
CPU0 (mdadm --stop /dev/mdX):             CPU1 (NVMe I/O submission):
================================          ===========================
del_gendisk()
  blk_unregister_queue()
    elevator_set_none()
      elevator_switch()
        __synchronize_srcu()
        [holds set->srcu]
        [waits for operations]
                                          nvme_tcp_queue_rq()
                                            nvme_tcp_try_send()
                                              tcp_sendmsg()
                                                lock_sock_nested()
                                                [holds sk_lock-AF_INET-NVME]
                                                [can wait for set->srcu]
[cleanup triggers NVMe disconnect]
nvme_tcp_teardown_io_queues()
  nvme_tcp_free_queue()
    sock_release()
      __sock_release()
        tcp_close()
          lock_sock_nested()
          [holds sk_lock-AF_INET-NVME]
          __tcp_close()
            tcp_disconnect()
              tcp_send_active_reset()
                alloc_skb(gfp_any())
                [GFP_KERNEL in process context]
                kmem_cache_alloc_node()
                  fs_reclaim_acquire()
                  [can trigger writeback]
                  [needs block layer]
                  [waits for set->srcu]

                *** DEADLOCK ***
blktests ./check md/001:
[ 95.764798] run blktests md/001 at 2025-11-24 21:13:10
[ 96.020965] brd: module loaded
[ 96.098934] Key type psk registered
[ 96.237974] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
[ 96.244988] nvmet_tcp: enabling port 0 (127.0.0.1:4420)
[ 96.286775] nvmet: Created nvm controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349.
[ 96.290980] nvme nvme0: creating 48 I/O queues.
[ 96.304554] nvme nvme0: mapped 48/0/0 default/read/poll queues.
[ 96.322530] nvme nvme0: new ctrl: NQN "blktests-subsystem-1", addr 127.0.0.1:4420, hostnqn: nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349
[ 96.414331] md: async del_gendisk mode will be removed in future, please upgrade to mdadm-4.5+
[ 96.414427] block device autoloading is deprecated and will be removed.
[ 96.473347] md/raid1:md127: active with 1 out of 2 mirrors
[ 96.474602] md127: detected capacity change from 0 to 2093056
[ 96.665424] md127: detected capacity change from 2093056 to 0
[ 96.665433] md: md127 stopped.
[ 96.694365] nvme nvme0: Removing ctrl: NQN "blktests-subsystem-1"
[ 96.708310] block nvme0n1: no available path - failing I/O
[ 96.708379] block nvme0n1: no available path - failing I/O
[ 96.708414] block nvme0n1: no available path - failing I/O
[ 96.708734] block nvme0n1: no available path - failing I/O
[ 96.708745] block nvme0n1: no available path - failing I/O
[ 96.708761] block nvme0n1: no available path - failing I/O
[ 96.812432] ======================================================
[ 96.816828] WARNING: possible circular locking dependency detected
[ 96.821054] 6.18.0-rc6lblk-fnext+ #7 Tainted: G N
[ 96.825312] ------------------------------------------------------
[ 96.830181] nvme/2595 is trying to acquire lock:
[ 96.833374] ffffffff82e487e0 (fs_reclaim){+.+.}-{0:0}, at: kmem_cache_alloc_node_noprof+0x5a/0x770
[ 96.839640]
but task is already holding lock:
[ 96.843657] ffff88810c503358 (sk_lock-AF_INET-NVME){+.+.}-{0:0}, at: tcp_close+0x15/0x80
[ 96.849247]
which lock already depends on the new lock.
[ 96.854869]
the existing dependency chain (in reverse order) is:
[ 96.860473]
-> #4 (sk_lock-AF_INET-NVME){+.+.}-{0:0}:
[ 96.865028] lock_sock_nested+0x2e/0x70
[ 96.868084] tcp_sendmsg+0x1a/0x40
[ 96.870833] sock_sendmsg+0xed/0x110
[ 96.873677] nvme_tcp_try_send_cmd_pdu+0x13e/0x260 [nvme_tcp]
[ 96.878007] nvme_tcp_try_send+0xb3/0x330 [nvme_tcp]
[ 96.881344] nvme_tcp_queue_rq+0x342/0x3d0 [nvme_tcp]
[ 96.884399] blk_mq_dispatch_rq_list+0x29a/0x800
[ 96.887237] __blk_mq_sched_dispatch_requests+0x3de/0x5f0
[ 96.891116] blk_mq_sched_dispatch_requests+0x29/0x70
[ 96.894166] blk_mq_run_work_fn+0x76/0x1b0
[ 96.896710] process_one_work+0x211/0x630
[ 96.899162] worker_thread+0x184/0x330
[ 96.901503] kthread+0x10d/0x250
[ 96.903570] ret_from_fork+0x29a/0x300
[ 96.905888] ret_from_fork_asm+0x1a/0x30
[ 96.908186]
-> #3 (set->srcu){.+.+}-{0:0}:
[ 96.910188] __synchronize_srcu+0x49/0x170
[ 96.911882] elevator_switch+0xc9/0x330
[ 96.913459] elevator_change+0x133/0x1b0
[ 96.915079] elevator_set_none+0x3b/0x80
[ 96.916714] blk_unregister_queue+0xb0/0x120
[ 96.918450] __del_gendisk+0x14e/0x3c0
[ 96.920700] del_gendisk+0x75/0xa0
[ 96.922098] nvme_ns_remove+0xf2/0x230 [nvme_core]
[ 96.924044] nvme_remove_namespaces+0xf2/0x150 [nvme_core]
[ 96.926220] nvme_do_delete_ctrl+0x71/0x90 [nvme_core]
[ 96.928310] nvme_delete_ctrl_sync+0x3b/0x50 [nvme_core]
[ 96.930429] nvme_sysfs_delete+0x34/0x40 [nvme_core]
[ 96.932450] kernfs_fop_write_iter+0x16d/0x220
[ 96.934271] vfs_write+0x37b/0x520
[ 96.935746] ksys_write+0x67/0xe0
[ 96.937141] do_syscall_64+0x76/0xa60
[ 96.938645] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 96.940628]
-> #2 (&q->elevator_lock){+.+.}-{4:4}:
[ 96.942903] __mutex_lock+0xa2/0x1150
[ 96.944434] elevator_change+0x9b/0x1b0
[ 96.946046] elv_iosched_store+0x116/0x190
[ 96.947746] kernfs_fop_write_iter+0x16d/0x220
[ 96.949524] vfs_write+0x37b/0x520
[ 96.951506] ksys_write+0x67/0xe0
[ 96.952934] do_syscall_64+0x76/0xa60
[ 96.954457] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 96.956489]
-> #1 (&q->q_usage_counter(io)){++++}-{0:0}:
[ 96.959011] blk_alloc_queue+0x30e/0x350
[ 96.960664] blk_mq_alloc_queue+0x61/0xd0
[ 96.962293] scsi_alloc_sdev+0x2a0/0x3e0
[ 96.963954] scsi_probe_and_add_lun+0x1bd/0x430
[ 96.965782] __scsi_add_device+0x109/0x120
[ 96.967461] ata_scsi_scan_host+0x97/0x1c0
[ 96.969198] async_run_entry_fn+0x30/0x130
[ 96.970903] process_one_work+0x211/0x630
[ 96.972577] worker_thread+0x184/0x330
[ 96.974097] kthread+0x10d/0x250
[ 96.975448] ret_from_fork+0x29a/0x300
[ 96.977050] ret_from_fork_asm+0x1a/0x30
[ 96.978705]
-> #0 (fs_reclaim){+.+.}-{0:0}:
[ 96.981265] __lock_acquire+0x1468/0x2210
[ 96.982950] lock_acquire+0xd3/0x2f0
[ 96.984445] fs_reclaim_acquire+0x99/0xd0
[ 96.986141] kmem_cache_alloc_node_noprof+0x5a/0x770
[ 96.988171] __alloc_skb+0x15f/0x190
[ 96.989681] tcp_send_active_reset+0x3f/0x1e0
[ 96.991248] tcp_disconnect+0x551/0x770
[ 96.992851] __tcp_close+0x2c7/0x520
[ 96.994327] tcp_close+0x20/0x80
[ 96.995727] inet_release+0x34/0x60
[ 96.997168] __sock_release+0x3d/0xc0
[ 96.998688] sock_close+0x14/0x20
[ 97.000058] __fput+0xf1/0x2c0
[ 97.001388] task_work_run+0x58/0x90
[ 97.002922] exit_to_user_mode_loop+0x12c/0x150
[ 97.004720] do_syscall_64+0x2a0/0xa60
[ 97.006256] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 97.008279]
other info that might help us debug this:
[ 97.011827] Chain exists of:
fs_reclaim --> set->srcu --> sk_lock-AF_INET-NVME
[ 97.015506] Possible unsafe locking scenario:
[ 97.017718]        CPU0                    CPU1
[ 97.019363]        ----                    ----
[ 97.020984]   lock(sk_lock-AF_INET-NVME);
[ 97.022399]                                lock(set->srcu);
[ 97.024415]                                lock(sk_lock-AF_INET-NVME);
[ 97.026798]   lock(fs_reclaim);
[ 97.027927]
*** DEADLOCK ***
[ 97.030010] 2 locks held by nvme/2595:
[ 97.031353] #0: ffff88810047b388 (&sb->s_type->i_mutex_key#10){+.+.}-{4:4}, at: __sock_release+0x30/0xc0
[ 97.034820] #1: ffff88810c503358 (sk_lock-AF_INET-NVME){+.+.}-{0:0}, at: tcp_close+0x15/0x80
[ 97.037806]
stack backtrace:
[ 97.039367] CPU: 2 UID: 0 PID: 2595 Comm: nvme Tainted: G N 6.18.0-rc6lblk-fnext+ #7 PREEMPT(voluntary)
[ 97.039370] Tainted: [N]=TEST
[ 97.039371] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[ 97.039372] Call Trace:
[ 97.039374] <TASK>
[ 97.039375] dump_stack_lvl+0x75/0xb0
[ 97.039379] print_circular_bug+0x26a/0x330
[ 97.039381] check_noncircular+0x12f/0x150
[ 97.039385] __lock_acquire+0x1468/0x2210
[ 97.039388] lock_acquire+0xd3/0x2f0
[ 97.039390] ? kmem_cache_alloc_node_noprof+0x5a/0x770
[ 97.039393] fs_reclaim_acquire+0x99/0xd0
[ 97.039395] ? kmem_cache_alloc_node_noprof+0x5a/0x770
[ 97.039396] kmem_cache_alloc_node_noprof+0x5a/0x770
[ 97.039397] ? __alloc_skb+0x15f/0x190
[ 97.039400] ? __alloc_skb+0x15f/0x190
[ 97.039401] __alloc_skb+0x15f/0x190
[ 97.039403] tcp_send_active_reset+0x3f/0x1e0
[ 97.039405] tcp_disconnect+0x551/0x770
[ 97.039407] __tcp_close+0x2c7/0x520
[ 97.039408] tcp_close+0x20/0x80
[ 97.039410] inet_release+0x34/0x60
[ 97.039412] __sock_release+0x3d/0xc0
[ 97.039413] sock_close+0x14/0x20
[ 97.039414] __fput+0xf1/0x2c0
[ 97.039416] task_work_run+0x58/0x90
[ 97.039418] exit_to_user_mode_loop+0x12c/0x150
[ 97.039420] do_syscall_64+0x2a0/0xa60
[ 97.039422] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 97.039423] RIP: 0033:0x7f869032e317
[ 97.039425] Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[ 97.039430] RSP: 002b:00007fff7ceb31c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 97.039432] RAX: 0000000000000001 RBX: 00007fff7ceb44bd RCX: 00007f869032e317
[ 97.039433] RDX: 0000000000000001 RSI: 00007f869044c719 RDI: 0000000000000003
[ 97.039433] RBP: 0000000000000003 R08: 0000000017c8a850 R09: 00007f86903c44e0
[ 97.039434] R10: 00007f8690252130 R11: 0000000000000246 R12: 00007f869044c719
[ 97.039435] R13: 0000000017c8a4c0 R14: 0000000017c8a4c0 R15: 0000000017c8b680
[ 97.039438] </TASK>
[ 97.263257] brd: module unloaded
Fix this by using GFP_ATOMIC instead of gfp_any() in tcp_disconnect().
This matches the existing pattern in __tcp_close(), which already passes
GFP_ATOMIC to tcp_send_active_reset() (tcp.c:3246). gfp_any() only
distinguishes softirq context from process context; it does not account
for lock contexts in which sleeping, and hence direct reclaim, is unsafe.
The issue was discovered with blktests md/001, which creates an MD RAID1
array with internal bitmap over NVMe-TCP, then stops the array. This
triggers the block device removal -> elevator cleanup -> network teardown
path that exposes the circular dependency.
Signed-off-by: Chaitanya Kulkarni <ckulkarnilinux@gmail.com>
---
Hi,
Full disclosure: I'm not an expert in this area; if there is a better
solution, I'll be happy to try it.
-ck
---
net/ipv4/tcp.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 8a18aeca7ab0..9fd01a8b90b5 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -3363,14 +3363,15 @@ int tcp_disconnect(struct sock *sk, int flags)
 	} else if (unlikely(tp->repair)) {
 		WRITE_ONCE(sk->sk_err, ECONNABORTED);
 	} else if (tcp_need_reset(old_state)) {
-		tcp_send_active_reset(sk, gfp_any(), SK_RST_REASON_TCP_STATE);
+		/* Use GFP_ATOMIC since we're holding sk_lock */
+		tcp_send_active_reset(sk, GFP_ATOMIC, SK_RST_REASON_TCP_STATE);
 		WRITE_ONCE(sk->sk_err, ECONNRESET);
 	} else if (tp->snd_nxt != tp->write_seq &&
 		   (1 << old_state) & (TCPF_CLOSING | TCPF_LAST_ACK)) {
 		/* The last check adjusts for discrepancy of Linux wrt. RFC
 		 * states
 		 */
-		tcp_send_active_reset(sk, gfp_any(),
+		tcp_send_active_reset(sk, GFP_ATOMIC,
 				      SK_RST_REASON_TCP_DISCONNECT_WITH_DATA);
 		WRITE_ONCE(sk->sk_err, ECONNRESET);
 	} else if (old_state == TCP_SYN_SENT)
--
2.40.0
* Re: [RFC blktests fix PATCH] tcp: use GFP_ATOMIC in tcp_disconnect
From: Christoph Hellwig @ 2025-11-25 6:27 UTC
To: Chaitanya Kulkarni
Cc: kbusch, hch, hare, sagi, axboe, dlemoal, wagi, mpatocka, yukuai3,
xni, linan122, bmarzins, john.g.garry, edumazet, ncardwell,
kuniyu, davem, dsahern, kuba, pabeni, horms, netdev, linux-nvme,
linux-block
I don't think GFP_ATOMIC is right here, you want GFP_NOIO.
And just use the scope API so that you don't have to pass a gfp_t
several layers down.
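That is, wrap the socket teardown in a noio scope, roughly:

	unsigned int noio_flag;

	noio_flag = memalloc_noio_save();
	/*
	 * socket teardown goes here; any allocation made in this scope
	 * implicitly behaves as GFP_NOIO
	 */
	memalloc_noio_restore(noio_flag);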
* Re: [RFC blktests fix PATCH] tcp: use GFP_ATOMIC in tcp_disconnect
From: Chaitanya Kulkarni @ 2025-11-25 7:28 UTC
To: Christoph Hellwig, Chaitanya Kulkarni
Cc: kbusch@kernel.org, hch@lst.de, hare@suse.de, sagi@grimberg.me,
axboe@kernel.dk, dlemoal@kernel.org, wagi@kernel.org,
mpatocka@redhat.com, yukuai3@huawei.com, xni@redhat.com,
linan122@huawei.com, bmarzins@redhat.com, john.g.garry@oracle.com,
edumazet@google.com, ncardwell@google.com, kuniyu@google.com,
davem@davemloft.net, dsahern@kernel.org, kuba@kernel.org,
pabeni@redhat.com, horms@kernel.org, netdev@vger.kernel.org,
linux-nvme@lists.infradead.org, linux-block@vger.kernel.org
On 11/24/25 22:27, Christoph Hellwig wrote:
> I don't think GFP_ATOMIC is right here, you want GFP_NOIO.
>
> And just use the scope API so that you don't have to pass a gfp_t
> several layers down.
>
>
Are you saying something like this?
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 29ad4735fac6..56d0a3777a4d 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1438,17 +1438,28 @@ static void nvme_tcp_free_queue(struct nvme_ctrl *nctrl, int qid)
 	struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
 	struct nvme_tcp_queue *queue = &ctrl->queues[qid];
 	unsigned int noreclaim_flag;
+	unsigned int noio_flag;
 
 	if (!test_and_clear_bit(NVME_TCP_Q_ALLOCATED, &queue->flags))
 		return;
 
 	page_frag_cache_drain(&queue->pf_cache);
 
+	/*
+	 * Prevent memory reclaim from triggering block I/O during socket
+	 * teardown. The socket release path fput -> tcp_close ->
+	 * tcp_disconnect -> tcp_send_active_reset may allocate memory, and
+	 * allowing reclaim to issue I/O could deadlock if we're being called
+	 * from block device teardown (e.g., del_gendisk -> elevator cleanup)
+	 * which holds locks that the I/O completion path needs.
+	 */
+	noio_flag = memalloc_noio_save();
 	noreclaim_flag = memalloc_noreclaim_save();
 	/* ->sock will be released by fput() */
 	fput(queue->sock->file);
 	queue->sock = NULL;
 	memalloc_noreclaim_restore(noreclaim_flag);
+	memalloc_noio_restore(noio_flag);
 
 	kfree(queue->pdu);
 	mutex_destroy(&queue->send_mutex);
--
2.40.0
-ck
* Re: [RFC blktests fix PATCH] tcp: use GFP_ATOMIC in tcp_disconnect
From: hch @ 2025-11-25 11:08 UTC
To: Chaitanya Kulkarni
Cc: Chaitanya Kulkarni, kbusch@kernel.org, hch@lst.de, hare@suse.de,
sagi@grimberg.me, axboe@kernel.dk, dlemoal@kernel.org,
wagi@kernel.org, mpatocka@redhat.com, yukuai3@huawei.com,
xni@redhat.com, linan122@huawei.com, bmarzins@redhat.com,
john.g.garry@oracle.com, edumazet@google.com,
ncardwell@google.com, kuniyu@google.com, davem@davemloft.net,
dsahern@kernel.org, kuba@kernel.org, pabeni@redhat.com,
horms@kernel.org, netdev@vger.kernel.org,
linux-nvme@lists.infradead.org, linux-block@vger.kernel.org
On Tue, Nov 25, 2025 at 07:28:45AM +0000, Chaitanya Kulkarni wrote:
> Are you saying something like this?
Yes, something like this. I wonder if there's a way to avoid the nested
flag saving, though.
* Re: [RFC blktests fix PATCH] tcp: use GFP_ATOMIC in tcp_disconnect
From: Nilay Shroff @ 2025-11-25 11:13 UTC
To: Chaitanya Kulkarni, Christoph Hellwig, Chaitanya Kulkarni
Cc: kbusch@kernel.org, hch@lst.de, hare@suse.de, sagi@grimberg.me,
axboe@kernel.dk, dlemoal@kernel.org, wagi@kernel.org,
mpatocka@redhat.com, yukuai3@huawei.com, xni@redhat.com,
linan122@huawei.com, bmarzins@redhat.com, john.g.garry@oracle.com,
edumazet@google.com, ncardwell@google.com, kuniyu@google.com,
davem@davemloft.net, dsahern@kernel.org, kuba@kernel.org,
pabeni@redhat.com, horms@kernel.org, netdev@vger.kernel.org,
linux-nvme@lists.infradead.org, linux-block@vger.kernel.org
On 11/25/25 12:58 PM, Chaitanya Kulkarni wrote:
> On 11/24/25 22:27, Christoph Hellwig wrote:
>> I don't think GFP_ATOMIC is right here, you want GFP_NOIO.
>>
>> And just use the scope API so that you don't have to pass a gfp_t
>> several layers down.
>>
>>
> Are you saying something like this?
>
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 29ad4735fac6..56d0a3777a4d 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -1438,17 +1438,28 @@ static void nvme_tcp_free_queue(struct nvme_ctrl *nctrl, int qid)
>  	struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
>  	struct nvme_tcp_queue *queue = &ctrl->queues[qid];
>  	unsigned int noreclaim_flag;
> +	unsigned int noio_flag;
>  
>  	if (!test_and_clear_bit(NVME_TCP_Q_ALLOCATED, &queue->flags))
>  		return;
>  
>  	page_frag_cache_drain(&queue->pf_cache);
>  
> +	/*
> +	 * Prevent memory reclaim from triggering block I/O during socket
> +	 * teardown. The socket release path fput -> tcp_close ->
> +	 * tcp_disconnect -> tcp_send_active_reset may allocate memory, and
> +	 * allowing reclaim to issue I/O could deadlock if we're being called
> +	 * from block device teardown (e.g., del_gendisk -> elevator cleanup)
> +	 * which holds locks that the I/O completion path needs.
> +	 */
> +	noio_flag = memalloc_noio_save();
>  	noreclaim_flag = memalloc_noreclaim_save();
>  	/* ->sock will be released by fput() */
>  	fput(queue->sock->file);
>  	queue->sock = NULL;
>  	memalloc_noreclaim_restore(noreclaim_flag);
> +	memalloc_noio_restore(noio_flag);
>  
>  	kfree(queue->pdu);
>  	mutex_destroy(&queue->send_mutex);
The memalloc_noreclaim_save() above should already prevent filesystem
reclaim, so if the goal is to avoid fs_reclaim, we should not need an
additional memalloc_noio_save() here. That makes me wonder whether we are
looking at the correct code path: if this teardown path
(nvme_tcp_free_queue()) is indeed executed, it should already be avoiding
filesystem reclaim in the first place.
Thanks,
--Nilay
* Re: [RFC blktests fix PATCH] tcp: use GFP_ATOMIC in tcp_disconnect
From: hch @ 2025-11-25 11:21 UTC
To: Nilay Shroff
Cc: Chaitanya Kulkarni, Christoph Hellwig, Chaitanya Kulkarni,
kbusch@kernel.org, hch@lst.de, hare@suse.de, sagi@grimberg.me,
axboe@kernel.dk, dlemoal@kernel.org, wagi@kernel.org,
mpatocka@redhat.com, yukuai3@huawei.com, xni@redhat.com,
linan122@huawei.com, bmarzins@redhat.com, john.g.garry@oracle.com,
edumazet@google.com, ncardwell@google.com, kuniyu@google.com,
davem@davemloft.net, dsahern@kernel.org, kuba@kernel.org,
pabeni@redhat.com, horms@kernel.org, netdev@vger.kernel.org,
linux-nvme@lists.infradead.org, linux-block@vger.kernel.org
On Tue, Nov 25, 2025 at 04:43:25PM +0530, Nilay Shroff wrote:
> The memalloc_noreclaim_save() above should already prevent filesystem
> reclaim,
memalloc_noreclaim_save() is oddly misnamed: it sets PF_MEMALLOC, which
does not cause any gfp_t flag adjustments but instead avoids direct
reclaim entirely. Come to think of it, I have no idea why it is even
used here.
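For comparison, the two helpers boil down to (include/linux/sched/mm.h,
paraphrased):

	static inline unsigned int memalloc_noreclaim_save(void)
	{
		unsigned int flags = current->flags & PF_MEMALLOC;

		/* allocator treats the task as the reclaimer itself:
		 * no direct reclaim, access to memory reserves */
		current->flags |= PF_MEMALLOC;
		return flags;
	}

	static inline unsigned int memalloc_noio_save(void)
	{
		unsigned int flags = current->flags & PF_MEMALLOC_NOIO;

		/* current_gfp_context() masks __GFP_IO/__GFP_FS, so every
		 * allocation in this scope behaves as GFP_NOIO */
		current->flags |= PF_MEMALLOC_NOIO;
		return flags;
	}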
* Re: [RFC blktests fix PATCH] tcp: use GFP_ATOMIC in tcp_disconnect
From: Nilay Shroff @ 2025-11-25 11:28 UTC
To: hch@lst.de
Cc: Chaitanya Kulkarni, Christoph Hellwig, Chaitanya Kulkarni,
kbusch@kernel.org, hare@suse.de, sagi@grimberg.me,
axboe@kernel.dk, dlemoal@kernel.org, wagi@kernel.org,
mpatocka@redhat.com, yukuai3@huawei.com, xni@redhat.com,
linan122@huawei.com, bmarzins@redhat.com, john.g.garry@oracle.com,
edumazet@google.com, ncardwell@google.com, kuniyu@google.com,
davem@davemloft.net, dsahern@kernel.org, kuba@kernel.org,
pabeni@redhat.com, horms@kernel.org, netdev@vger.kernel.org,
linux-nvme@lists.infradead.org, linux-block@vger.kernel.org
On 11/25/25 4:51 PM, hch@lst.de wrote:
> On Tue, Nov 25, 2025 at 04:43:25PM +0530, Nilay Shroff wrote:
>> The memalloc_noreclaim_save() above should already prevent filesystem
>> reclaim,
>
> memalloc_noreclaim_save() is oddly misnamed: it sets PF_MEMALLOC, which
> does not cause any gfp_t flag adjustments but instead avoids direct
> reclaim entirely. Come to think of it, I have no idea why it is even
> used here.
>
From the git history, I see it was added to avoid memory reclaim and the
resulting circular locking dependency: commit 83e1226b0ee2 ("nvme-tcp:
fix possible circular locking when deleting a controller under memory
pressure") introduced it.
Thanks,
--Nilay
* Re: [RFC blktests fix PATCH] tcp: use GFP_ATOMIC in tcp_disconnect
From: Christoph Hellwig @ 2025-11-25 11:30 UTC
To: Nilay Shroff
Cc: Chaitanya Kulkarni, Chaitanya Kulkarni, kbusch@kernel.org,
hare@suse.de, sagi@grimberg.me, axboe@kernel.dk,
dlemoal@kernel.org, wagi@kernel.org, mpatocka@redhat.com,
yukuai3@huawei.com, xni@redhat.com, linan122@huawei.com,
bmarzins@redhat.com, john.g.garry@oracle.com, edumazet@google.com,
ncardwell@google.com, kuniyu@google.com, davem@davemloft.net,
dsahern@kernel.org, kuba@kernel.org, pabeni@redhat.com,
horms@kernel.org, netdev@vger.kernel.org,
linux-nvme@lists.infradead.org, linux-block@vger.kernel.org
On Tue, Nov 25, 2025 at 04:58:32PM +0530, Nilay Shroff wrote:
> From the git history, I see it was added to avoid memory reclaim and the
> resulting circular locking dependency: commit 83e1226b0ee2 ("nvme-tcp:
> fix possible circular locking when deleting a controller under memory
> pressure") introduced it.
I suspect this was intended to be noio, and we should just switch to
that.
* Re: [RFC blktests fix PATCH] tcp: use GFP_ATOMIC in tcp_disconnect
From: Nilay Shroff @ 2025-11-25 11:54 UTC
To: Christoph Hellwig
Cc: Chaitanya Kulkarni, Chaitanya Kulkarni, kbusch@kernel.org,
hare@suse.de, sagi@grimberg.me, axboe@kernel.dk,
dlemoal@kernel.org, wagi@kernel.org, mpatocka@redhat.com,
yukuai3@huawei.com, xni@redhat.com, linan122@huawei.com,
bmarzins@redhat.com, john.g.garry@oracle.com, edumazet@google.com,
ncardwell@google.com, kuniyu@google.com, davem@davemloft.net,
dsahern@kernel.org, kuba@kernel.org, pabeni@redhat.com,
horms@kernel.org, netdev@vger.kernel.org,
linux-nvme@lists.infradead.org, linux-block@vger.kernel.org
On 11/25/25 5:00 PM, Christoph Hellwig wrote:
> On Tue, Nov 25, 2025 at 04:58:32PM +0530, Nilay Shroff wrote:
>> From the git history, I see it was added to avoid memory reclaim and the
>> resulting circular locking dependency: commit 83e1226b0ee2 ("nvme-tcp:
>> fix possible circular locking when deleting a controller under memory
>> pressure") introduced it.
>
> I suspect this was intended to be noio, and we should just switch to
> that.
>
Yeah, I agree that this should be changed to noio. However, it seems that
this alone may not be sufficient to fix the lockdep splat reported here.
Since the real work of fput() might be deferred to another thread, a noio
scope in this path may not have the intended effect: the scope flags are
per-task, so they would not cover the deferred release.
That said, I noticed another fix from Chaitanya [1], where fput() is replaced
with __fput_sync(). With that change in place, combining the noio scope adjustment
with the switch to __fput_sync() should indeed address the lockdep issue. Given
that both problems have very similar lockdep signatures, it might make sense to
merge these two changes into a single fix.
[1] https://lore.kernel.org/all/20251125005950.41046-1-ckulkarnilinux@gmail.com/
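Roughly, the combined change inside nvme_tcp_free_queue() would look like
this (untested sketch on top of [1]):

	noio_flag = memalloc_noio_save();
	/*
	 * __fput_sync() runs the final release in this task, so tcp_close()
	 * and any allocation it performs happen inside the noio scope
	 * instead of in deferred task work on another thread.
	 */
	__fput_sync(queue->sock->file);
	queue->sock = NULL;
	memalloc_noio_restore(noio_flag);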
Thanks,
--Nilay
* Re: [RFC blktests fix PATCH] tcp: use GFP_ATOMIC in tcp_disconnect
From: Hannes Reinecke @ 2025-11-25 12:33 UTC
To: Nilay Shroff, Christoph Hellwig
Cc: Chaitanya Kulkarni, Chaitanya Kulkarni, kbusch@kernel.org,
sagi@grimberg.me, axboe@kernel.dk, dlemoal@kernel.org,
wagi@kernel.org, mpatocka@redhat.com, yukuai3@huawei.com,
xni@redhat.com, linan122@huawei.com, bmarzins@redhat.com,
john.g.garry@oracle.com, edumazet@google.com,
ncardwell@google.com, kuniyu@google.com, davem@davemloft.net,
dsahern@kernel.org, kuba@kernel.org, pabeni@redhat.com,
horms@kernel.org, netdev@vger.kernel.org,
linux-nvme@lists.infradead.org, linux-block@vger.kernel.org
On 11/25/25 12:54, Nilay Shroff wrote:
>
>
> On 11/25/25 5:00 PM, Christoph Hellwig wrote:
>> On Tue, Nov 25, 2025 at 04:58:32PM +0530, Nilay Shroff wrote:
>>> From the git history, I see it was added to avoid memory reclaim and the
>>> resulting circular locking dependency: commit 83e1226b0ee2 ("nvme-tcp:
>>> fix possible circular locking when deleting a controller under memory
>>> pressure") introduced it.
>>
>> I suspect this was intended to be noio, and we should just switch to
>> that.
>>
> Yeah, I agree that this should be changed to noio. However, it seems that
> this alone may not be sufficient to fix the lockdep splat reported here.
> Since the real work of fput() might be deferred to another thread, a noio
> scope in this path may not have the intended effect: the scope flags are
> per-task, so they would not cover the deferred release.
>
> That said, I noticed another fix from Chaitanya [1], where fput() is replaced
> with __fput_sync(). With that change in place, combining the noio scope adjustment
> with the switch to __fput_sync() should indeed address the lockdep issue. Given
> that both problems have very similar lockdep signatures, it might make sense to
> merge these two changes into a single fix.
>
> [1] https://lore.kernel.org/all/20251125005950.41046-1-ckulkarnilinux@gmail.com/
>
Yes, we should.
Both address similar symptoms, and I wouldn't be surprised if both turn
out to address the same issue. (Frame #0 from here is basically identical
to frame #2 in the referenced issue.)
So please roll both into one patch.
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich