public inbox for linux-nvme@lists.infradead.org
From: Hannes Reinecke <hare@suse.de>
To: Chaitanya Kulkarni <kch@nvidia.com>, kbusch@kernel.org, sagi@grimberg.me
Cc: hch@lst.de, linux-nvme@lists.infradead.org
Subject: Re: [PATCH V4] nvme-tcp: teardown circular locking fixes
Date: Tue, 14 Apr 2026 12:08:46 +0200	[thread overview]
Message-ID: <a105337a-b5ae-4892-953d-96540d3ddd29@suse.de> (raw)
In-Reply-To: <20260413171628.6204-1-kch@nvidia.com>

On 4/13/26 19:16, Chaitanya Kulkarni wrote:
> When a controller reset is triggered via sysfs (by writing to
> /sys/class/nvme/<nvmedev>/reset_controller), the reset work tears down
> and re-establishes all queues. Releasing the socket via fput() defers
> the actual cleanup to task_work (the delayed_fput path). This deferred
> cleanup can race with the subsequent queue re-allocation during reset,
> potentially leading to a use-after-free or resource conflicts.
> 
> Replace fput() with __fput_sync() to ensure synchronous socket release,
> guaranteeing that all socket resources are fully cleaned up before the
> function returns. This prevents races during controller reset where
> new queue setup may begin before the old socket is fully released.
> 
> * Call chain during reset:
>    nvme_reset_ctrl_work()
>      -> nvme_tcp_teardown_ctrl()
>        -> nvme_tcp_teardown_io_queues()
>          -> nvme_tcp_free_io_queues()
>            -> nvme_tcp_free_queue()       <-- fput() -> __fput_sync()
>        -> nvme_tcp_teardown_admin_queue()
>          -> nvme_tcp_free_admin_queue()
>            -> nvme_tcp_free_queue()       <-- fput() -> __fput_sync()
>      -> nvme_tcp_setup_ctrl()             <-- race with deferred fput
> 
> memalloc_noreclaim_save() sets PF_MEMALLOC, which is intended for tasks
> performing memory-reclaim work that need access to the memory reserves.
> While PF_MEMALLOC prevents the task from entering direct reclaim
> (causing __need_reclaim() to return false), it does not strip __GFP_IO
> from the gfp flags. The allocator can therefore still trigger writeback
> I/O while __GFP_IO remains set, which is unsafe when the caller holds
> block-layer locks.
> 
> Switch to memalloc_noio_save() which sets PF_MEMALLOC_NOIO. This causes
> current_gfp_context() to strip __GFP_IO|__GFP_FS from every allocation in
> the scope, making it safe to allocate memory while holding elevator_lock and
> set->srcu.
> 
> * The issue can be reproduced using blktests:
> 
>    nvme_trtype=tcp ./check nvme/005
> blktests (master) # nvme_trtype=tcp ./check nvme/005
> nvme/005 (tr=tcp) (reset local loopback target)              [failed]
>      runtime  0.725s  ...  0.798s
>      something found in dmesg:
>      [  108.473940] run blktests nvme/005 at 2025-11-22 16:12:20
> 
>      [...]
>      ...
>      (See '/root/blktests/results/nodev_tr_tcp/nvme/005.dmesg' for the entire message)
> blktests (master) # cat /root/blktests/results/nodev_tr_tcp/nvme/005.dmesg
> [  108.473940] run blktests nvme/005 at 2025-11-22 16:12:20
> [  108.526983] loop0: detected capacity change from 0 to 2097152
> [  108.555606] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
> [  108.572531] nvmet_tcp: enabling port 0 (127.0.0.1:4420)
> [  108.613061] nvmet: Created nvm controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349.
> [  108.616832] nvme nvme0: creating 48 I/O queues.
> [  108.630791] nvme nvme0: mapped 48/0/0 default/read/poll queues.
> [  108.661892] nvme nvme0: new ctrl: NQN "blktests-subsystem-1", addr 127.0.0.1:4420, hostnqn: nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349
> [  108.746639] nvmet: Created nvm controller 2 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349.
> [  108.748466] nvme nvme0: creating 48 I/O queues.
> [  108.802984] nvme nvme0: mapped 48/0/0 default/read/poll queues.
> [  108.829983] nvme nvme0: Removing ctrl: NQN "blktests-subsystem-1"
> [  108.854288] block nvme0n1: no available path - failing I/O
> [  108.854344] block nvme0n1: no available path - failing I/O
> [  108.854373] Buffer I/O error on dev nvme0n1, logical block 1, async page read
> 
> [  108.891693] ======================================================
> [  108.895912] WARNING: possible circular locking dependency detected
> [  108.900184] 6.17.0nvme+ #3 Tainted: G                 N
> [  108.903913] ------------------------------------------------------
> [  108.908171] nvme/2734 is trying to acquire lock:
> [  108.911957] ffff88810210e610 (set->srcu){.+.+}-{0:0}, at: __synchronize_srcu+0x17/0x170
> [  108.917587]
>                 but task is already holding lock:
> [  108.921570] ffff88813abea198 (&q->elevator_lock){+.+.}-{4:4}, at: elevator_change+0xa8/0x1c0
> [  108.927361]
>                 which lock already depends on the new lock.
> 
> [  108.933018]
>                 the existing dependency chain (in reverse order) is:
> [  108.938223]
>                 -> #4 (&q->elevator_lock){+.+.}-{4:4}:
> [  108.942988]        __mutex_lock+0xa2/0x1150
> [  108.945873]        elevator_change+0xa8/0x1c0
> [  108.948925]        elv_iosched_store+0xdf/0x140
> [  108.952043]        kernfs_fop_write_iter+0x16a/0x220
> [  108.955367]        vfs_write+0x378/0x520
> [  108.957598]        ksys_write+0x67/0xe0
> [  108.959721]        do_syscall_64+0x76/0xbb0
> [  108.962052]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [  108.965145]
>                 -> #3 (&q->q_usage_counter(io)){++++}-{0:0}:
> [  108.968923]        blk_alloc_queue+0x30e/0x350
> [  108.972117]        blk_mq_alloc_queue+0x61/0xd0
> [  108.974677]        scsi_alloc_sdev+0x2a0/0x3e0
> [  108.977092]        scsi_probe_and_add_lun+0x1bd/0x430
> [  108.979921]        __scsi_add_device+0x109/0x120
> [  108.982504]        ata_scsi_scan_host+0x97/0x1c0
> [  108.984365]        async_run_entry_fn+0x2d/0x130
> [  108.986109]        process_one_work+0x20e/0x630
> [  108.987830]        worker_thread+0x184/0x330
> [  108.989473]        kthread+0x10a/0x250
> [  108.990852]        ret_from_fork+0x297/0x300
> [  108.992491]        ret_from_fork_asm+0x1a/0x30
> [  108.994159]
>                 -> #2 (fs_reclaim){+.+.}-{0:0}:
> [  108.996320]        fs_reclaim_acquire+0x99/0xd0
> [  108.998058]        kmem_cache_alloc_node_noprof+0x4e/0x3c0
> [  109.000123]        __alloc_skb+0x15f/0x190
> [  109.002195]        tcp_send_active_reset+0x3f/0x1e0
> [  109.004038]        tcp_disconnect+0x50b/0x720
> [  109.005695]        __tcp_close+0x2b8/0x4b0
> [  109.007227]        tcp_close+0x20/0x80
> [  109.008663]        inet_release+0x31/0x60
> [  109.010175]        __sock_release+0x3a/0xc0
> [  109.011778]        sock_close+0x14/0x20
> [  109.013263]        __fput+0xee/0x2c0
> [  109.014673]        delayed_fput+0x31/0x50
> [  109.016183]        process_one_work+0x20e/0x630
> [  109.017897]        worker_thread+0x184/0x330
> [  109.019543]        kthread+0x10a/0x250
> [  109.020929]        ret_from_fork+0x297/0x300
> [  109.022565]        ret_from_fork_asm+0x1a/0x30
> [  109.024194]
>                 -> #1 (sk_lock-AF_INET-NVME){+.+.}-{0:0}:
> [  109.026634]        lock_sock_nested+0x2e/0x70
> [  109.028251]        tcp_sendmsg+0x1a/0x40
> [  109.029783]        sock_sendmsg+0xed/0x110
> [  109.031321]        nvme_tcp_try_send_cmd_pdu+0x13e/0x260 [nvme_tcp]
> [  109.034263]        nvme_tcp_try_send+0xb3/0x330 [nvme_tcp]
> [  109.036375]        nvme_tcp_queue_rq+0x342/0x3d0 [nvme_tcp]
> [  109.038528]        blk_mq_dispatch_rq_list+0x297/0x800
> [  109.040448]        __blk_mq_sched_dispatch_requests+0x3db/0x5f0
> [  109.042677]        blk_mq_sched_dispatch_requests+0x29/0x70
> [  109.044787]        blk_mq_run_work_fn+0x76/0x1b0
> [  109.046535]        process_one_work+0x20e/0x630
> [  109.048245]        worker_thread+0x184/0x330
> [  109.049890]        kthread+0x10a/0x250
> [  109.051331]        ret_from_fork+0x297/0x300
> [  109.053024]        ret_from_fork_asm+0x1a/0x30
> [  109.054740]
>                 -> #0 (set->srcu){.+.+}-{0:0}:
> [  109.056850]        __lock_acquire+0x1468/0x2210
> [  109.058614]        lock_sync+0xa5/0x110
> [  109.060048]        __synchronize_srcu+0x49/0x170
> [  109.061802]        elevator_switch+0xc9/0x330
> [  109.063950]        elevator_change+0x128/0x1c0
> [  109.065675]        elevator_set_none+0x4c/0x90
> [  109.067316]        blk_unregister_queue+0xa8/0x110
> [  109.069165]        __del_gendisk+0x14e/0x3c0
> [  109.070824]        del_gendisk+0x75/0xa0
> [  109.072328]        nvme_ns_remove+0xf2/0x230 [nvme_core]
> [  109.074365]        nvme_remove_namespaces+0xf2/0x150 [nvme_core]
> [  109.076652]        nvme_do_delete_ctrl+0x71/0x90 [nvme_core]
> [  109.078775]        nvme_delete_ctrl_sync+0x3b/0x50 [nvme_core]
> [  109.081009]        nvme_sysfs_delete+0x34/0x40 [nvme_core]
> [  109.083082]        kernfs_fop_write_iter+0x16a/0x220
> [  109.085009]        vfs_write+0x378/0x520
> [  109.086539]        ksys_write+0x67/0xe0
> [  109.087982]        do_syscall_64+0x76/0xbb0
> [  109.089577]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [  109.091665]
>                 other info that might help us debug this:
> 
> [  109.095478] Chain exists of:
>                   set->srcu --> &q->q_usage_counter(io) --> &q->elevator_lock
> 
> [  109.099544]  Possible unsafe locking scenario:
> 
> [  109.101708]        CPU0                    CPU1
> [  109.103402]        ----                    ----
> [  109.105103]   lock(&q->elevator_lock);
> [  109.106530]                                lock(&q->q_usage_counter(io));
> [  109.109022]                                lock(&q->elevator_lock);
> [  109.111391]   sync(set->srcu);
> [  109.112586]
>                  *** DEADLOCK ***
> 
> [  109.114772] 5 locks held by nvme/2734:
> [  109.116189]  #0: ffff888101925410 (sb_writers#4){.+.+}-{0:0}, at: ksys_write+0x67/0xe0
> [  109.119143]  #1: ffff88817a914e88 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x10f/0x220
> [  109.123141]  #2: ffff8881046313f8 (kn->active#185){++++}-{0:0}, at: sysfs_remove_file_self+0x26/0x50
> [  109.126543]  #3: ffff88810470e1d0 (&set->update_nr_hwq_lock){++++}-{4:4}, at: del_gendisk+0x6d/0xa0
> [  109.129891]  #4: ffff88813abea198 (&q->elevator_lock){+.+.}-{4:4}, at: elevator_change+0xa8/0x1c0
> [  109.133149]
>                 stack backtrace:
> [  109.134817] CPU: 6 UID: 0 PID: 2734 Comm: nvme Tainted: G                 N  6.17.0nvme+ #3 PREEMPT(voluntary)
> [  109.134819] Tainted: [N]=TEST
> [  109.134820] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
> [  109.134821] Call Trace:
> [  109.134823]  <TASK>
> [  109.134824]  dump_stack_lvl+0x75/0xb0
> [  109.134828]  print_circular_bug+0x26a/0x330
> [  109.134831]  check_noncircular+0x12f/0x150
> [  109.134834]  __lock_acquire+0x1468/0x2210
> [  109.134837]  ? __synchronize_srcu+0x17/0x170
> [  109.134838]  lock_sync+0xa5/0x110
> [  109.134840]  ? __synchronize_srcu+0x17/0x170
> [  109.134842]  __synchronize_srcu+0x49/0x170
> [  109.134843]  ? mark_held_locks+0x49/0x80
> [  109.134845]  ? _raw_spin_unlock_irqrestore+0x2d/0x60
> [  109.134847]  ? kvm_clock_get_cycles+0x14/0x30
> [  109.134853]  ? ktime_get_mono_fast_ns+0x36/0xb0
> [  109.134858]  elevator_switch+0xc9/0x330
> [  109.134860]  elevator_change+0x128/0x1c0
> [  109.134862]  ? kernfs_put.part.0+0x86/0x290
> [  109.134864]  elevator_set_none+0x4c/0x90
> [  109.134866]  blk_unregister_queue+0xa8/0x110
> [  109.134868]  __del_gendisk+0x14e/0x3c0
> [  109.134870]  del_gendisk+0x75/0xa0
> [  109.134872]  nvme_ns_remove+0xf2/0x230 [nvme_core]
> [  109.134879]  nvme_remove_namespaces+0xf2/0x150 [nvme_core]
> [  109.134887]  nvme_do_delete_ctrl+0x71/0x90 [nvme_core]
> [  109.134893]  nvme_delete_ctrl_sync+0x3b/0x50 [nvme_core]
> [  109.134899]  nvme_sysfs_delete+0x34/0x40 [nvme_core]
> [  109.134905]  kernfs_fop_write_iter+0x16a/0x220
> [  109.134908]  vfs_write+0x378/0x520
> [  109.134911]  ksys_write+0x67/0xe0
> [  109.134913]  do_syscall_64+0x76/0xbb0
> [  109.134915]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [  109.134916] RIP: 0033:0x7fd68a737317
> [  109.134917] Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
> [  109.134919] RSP: 002b:00007ffded1546d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
> [  109.134920] RAX: ffffffffffffffda RBX: 000000000054f7e0 RCX: 00007fd68a737317
> [  109.134921] RDX: 0000000000000001 RSI: 00007fd68a855719 RDI: 0000000000000003
> [  109.134921] RBP: 0000000000000003 R08: 0000000030407850 R09: 00007fd68a7cd4e0
> [  109.134922] R10: 00007fd68a65b130 R11: 0000000000000246 R12: 00007fd68a855719
> [  109.134923] R13: 00000000304074c0 R14: 00000000304074c0 R15: 0000000030408660
> [  109.134926]  </TASK>
> [  109.962756] Key type psk unregistered
> 
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
> Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
> ---
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich



Thread overview: 3+ messages
2026-04-13 17:16 [PATCH V4] nvme-tcp: teardown circular locking fixes Chaitanya Kulkarni
2026-04-13 17:19 ` Chaitanya Kulkarni
2026-04-14 10:08 ` Hannes Reinecke [this message]
