* Re: nvme-6.3: blktests nvme/003 null pointer for fc transport
2023-02-01 8:10 ` Daniel Wagner
@ 2023-02-01 8:24 ` Hannes Reinecke
2023-02-20 16:04 ` Daniel Wagner
1 sibling, 0 replies; 5+ messages in thread
From: Hannes Reinecke @ 2023-02-01 8:24 UTC (permalink / raw)
To: Daniel Wagner, James Smart
Cc: linux-nvme@lists.infradead.org, Shinichiro Kawasaki
On 2/1/23 09:10, Daniel Wagner wrote:
> On Tue, Jan 31, 2023 at 03:27:20PM -0800, James Smart wrote:
>> On 1/30/2023 3:15 AM, Daniel Wagner wrote:
>>> Hi,
>>>
>>> Hannes added support to blktests to enable the FC transport, and as a result
>>> blktests is able to crash the kernel. Haven't started to look into it yet.
>>>
>>> run blktests nvme/003 at 2023-01-30 12:08:28
>>> nvmet: adding nsid 1 to subsystem blktests-subsystem-1
>>> BUG: kernel NULL pointer dereference, address: 0000000000000010
>>> #PF: supervisor read access in kernel mode
>>> #PF: error_code(0x0000) - not-present page
>>> PGD 0 P4D 0
>>> Oops: 0000 [#1] PREEMPT SMP NOPTI
>>> CPU: 3 PID: 14200 Comm: nvme Not tainted 6.2.0-rc4+ #1 c9afca57e89c2aa3234427c2cd6cd4dd2984ae8c
>>> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
>>> RIP: 0010:nvme_alloc_admin_tag_set+0x38/0x120 [nvme_core]
>>
>> bingo...
>>
>> The cleanup patch "nvme-fc: use the tagset alloc/free helpers" introduced
>> it.
>>
>> See the fix patch "nvme-fc: Fix initialization order" about 10 days ago that
>> corrects it.
>
> Good call! This problem is gone. Now I am running into the next problem:
>
>
> run blktests nvme/003 at 2023-02-01 09:01:58
> nvmet: adding nsid 1 to subsystem blktests-subsystem-1
> debugfs: Directory 'nvme0' with parent '/' already present!
> nvme0: failed to create debugfs attr
> nvme nvme0: NVME-FC{0}: create association : host wwpn 0x20001100aa000002 rport wwpn 0x20001100aa000001: NQN "nqn.2014-08.org.nvmexpress.discovery"
> (NULL device *): {0:0} Association created
> nvmet: creating discovery controller 1 for subsystem nqn.2014-08.org.nvmexpress.discovery for NQN nqn.2014-08.org.nvmexpress:uuid:2cd2c43b-a90a-45c1-a8cd-86b33ab273b5.
> nvme nvme0: NVME-FC{0}: controller connect complete
> nvme nvme0: NVME-FC{0}: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery"
> nvme nvme1: NVME-FC{1}: create association : host wwpn 0x20001100aa000002 rport wwpn 0x20001100aa000001: NQN "blktests-subsystem-1"
> (NULL device *): {0:1} Association created
> nvmet: creating nvm controller 2 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:2cd2c43b-a90a-45c1-a8cd-86b33ab273b5.
> nvme nvme1: NVME-FC{1}: controller connect complete
> nvme nvme1: NVME-FC{1}: new ctrl: NQN "blktests-subsystem-1"
> nvme nvme0: Removing ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery"
> nvme nvme1: rescanning namespaces.
> (NULL device *): {0:0} Association deleted
> (NULL device *): {0:0} Association freed
> (NULL device *): Disconnect LS failed: No Association
> nvme nvme1: NVME-FC{1}: io failed due to lldd error 6
> nvme nvme1: NVME-FC{1}: transport association event: transport detected io error
> nvme nvme1: NVME-FC{1}: resetting controller
> debugfs: Directory 'nvme0' with parent '/' already present!
> nvme0: failed to create debugfs attr
> nvme nvme0: NVME-FC{0}: create association : host wwpn 0x20001100aa000002 rport wwpn 0x20001100aa000001: NQN "nqn.2014-08.org.nvmexpress.discovery"
> (NULL device *): {0:0} Association created
> nvmet: connect request for invalid subsystem nqn.2014-08.org.nvmexpress.discovery!
> nvme nvme0: Connect Invalid Data Parameter, subsysnqn "nqn.2014-08.org.nvmexpress.discovery"
> nvme nvme0: NVME-FC{0}: reset: Reconnect attempt failed (16770)
> nvme nvme0: NVME-FC{0}: reconnect failure
> nvme nvme0: NVME-FC{0}: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery"
> nvme nvme0: Removing ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery"
> nvme nvme1: NVME-FC{1}: create association : host wwpn 0x20001100aa000002 rport wwpn 0x20001100aa000001: NQN "blktests-subsystem-1"
> (NULL device *): {0:2} Association created
> nvmet: connect request for invalid subsystem blktests-subsystem-1!
> nvme nvme1: Connect Invalid Data Parameter, subsysnqn "blktests-subsystem-1"
> nvme nvme1: NVME-FC{1}: reset: Reconnect attempt failed (16770)
> nvme nvme1: NVME-FC{1}: reconnect failure
> nvme nvme1: Removing ctrl: NQN "blktests-subsystem-1"
> (NULL device *): {0:1} Association deleted
> (NULL device *): {0:1} Association freed
> (NULL device *): Disconnect LS failed: No Association
> (NULL device *): {0:0} Association deleted
> (NULL device *): {0:0} Association freed
> (NULL device *): Disconnect LS failed: No Association
> (NULL device *): {0:2} Association deleted
> (NULL device *): {0:2} Association freed
> ===========================================
> WARNING: possible recursive locking detected
> 6.2.0-rc4+ #1 Not tainted
> --------------------------------------------
> kworker/3:13/21736 is trying to acquire lock:
> ffff93fe60e251e8 ((work_completion)(&tport->ls_work)){+.+.}-{0:0}, at: __flush_work+0x38/0xc0
>
> but task is already holding lock:
> ffffa693c1a4fe68 ((work_completion)(&tport->ls_work)){+.+.}-{0:0}, at: process_one_work+0x236/0x410
>
> other info that might help us debug this:
> Possible unsafe locking scenario:
>
> CPU0
> ----
> lock((work_completion)(&tport->ls_work));
> lock((work_completion)(&tport->ls_work));
>
> *** DEADLOCK ***
>
> May be due to missing lock nesting notation
>
> 2 locks held by kworker/3:13/21736:
> #0: ffff93fe9882ff48 ((wq_completion)nvmet-wq){+.+.}-{0:0}, at: process_one_work+0x210/0x410
> #1: ffffa693c1a4fe68 ((work_completion)(&tport->ls_work)){+.+.}-{0:0}, at: process_one_work+0x236/0x410
>
> stack backtrace:
> CPU: 3 PID: 21736 Comm: kworker/3:13 Not tainted 6.2.0-rc4+ #1 c9afca57e89c2aa3234427c2cd6cd4dd2984ae8c
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
> Workqueue: nvmet-wq fcloop_tport_lsrqst_work [nvme_fcloop]
> Call Trace:
> <TASK>
> dump_stack_lvl+0x5a/0x80
> validate_chain+0x1e02/0x2730
> ? lockdep_unlock+0x51/0xc0
> ? lockdep_unlock+0x51/0xc0
> ? validate_chain+0x7c8/0x2730
> ? validate_chain+0x7c8/0x2730
> ? lock_release+0xf7/0x400
> ? __lock_acquire+0x8f7/0xbc0
> ? lock_is_held_type+0x8b/0x110
> __lock_acquire+0x8f7/0xbc0
> ? lock_is_held_type+0x8b/0x110
> lock_acquire+0xb0/0x180
> ? __flush_work+0x38/0xc0
> ? lock_is_held_type+0x8b/0x110
> __flush_work+0x57/0xc0
> ? __flush_work+0x38/0xc0
> ? lockdep_hardirqs_on_prepare+0x16e/0x220
> ? _raw_spin_unlock_irqrestore+0x32/0x50
> ? __free_pages_ok+0x289/0x4a0
> fcloop_targetport_delete+0x15/0xa0 [nvme_fcloop 2f0f07bf6746e7b4ec6dff8d226e919dac60e0f1]
> nvmet_fc_tgtport_put+0x132/0x190 [nvmet_fc cec46c808c67bf2a4fefe925da2828f0b3f2334a]
> nvmet_fc_disconnect_assoc_done+0x7c/0xa0 [nvmet_fc cec46c808c67bf2a4fefe925da2828f0b3f2334a]
> fcloop_tport_lsrqst_work+0x6c/0xa0 [nvme_fcloop 2f0f07bf6746e7b4ec6dff8d226e919dac60e0f1]
> process_one_work+0x26a/0x410
> worker_thread+0x21d/0x490
> ? rcu_lock_release+0x20/0x20
> kthread+0xe1/0x100
> ? kthread_blkcg+0x30/0x30
> ret_from_fork+0x1f/0x30
> </TASK>
> nvme_fcloop: fcloop_exit: Failed deleting target port
> nvme_fcloop: fcloop_exit: Failed deleting remote port
> nvme_fcloop: fcloop_exit: Failed deleting target port
> nvme_fcloop: fcloop_exit: Failed deleting remote port
> nvme_fcloop: fcloop_exit: Failed deleting target port
> nvme_fcloop: fcloop_exit: Failed deleting remote port
>
>
> The last two messages never stop.
And this is why I wanted to have these tests :-)
In my original test I had this:
+ _remove_nvmet_subsystem_from_port "${port1}" "${subsys}"
+ _remove_nvmet_subsystem_from_port "${port2}" "${subsys}"
+ _remove_nvmet_subsystem "${subsys}"
+ _remove_nvmet_anagroup "${port1}" "${ag1}"
+ _remove_nvmet_port "${port1}"
+ _remove_nvmet_anagroup "${port2}" "${ag2}"
+ _remove_nvmet_port "${port2}"
+
+ losetup -d "$loop_dev"
+ rm "$TMPDIR/img"
+
+ # Fixme: need to wait for RCU grace period
+ sleep 5
+
+ _nvme_fcloop_del_rport "${host_wwnn1}" "${host_wwpn1}" \
+ "${remote_wwnn1}" "${remote_wwpn1}"
The problem is a subtle workqueue inversion with fcloop. I'll have to dig
out the details here.
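For illustration, here is a minimal sketch of the pattern the lockdep splat
above is flagging (all names are made up, this is not the fcloop code): a work
item running on a workqueue ends up calling flush_work() on a work item of the
same lockdep class, in the worst case itself. That matches the trace, where
fcloop_tport_lsrqst_work (the handler for tport->ls_work) drops the last
tgtport reference and fcloop_targetport_delete() then flushes tport->ls_work.

#include <linux/module.h>
#include <linux/workqueue.h>

static struct workqueue_struct *demo_wq;
static struct work_struct demo_ls_work;

/*
 * Runs on demo_wq. Calling flush_work() on a work item of the same
 * lockdep class as the one currently executing (here: the very same
 * work) makes lockdep report "possible recursive locking detected"
 * on (work_completion)(&demo_ls_work), just like the splat above.
 */
static void demo_ls_work_fn(struct work_struct *work)
{
	flush_work(&demo_ls_work);	/* triggers the lockdep report */
}

static int __init demo_init(void)
{
	demo_wq = alloc_workqueue("demo-wq", 0, 0);
	if (!demo_wq)
		return -ENOMEM;

	INIT_WORK(&demo_ls_work, demo_ls_work_fn);
	queue_work(demo_wq, &demo_ls_work);
	return 0;
}

static void __exit demo_exit(void)
{
	destroy_workqueue(demo_wq);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");

Loading such a module reproduces the same (work_completion) recursion report,
so the fcloop cleanup path needs to get the flush out of the put/delete chain.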
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman
^ permalink raw reply [flat|nested] 5+ messages in thread

* Re: nvme-6.3: blktests nvme/003 null pointer for fc transport
2023-02-01 8:10 ` Daniel Wagner
2023-02-01 8:24 ` Hannes Reinecke
@ 2023-02-20 16:04 ` Daniel Wagner
1 sibling, 0 replies; 5+ messages in thread
From: Daniel Wagner @ 2023-02-20 16:04 UTC (permalink / raw)
To: James Smart
Cc: linux-nvme@lists.infradead.org, Shinichiro Kawasaki,
Hannes Reinecke, Ming Lei
On Wed, Feb 01, 2023 at 09:10:03AM +0100, Daniel Wagner wrote:
> nvme_fcloop: fcloop_exit: Failed deleting target port
> nvme_fcloop: fcloop_exit: Failed deleting remote port
> nvme_fcloop: fcloop_exit: Failed deleting target port
> nvme_fcloop: fcloop_exit: Failed deleting remote port
> nvme_fcloop: fcloop_exit: Failed deleting target port
> nvme_fcloop: fcloop_exit: Failed deleting remote port
I've retested with v6.2.0 with
--- a/drivers/nvme/target/fcloop.c
+++ b/drivers/nvme/target/fcloop.c
@@ -1607,6 +1607,11 @@ static void __exit fcloop_exit(void)
tport = __unlink_target_port(nport);
rport = __unlink_remote_port(nport);
+ if (!tport && !rport) {
+ pr_warn("%s: target port and remote port have already been freed\n", __func__);
+ break;
+ }
+
spin_unlock_irqrestore(&fcloop_lock, flags);
ret = __targetport_unreg(nport, tport);
but I was not able to trigger it anymore. My guess is
0cab4404874f ("nvme-fc: fix a missing queue put in nvmet_fc_ls_create_association")
fixed the problem.
Though I ran into a lockdep warning:
================================
WARNING: inconsistent lock state
6.2.0+ #4 Not tainted
--------------------------------
inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
ksoftirqd/1/23 [HC0[0]:SC1[1]:HE0:SE0] takes:
ffff88817e63a160 (&fq->mq_flush_lock){?.-.}-{2:2}, at: flush_end_io+0x69/0x460
{IN-HARDIRQ-W} state was registered at:
lock_acquire+0x128/0x270
_raw_spin_lock_irqsave+0x55/0x90
flush_end_io+0x69/0x460
__blk_mq_end_request+0x16c/0x190
virtblk_done+0x12e/0x220 [virtio_blk]
vring_interrupt+0xc0/0x140
__handle_irq_event_percpu+0x10c/0x230
handle_irq_event+0x4e/0xb0
handle_edge_irq+0x14a/0x3f0
__common_interrupt+0x8a/0x120
common_interrupt+0x7b/0x90
asm_common_interrupt+0x22/0x40
native_safe_halt+0xb/0x10
default_idle+0xa/0x10
default_idle_call+0x70/0xc0
do_idle+0x185/0x450
cpu_startup_entry+0x25/0x30
start_secondary+0xbd/0xc0
secondary_startup_64_no_verify+0xe6/0xeb
irq event stamp: 781843
hardirqs last enabled at (781842): [<ffffffff8e69f6a1>] seqcount_lockdep_reader_access+0x91/0xa0
hardirqs last disabled at (781843): [<ffffffff8f99ca0c>] _raw_spin_lock_irqsave+0x2c/0x90
softirqs last enabled at (781834): [<ffffffff8e52b143>] run_ksoftirqd+0x33/0x70
softirqs last disabled at (781839): [<ffffffff8e52b143>] run_ksoftirqd+0x33/0x70
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(&fq->mq_flush_lock);
<Interrupt>
lock(&fq->mq_flush_lock);
*** DEADLOCK ***
1 lock held by ksoftirqd/1/23:
#0: ffff88817e63a160 (&fq->mq_flush_lock){?.-.}-{2:2}, at: flush_end_io+0x69/0x460
stack backtrace:
CPU: 1 PID: 23 Comm: ksoftirqd/1 Kdump: loaded Not tainted 6.2.0+ #4 7ac3eece8c11633fa2815efb5a20b1c5b55011f5
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
Call Trace:
<TASK>
dump_stack_lvl+0x5a/0x80
print_usage_bug+0x6db/0x6f0
? print_lockdep_off+0x30/0x30
? pvclock_clocksource_read+0xec/0x180
? kvm_sched_clock_read+0x14/0x40
? sched_clock_cpu+0x6f/0x1f0
? rcu_read_lock_sched_held+0x34/0x70
? trace_irq_disable+0x47/0x110
? lockdep_hardirqs_on_prepare+0x17e/0x2e0
? mod_find+0x100/0x120
? nvmet_bio_done+0x37/0x180 [nvmet b6cfd0efddc3f6e687219ea59b5a92b951bb0ddd]
? nvmet_bio_done+0x37/0x180 [nvmet b6cfd0efddc3f6e687219ea59b5a92b951bb0ddd]
? __module_address+0xe1/0x100
? ftrace_graph_ret_addr+0x1d/0xa0
? orc_find+0x164/0x4f0
? pvclock_clocksource_read+0xec/0x180
? __update_load_avg_cfs_rq+0x34a/0x500
? kvm_sched_clock_read+0x14/0x40
? pvclock_clocksource_read+0xec/0x180
? kvm_sched_clock_read+0x14/0x40
? sched_clock_cpu+0x6f/0x1f0
? pvclock_clocksource_read+0xec/0x180
? mod_find+0x100/0x120
? nvmet_bio_done+0x37/0x180 [nvmet b6cfd0efddc3f6e687219ea59b5a92b951bb0ddd]
? nvmet_bio_done+0x37/0x180 [nvmet b6cfd0efddc3f6e687219ea59b5a92b951bb0ddd]
? __module_address+0xe1/0x100
? ftrace_graph_ret_addr+0x1d/0xa0
? orc_find+0x164/0x4f0
? unwind_next_frame+0xd55/0xd90
? ret_from_fork+0x1f/0x30
? stack_trace_save+0x130/0x130
mark_lock_irq+0x7d0/0x7e0
? save_trace+0x4c0/0x4c0
? stack_trace_save+0xe0/0x130
? stack_trace_snprint+0xe0/0xe0
? lock_is_held_type+0x8b/0x110
? fcloop_fcp_op+0x92/0x310 [nvme_fcloop 103a43b51b136ed1b9f0be7fbf6a4787d9fa52e2]
? save_trace+0x447/0x4c0
mark_lock+0x19c/0x220
lockdep_hardirqs_on_prepare+0x169/0x2e0
trace_hardirqs_on+0x34/0x40
_raw_spin_unlock_irq+0x24/0x40
fcloop_fcp_op+0x92/0x310 [nvme_fcloop 103a43b51b136ed1b9f0be7fbf6a4787d9fa52e2]
? fcloop_h2t_xmt_ls_rsp+0x150/0x150 [nvme_fcloop 103a43b51b136ed1b9f0be7fbf6a4787d9fa52e2]
nvmet_fc_fcp_nvme_cmd_done+0x325/0x480 [nvmet_fc 14b6afa92b89f97b59495a77b43c19ce2c6642ff]
__nvmet_req_complete+0x20e/0x270 [nvmet b6cfd0efddc3f6e687219ea59b5a92b951bb0ddd]
? nvmet_req_complete+0x30/0x30 [nvmet b6cfd0efddc3f6e687219ea59b5a92b951bb0ddd]
nvmet_req_complete+0xe/0x30 [nvmet b6cfd0efddc3f6e687219ea59b5a92b951bb0ddd]
nvmet_bio_done+0x38/0x180 [nvmet b6cfd0efddc3f6e687219ea59b5a92b951bb0ddd]
blk_update_request+0x343/0x6a0
blk_mq_end_request+0x22/0x40
blk_flush_complete_seq+0x22b/0x530
flush_end_io+0x424/0x460
__blk_mq_end_request+0x16c/0x190
blk_done_softirq+0x95/0xb0
__do_softirq+0x14b/0x369
? run_ksoftirqd+0x33/0x70
? ksoftirqd_should_run+0x20/0x20
run_ksoftirqd+0x33/0x70
smpboot_thread_fn+0x294/0x390
? cpu_report_death+0x170/0x170
kthread+0x16d/0x190
? kthread_blkcg+0x60/0x60
ret_from_fork+0x1f/0x30
</TASK>
It looks similar to the reports/fixes from here:
b3c6a5997541 ("block: Fix a lockdep complaint triggered by request queue flushing")
fb01a2932e81 ("blk-mq: add new API of blk_mq_hctx_set_fq_lock_class")
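For comparison, this is roughly the pattern nvme-loop uses with the API from
fb01a2932e81 to give each hw queue's flush queue its own lockdep class (sketch
only, names are made up; whether the fcloop/nvmet-fc path wants the same
treatment is an open question):

#include <linux/blk-mq.h>

static struct lock_class_key demo_hctx_fq_lock_key;

static int demo_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
			  unsigned int hctx_idx)
{
	/*
	 * Put this hctx's flush queue mq_flush_lock into its own
	 * lockdep class, so the end_io recursion into the stacked
	 * device's flush path is not reported as a self-deadlock.
	 */
	blk_mq_hctx_set_fq_lock_class(hctx, &demo_hctx_fq_lock_key);
	return 0;
}

The function would be wired up as .init_hctx in the stacked driver's
blk_mq_ops, which is how nvme-loop applies it.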
Daniel
^ permalink raw reply [flat|nested] 5+ messages in thread