public inbox for linux-nvme@lists.infradead.org
* [PATCH] nvme: fc: stop lsrcv workqueue before freeing a rport
@ 2025-11-14 15:07 Aristeu Rozanski
       [not found] ` <CAGtn9rmSq9+6o1-=tQwYGRKRUSOXkSJnCdSosNnpW0BvnxaNLQ@mail.gmail.com>
  0 siblings, 1 reply; 12+ messages in thread
From: Aristeu Rozanski @ 2025-11-14 15:07 UTC (permalink / raw)
  To: linux-nvme
  Cc: Justin Tee, Naresh Gottumukkala, Paul Ely, Keith Busch,
	Jens Axboe, Christoph Hellwig, Sagi Grimberg

While running blktests/nvme internally we hit a race in which an lsrcv
work item is still pending while the rport gets removed. Strangely, this
is far easier to reproduce on s390x than on any other architecture we
test.

	[12996.173074] nvme nvme5: NVME-FC{5}: controller connectivity lost. Awaiting Reconnect
	[12996.173446] Unable to handle kernel pointer dereference in virtual kernel address space
	[12996.173449] Failing address: e64d52f26d80e000 TEID: e64d52f26d80e803
	[12996.173452] Fault in home space mode while using kernel ASCE.
	[12996.173455] AS:0000000195920007 R3:0000000000000024
	[12996.173560] Oops: 0038 ilc:2 [#1]SMP
	[12996.173566] Modules linked in: nvme_fcloop nvmet_fc nvmet nvme_fc nvme_fabrics nvme nvme_core nvme_keyring nvme_auth sunrpc rfkill virtio_gpu virtio_dma_buf drm_client_lib drm_shmem_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm_kms_helper virtio_net net_failover fb failover virtio_input vfio_ccw mdev vfio_iommu_type1 vfio iommufd drm drm_panel_orientation_quirks font i2c_core fuse loop nfnetlink vsock_loopback vmw_vsock_virtio_transport_common vsock ctcm fsm qeth ccwgroup zfcp scsi_transport_fc qdio dasd_fba_mod dasd_eckd_mod dasd_mod xfs ghash_s390 prng virtio_blk des_s390 libdes sha3_512_s390 sha3_256_s390 sha_common dm_mirror dm_region_hash dm_log dm_mod paes_s390 crypto_engine pkey_cca pkey_ep11 zcrypt pkey_pckmo pkey aes_s390 [last unloaded: nvmet]
	[12996.173639] CPU: 1 UID: 0 PID: 633486 Comm: kworker/1:31 Kdump: loaded Not tainted 6.18.0-0.rc0.53c18dc078bb.1.RHEL100912.el10_1.s390x #1 NONE
	[12996.173644] Hardware name: IBM 8561 LT1 400 (KVM/Linux)
	[12996.173646] Workqueue: events nvme_fc_handle_ls_rqst_work [nvme_fc]
	[12996.173661] Krnl PSW : 0704c00180000000 e64d52f26d80e69e (0xe64d52f26d80e69e)
	[12996.173685]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
	[12996.173688] Krnl GPRS: 0000000000000000 e64d52f26d80e69e 00000001058e9830 00000000874f8800
	[12996.173691]            00000001058e9830 0000017ed7b593d0 0000000000000000 00000000874f8868
	[12996.173693]            00000000ec994000 00000000874f8800 0000000000000000 00000001058e9830
	[12996.173695]            0000000084ef2a00 0000000000000000 0000017ed7b594dc 000000ff5884fbb8
	[12996.173700] Krnl Code: Bad PSW.
	[12996.173702] Call Trace:
	[12996.173703]  [<e64d52f26d80e69e>] 0xe64d52f26d80e69e
	[12996.173707]  [<0000017ed7b32f40>] nvme_fc_xmt_ls_rsp+0x60/0xc0 [nvme_fc]
	[12996.173713]  [<0000017ed7b34218>] nvme_fc_handle_ls_rqst_work+0x138/0x240 [nvme_fc]
	[12996.173719]  [<0000017f5702b8cc>] process_one_work+0x1bc/0x3e0
	[12996.173730]  [<0000017f5702c53e>] worker_thread+0x23e/0x440
	[12996.173734]  [<0000017f5703830c>] kthread+0x12c/0x280
	[12996.173738]  [<0000017f56fb1d6c>] __ret_from_fork+0x3c/0x140
	[12996.173744]  [<0000017f57af236a>] ret_from_fork+0xa/0x30
	[12996.173750] Last Breaking-Event-Address:
	[12996.173750]  [<0000017ed7b594da>] fcloop_t2h_xmt_ls_rsp+0x10a/0x140 [nvme_fcloop]
	[12996.173757] Kernel panic - not syncing: Fatal exception: panic_on_oops

Cc: Justin Tee <justin.tee@broadcom.com>
Cc: Naresh Gottumukkala <nareshgottumukkala83@gmail.com>
Cc: Paul Ely <paul.ely@broadcom.com>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
---
 drivers/nvme/host/fc.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index 03987f497a5b5..1762f62ebf820 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -518,6 +518,8 @@ nvme_fc_free_rport(struct kref *ref)
 			localport_to_lport(rport->remoteport.localport);
 	unsigned long flags;
 
+	cancel_work_sync(&rport->lsrcv_work);
+
 	WARN_ON(rport->remoteport.port_state != FC_OBJSTATE_DELETED);
 	WARN_ON(!list_empty(&rport->ctrl_list));
 




* Re: [PATCH] nvme: fc: stop lsrcv workqueue before freeing a rport
       [not found] ` <CAGtn9rmSq9+6o1-=tQwYGRKRUSOXkSJnCdSosNnpW0BvnxaNLQ@mail.gmail.com>
@ 2025-11-14 18:42   ` Aristeu Rozanski
  2025-11-15  0:56     ` Justin Tee
  2025-11-15  3:56   ` Aristeu Rozanski
  1 sibling, 1 reply; 12+ messages in thread
From: Aristeu Rozanski @ 2025-11-14 18:42 UTC (permalink / raw)
  To: Ewan Milne
  Cc: linux-nvme, Justin Tee, Naresh Gottumukkala, Paul Ely,
	Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg

Hi Ewan,

On Fri, Nov 14, 2025 at 12:47:14PM -0500, Ewan Milne wrote:
> OK, but it seems to me that if lsrcv_work is pending then lsrcv_list likely
> has entries on it.
> So does this fix the whole problem?
> 
> Could you maybe add WARN_ON(!list_empty(&rport->ls_rcv_list)); so we'll
> find out?

Since I can reproduce it easily (usually in less than an hour), I'll add the
WARN_ON and double-check whether it triggers before the bug reproduces.
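
Roughly something like this, right next to the cancel_work_sync() from the
patch (just a sketch; I'm taking the field name ls_rcv_list from your
suggestion, so treat that as an assumption):

	cancel_work_sync(&rport->lsrcv_work);

	/* suggested sanity check: nothing should still be sitting on the
	 * LS receive list when the rport is freed */
	WARN_ON(!list_empty(&rport->ls_rcv_list));

	WARN_ON(rport->remoteport.port_state != FC_OBJSTATE_DELETED);
	WARN_ON(!list_empty(&rport->ctrl_list));
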
Thanks

-- 
Aristeu




* Re: [PATCH] nvme: fc: stop lsrcv workqueue before freeing a rport
  2025-11-14 18:42   ` Aristeu Rozanski
@ 2025-11-15  0:56     ` Justin Tee
  2025-11-15  3:50       ` Aristeu Rozanski
  0 siblings, 1 reply; 12+ messages in thread
From: Justin Tee @ 2025-11-15  0:56 UTC (permalink / raw)
  To: Aristeu Rozanski
  Cc: linux-nvme, Ewan Milne, Justin Tee, Naresh Gottumukkala, Paul Ely,
	Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg

Hi Aristeu,

Is it possible to share the sequence of blktest/nvme tests that were 
executed that resulted in this call trace?  Or, was there a specific 
test case number that results in this call trace frequently?

Regards,
Justin



* Re: [PATCH] nvme: fc: stop lsrcv workqueue before freeing a rport
  2025-11-15  0:56     ` Justin Tee
@ 2025-11-15  3:50       ` Aristeu Rozanski
  2025-11-17 23:17         ` Justin Tee
  0 siblings, 1 reply; 12+ messages in thread
From: Aristeu Rozanski @ 2025-11-15  3:50 UTC (permalink / raw)
  To: Justin Tee
  Cc: linux-nvme, Ewan Milne, Justin Tee, Naresh Gottumukkala, Paul Ely,
	Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg

Hi Justin,

On Fri, Nov 14, 2025 at 04:56:47PM -0800, Justin Tee wrote:
> Is it possible to share the sequence of blktest/nvme tests that were
> executed that resulted in this call trace?  Or, was there a specific test
> case number that results in this call trace frequently?

There's nothing special about the test itself; I just run every test under
blktests/nvme in a loop.

But like I said, I never managed to reproduce this on x86_64 and we had no
reports on aarch64 or ppc64le. Our s390x "machines" have 2 CPUs; maybe
lowering the number of CPUs would make it more likely?

-- 
Aristeu




* Re: [PATCH] nvme: fc: stop lsrcv workqueue before freeing a rport
       [not found] ` <CAGtn9rmSq9+6o1-=tQwYGRKRUSOXkSJnCdSosNnpW0BvnxaNLQ@mail.gmail.com>
  2025-11-14 18:42   ` Aristeu Rozanski
@ 2025-11-15  3:56   ` Aristeu Rozanski
       [not found]     ` <CAGtn9r=n024waWZekDMwSRAM+JM13FYWhXjPuA18WyO6Kcmnyw@mail.gmail.com>
  1 sibling, 1 reply; 12+ messages in thread
From: Aristeu Rozanski @ 2025-11-15  3:56 UTC (permalink / raw)
  To: Ewan Milne
  Cc: linux-nvme, Justin Tee, Naresh Gottumukkala, Paul Ely,
	Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg

Hi Ewan,
On Fri, Nov 14, 2025 at 12:47:14PM -0500, Ewan Milne wrote:
> Could you maybe add WARN_ON(!list_empty(&rport->ls_rcv_list)); so we'll
> find out?

Reproduced twice with just the WARN_ON added; it never triggered, and
interestingly both crashes were the same variation, with the workqueue
hitting a list corruption:

	[ 4326.096413] nvmet: Created discovery controller 2 for subsystem nqn.2014-08.org.nvmexpress.discovery for NQN nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349.
	[ 4326.097073] nvme nvme1: NVME-FC{1}: controller connect complete
	[ 4326.097078] nvme nvme1: NVME-FC{1}: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery", hostnqn: nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349
	[ 4326.097570] nvme nvme1: Removing ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery"
	[ 4326.110785] list_del corruption, 00000000f88351f8->next is NULL
	[ 4326.110813] ------------[ cut here ]------------
	[ 4326.110814] kernel BUG at lib/list_debug.c:52!
	[ 4326.110893] monitor event: 0040 ilc:2 [#1]SMP
	[ 4326.110899] Modules linked in: nvme_fcloop nvmet_fc nvmet nvme_fc nvme_fabrics nvme nvme_core nvme_keyring nvme_auth sunrpc rfkill virtio_gpu virtio_dma_buf drm_client_lib drm_shmem_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm_kms_helper fb virtio_net virtio_input net_failover failover vfio_ccw mdev vfio_iommu_type1 vfio iommufd drm fuse drm_panel_orientation_quirks font loop i2c_core nfnetlink vsock_loopback vmw_vsock_virtio_transport_common vsock ctcm fsm qeth ccwgroup zfcp scsi_transport_fc qdio dasd_fba_mod dasd_eckd_mod dasd_mod xfs ghash_s390 prng des_s390 libdes sha3_512_s390 sha3_256_s390 virtio_blk sha_common dm_mirror dm_region_hash dm_log dm_mod paes_s390 crypto_engine pkey_cca pkey_ep11 zcrypt pkey_pckmo pkey aes_s390 [last unloaded: nvmet]
	[ 4326.110958] CPU: 0 UID: 0 PID: 164768 Comm: kworker/u9:5 Kdump: loaded Not tainted 6.18.0-0.rc0.53c18dc078bb.1.RHEL100912.el10.s390x #1 NONE
	[ 4326.110962] Hardware name: IBM 8561 LT1 400 (KVM/Linux)
	[ 4326.110964] Workqueue: \x98 0x0 (nvmet-wq)
	[ 4326.110980] Krnl PSW : 0404e00180000000 000001e1297369f2 (__list_del_entry_valid_or_report+0x112/0x130)
	[ 4326.110987]            R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
	[ 4326.110989] Krnl GPRS: 0000000000000030 0000000000000000 0000000000000033 000001e12a9136e8
	[ 4326.110992]            00000001fe51d000 0000000000000000 0000000000000000 fffffffffffffff8
	[ 4326.110993]            0000000080238028 0000000000000000 0000000000000000 00000000f88351f8
	[ 4326.110995]            00000001016f6900 0000000000000000 000001e1297369ee 000001612a8cbca8
	[ 4326.111003] Krnl Code: 000001e1297369e2: c02000441908        larl    %r2,000001e129fb9bf2
		   000001e1297369e8: c0e5ffcb7e58       brasl   %r14,000001e1290a6698
		  #000001e1297369ee: af000000           mc      0,0
		  >000001e1297369f2: b9040032           lgr     %r3,%r2
		   000001e1297369f6: c02000441913       larl    %r2,000001e129fb9c1c
		   000001e1297369fc: c0e5ffcb7e4e       brasl   %r14,000001e1290a6698
		   000001e129736a02: af000000           mc      0,0
		   000001e129736a06: 0707               bcr     0,%r7
	[ 4326.111015] Call Trace:
	[ 4326.111017]  [<000001e1297369f2>] __list_del_entry_valid_or_report+0x112/0x130
	[ 4326.111019] ([<000001e1297369ee>] __list_del_entry_valid_or_report+0x10e/0x130)
	[ 4326.111022]  [<000001e129130d58>] move_linked_works+0x68/0xe0
	[ 4326.111027]  [<000001e129134554>] worker_thread+0x1f4/0x440
	[ 4326.111030]  [<000001e12914036c>] kthread+0x12c/0x280
	[ 4326.111034]  [<000001e1290b9d6c>] __ret_from_fork+0x3c/0x140
	[ 4326.111038]  [<000001e129bfa11a>] ret_from_fork+0xa/0x30
	[ 4326.111042] Last Breaking-Event-Address:
	[ 4326.111043]  [<000001e1290a66e4>] _printk+0x4c/0x58
	[ 4326.111048] Kernel panic - not syncing: Fatal exception: panic_on_oops

Something I did find interesting and saw many times, which might help:

	[ 4325.112850] nvme nvme0: qid 0: authenticated
	[ 4325.113206] nvme nvme0: NVME-FC{0}: controller connect complete
	[ 4325.113745] nvme nvme0: NVME-FC{0}: new ctrl: NQN "blktests-subsystem-1", hostnqn: nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349
	[ 4325.129993] (NULL device *): {0:0} Association freed
	[ 4325.130010] (NULL device *): Disconnect LS failed: No Association
	[ 4325.178364] nvme nvme0: Removing ctrl: NQN "blktests-subsystem-1"
	[ 4325.410043] (NULL device *): {0:1} Association deleted
	[ 4325.430596] nvme nvme0: NVME-FC{0}: create association : host wwpn 0x20001100aa000001  rport wwpn 0x20001100ab000001: NQN "nqn.2014-08.org.nvmexpress.discovery"
	[ 4325.430754] (NULL device *): queue 0 connect admin queue failed (-111).
	[ 4325.430758] nvme nvme0: NVME-FC{0}: reset: Reconnect attempt failed (-111)
	[ 4325.430761] nvme nvme0: NVME-FC{0}: Reconnect attempt in 2 seconds
	[ 4325.430767] nvme nvme0: NVME-FC{0}: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery", hostnqn: nqn.2014-08.org.nvmexpress:uuid:2543b704-63fd-45d1-bccd-32e33962e07b
	[ 4325.470264] (NULL device *): {0:1} Association freed
	[ 4325.470282] (NULL device *): Disconnect LS failed: No Association
	[ 4325.112825] nvme nvme0: qid 0: authenticated with hash hmac(sha512) dhgroup ffdhe8192

(the "NULL device *" part)

I agree my patch does look like it's only fixing the symptom, but I haven't
touched the nvme code before, so it's very possible I'm missing the bigger
picture.

-- 
Aristeu




* Re: [PATCH] nvme: fc: stop lsrcv workqueue before freeing a rport
       [not found]     ` <CAGtn9r=n024waWZekDMwSRAM+JM13FYWhXjPuA18WyO6Kcmnyw@mail.gmail.com>
@ 2025-11-17  5:06       ` Aristeu Rozanski
  2025-11-18 14:45       ` Aristeu Rozanski
  1 sibling, 0 replies; 12+ messages in thread
From: Aristeu Rozanski @ 2025-11-17  5:06 UTC (permalink / raw)
  To: Ewan Milne
  Cc: linux-nvme, Justin Tee, Naresh Gottumukkala, Paul Ely,
	Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg

On Sun, Nov 16, 2025 at 04:51:30PM -0500, Ewan Milne wrote:
> Try running the test with the kernel parameter "slub_debug=FZPU" and see
> if you get a more consistent failure.  This will poison the memory with a
> specific pattern when the kfree() frees the object; a subsequent list
> debug message or stack trace with registers may show the pattern.

Will do.

> I suspect it is just very sensitive to the timing.

It is. I was trying to manually poison the object (memset + not freeing) with
different patterns, and that alone was enough to make it stop reproducing.
Will try the parameter anyway.

> My point was that doing a cancel_work() will prevent freeing the object
> with a work item queued, but if there is something on the associated list
> it would I think result in a memory leak (and possibly the DMA mapping,
> I'll need to look at it more closely).
> 
> The fix is valuable in any case.

Thanks

-- 
Aristeu




* Re: [PATCH] nvme: fc: stop lsrcv workqueue before freeing a rport
  2025-11-15  3:50       ` Aristeu Rozanski
@ 2025-11-17 23:17         ` Justin Tee
  2025-12-03  1:36           ` Justin Tee
  0 siblings, 1 reply; 12+ messages in thread
From: Justin Tee @ 2025-11-17 23:17 UTC (permalink / raw)
  To: Aristeu Rozanski
  Cc: linux-nvme, Ewan Milne, Justin Tee, Naresh Gottumukkala, Paul Ely,
	Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg

Hi Aristeu,

> But like I said, I never managed to reproduce this on x86_64 and we had no
> reports on aarch64 or ppc64le. Our s390x "machines" have 2 CPUs; maybe
> lowering the number of CPUs would make it more likely?

Thanks for sharing this information.  I have logically offlined the
CPUs on my x86_64 system down to 2 CPUs and have reproduced the issue
on x86_64.

Broadcom will have a deeper look into the call trace and will report back.

Regards,
Justin



* Re: [PATCH] nvme: fc: stop lsrcv workqueue before freeing a rport
       [not found]     ` <CAGtn9r=n024waWZekDMwSRAM+JM13FYWhXjPuA18WyO6Kcmnyw@mail.gmail.com>
  2025-11-17  5:06       ` Aristeu Rozanski
@ 2025-11-18 14:45       ` Aristeu Rozanski
  1 sibling, 0 replies; 12+ messages in thread
From: Aristeu Rozanski @ 2025-11-18 14:45 UTC (permalink / raw)
  To: Ewan Milne
  Cc: linux-nvme, Justin Tee, Naresh Gottumukkala, Paul Ely,
	Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg

On Sun, Nov 16, 2025 at 04:51:30PM -0500, Ewan Milne wrote:
> Try running the test with the kernel parameter "slub_debug=FZPU" and see
> if you get a more consistent failure.  This will poison the memory with a
> specific pattern when the kfree() frees the object; a subsequent list
> debug message or stack trace with registers may show the pattern.
> 
> I suspect it is just very sensitive to the timing.

Turns out it did make reproducing this faster.
Thanks

-- 
Aristeu




* Re: [PATCH] nvme: fc: stop lsrcv workqueue before freeing a rport
  2025-11-17 23:17         ` Justin Tee
@ 2025-12-03  1:36           ` Justin Tee
       [not found]             ` <CAGtn9rmQW9w42oMVGpXZB4OuifwF3XzgVaxuHyXN1aYpzjRskg@mail.gmail.com>
  0 siblings, 1 reply; 12+ messages in thread
From: Justin Tee @ 2025-12-03  1:36 UTC (permalink / raw)
  To: Aristeu Rozanski, Ewan Milne
  Cc: linux-nvme, Justin Tee, Naresh Gottumukkala, Paul Ely,
	Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg

> Broadcom will have a deeper look into the call trace and will report back.
A status update: Broadcom does not quite agree with the proposed patch
yet, specifically because the crash occurs in target/fcloop.c and not
host/fc.c:

    [12996.173750] Last Breaking-Event-Address:
    [12996.173750]  [<0000017ed7b594da>] fcloop_t2h_xmt_ls_rsp+0x10a/0x140 [nvme_fcloop]
    [12996.173757] Kernel panic - not syncing: Fatal exception: panic_on_oops

If anything, the suggested cancel_work_sync(&rport->lsrcv_work) in
host/fc.c is delaying or avoiding the real issue.  And, I believe the
issue is related to the following commit:
“10c165af35d2 nvmet-fcloop: call done callback even when remote port is gone”

The panic occurs exactly during the lsrsp->done(lsrsp) call.  Are we
sure lsrsp is guaranteed to be valid when targetport is gone?
Broadcom will investigate the target/fcloop.c module some more and
report back again.

Regards,
Justin



* Re: [PATCH] nvme: fc: stop lsrcv workqueue before freeing a rport
       [not found]             ` <CAGtn9rmQW9w42oMVGpXZB4OuifwF3XzgVaxuHyXN1aYpzjRskg@mail.gmail.com>
@ 2025-12-04  1:44               ` Justin Tee
  2025-12-04  3:59                 ` Aristeu Rozanski
  0 siblings, 1 reply; 12+ messages in thread
From: Justin Tee @ 2025-12-04  1:44 UTC (permalink / raw)
  To: Ewan Milne
  Cc: Aristeu Rozanski, linux-nvme, Justin Tee, Naresh Gottumukkala,
	Paul Ely, Keith Busch, Jens Axboe, Christoph Hellwig,
	Sagi Grimberg

> Aristeau's original stack trace was in nvme-fc (Initiator) though, I think.
Kind of sort of, the Last Breaking-Event-Address is on the target side
(target/fcloop.c):

[12996.173750] Last Breaking-Event-Address:
[12996.173750]  [<0000017ed7b594da>] fcloop_t2h_xmt_ls_rsp+0x10a/0x140 [nvme_fcloop]
[12996.173757] Kernel panic - not syncing: Fatal exception: panic_on_oops

It just so happens the intended lsrsp->done(lsrsp) call is trying to
reach into nvme_fc (initiator) code because done is supposed to be set
to nvme_fc_xmt_ls_rsp_done.

Additionally, consistent reproduction on an x86_64 system yields a call
trace like the one below:

Oops: general protection fault, probably for non-canonical address 0xd3ea351027e1e8d8: 0000 [#1] SMP NOPTI

Workqueue: events nvme_fc_handle_ls_rqst_work [nvme_fc]
RIP: 0010:fcloop_t2h_xmt_ls_rsp+0xca/0x130 [nvme_fcloop]
Code: 8b 72 18 41 b8 f0 01 00 00 48 c7 c1 40 d0 a8 c0 48 c7 c2 fa d2
a8 c0 48 c7 c7 50 d4 a8 c0 e8 8d 58 7b f3 48 8b 43 18 48 89 df <ff> d0
0f 1f 00 b9 fd 01 00 00 48 c7 c2 40 d0 a8 c0 48 c7 c6 fa d2
RSP: 0018:ff50b14f0a9abd88 EFLAGS: 00010246
RAX: d3ea351027e1e8d8 RBX: ff2ab922d2016ef0 RCX: 0000000000000027
RDX: 0000000000000000 RSI: 0000000000000001 RDI: ff2ab922d2016ef0
RBP: 0000000000000000 R08: 0000000000000000 R09: ff50b14f0a9abc40
R10: ff50b14f0a9abc38 R11: ff2ab9323f9c6328 R12: ff2ab922d79bc000
R13: ff2ab922c59c0800 R14: 0800000002000000 R15: ff2ab922d79bc098
FS:  0000000000000000(0000) GS:ff2ab92a687fc000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055fecf7f6a60 CR3: 00000001116c0005 CR4: 0000000000773ef0
PKRU: 55555554
Call Trace:
 <TASK>
 nvme_fc_xmt_ls_rsp+0x46/0x90 [nvme_fc deaa37bfaf3bed135f3fe0a56933d272bbdc0340]
 nvme_fc_handle_ls_rqst_work+0xd0/0x650 [nvme_fc deaa37bfaf3bed135f3fe0a56933d272bbdc0340]
 process_one_work+0x18e/0x3c0
 worker_thread+0x29d/0x3c0
 kthread+0xfc/0x210
 ret_from_fork+0x197/0x1d0
 ret_from_fork_asm+0x1a/0x30
 </TASK>


What’s happening is that rport->remoteport.port_state is not set to
FC_OBJSTATE_ONLINE, so nvme_fc_handle_ls_rqst never sets the
lsrsp->done function pointer to nvme_fc_xmt_ls_rsp_done.  So when
target/fcloop.c calls lsrsp->done, it crashes.
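
To put that sequence in simplified pseudocode (this is only a restatement of
the description above, not the actual control flow in host/fc.c or
target/fcloop.c):

	/* host side, while handling the received LS (simplified) */
	if (rport->remoteport.port_state == FC_OBJSTATE_ONLINE)
		lsrsp->done = nvme_fc_xmt_ls_rsp_done;	/* only wired up when online */

	/* target side (fcloop), completing the LS response (simplified) */
	lsrsp->done(lsrsp);	/* never assigned above -> call through a bogus pointer */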

I have a patch in mind to resolve this, and will report back.



* Re: [PATCH] nvme: fc: stop lsrcv workqueue before freeing a rport
  2025-12-04  1:44               ` Justin Tee
@ 2025-12-04  3:59                 ` Aristeu Rozanski
  2025-12-04 19:46                   ` Justin Tee
  0 siblings, 1 reply; 12+ messages in thread
From: Aristeu Rozanski @ 2025-12-04  3:59 UTC (permalink / raw)
  To: Justin Tee
  Cc: Ewan Milne, linux-nvme, Justin Tee, Naresh Gottumukkala, Paul Ely,
	Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg

On Wed, Dec 03, 2025 at 05:44:14PM -0800, Justin Tee wrote:
> I have a patch in mind to resolve this, and will report back.

FWIW, test kernel with the revert still running since this morning (~13h), no
crashes.

-- 
Aristeu




* Re: [PATCH] nvme: fc: stop lsrcv workqueue before freeing a rport
  2025-12-04  3:59                 ` Aristeu Rozanski
@ 2025-12-04 19:46                   ` Justin Tee
  0 siblings, 0 replies; 12+ messages in thread
From: Justin Tee @ 2025-12-04 19:46 UTC (permalink / raw)
  To: Aristeu Rozanski
  Cc: Ewan Milne, linux-nvme, Justin Tee, Naresh Gottumukkala, Paul Ely,
	Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg

> FWIW, test kernel with the revert still running since this morning (~13h), no
> crashes.
Thanks for confirming, Aristeu.  I’ll be raising a patch for review very soon.

Regards,
Justin


