All of lore.kernel.org
 help / color / mirror / Atom feed
* NetApp Snapmirror Active Sync NVMe target and Linux initiator: Paths disappearing after a power failure of one NetApp a150 system
@ 2025-09-18 16:09 Thomas Glanzmann
  0 siblings, 0 replies; only message in thread
From: Thomas Glanzmann @ 2025-09-18 16:09 UTC (permalink / raw)
  To: linux-nvme

Hello,
today I tried to use two NetApp AFF A150 which consists of two
controllers and 8 disks each with NVMe/TCP. The idea is that you have 2
paths to each controller via two VLANs. If a controller or A150 fails the other
one should take over. That worked, however when the manually power fenced
system came back, I saw that the paths to it where 'chaining' and than
disappearing. However, I could manually reconnect the paths.
I assume that the NVME target of the powerfenced system came up without the
full configuration loaded.  As a result the Linux NVMe initiator droped the
paths.  Is that correct or is it a bug in the Linux NVMe stack. I used Debian
trixie (6.12.43+deb13-cloud-amd64).

- Before I had 8 paths:

(debian-05) [~] nvme list-subsys /dev/nvme4n1
nvme-subsys4 - NQN=nqn.1992-08.com.netapp:sn.347a30cc947511f083a6d039ea2800fc:subsystem.subsystem
               hostnqn=nqn.2014-08.org.nvmexpress:uuid:a44b1e42-29ec-1c09-c043-741282002d9c
\
 +- nvme10 tcp traddr=10.0.10.238,trsvcid=4420,src_addr=10.0.10.25 live non-optimized
 +- nvme11 tcp traddr=10.0.20.238,trsvcid=4420,src_addr=10.0.20.25 live non-optimized
 +- nvme4 tcp traddr=10.0.10.235,trsvcid=4420,src_addr=10.0.10.25 live optimized
 +- nvme5 tcp traddr=10.0.20.235,trsvcid=4420,src_addr=10.0.20.25 live optimized
 +- nvme6 tcp traddr=10.0.10.236,trsvcid=4420,src_addr=10.0.10.25 live non-optimized
 +- nvme7 tcp traddr=10.0.20.236,trsvcid=4420,src_addr=10.0.20.25 live non-optimized
 +- nvme8 tcp traddr=10.0.10.237,trsvcid=4420,src_addr=10.0.10.25 live non-optimized
 +- nvme9 tcp traddr=10.0.20.237,trsvcid=4420,src_addr=10.0.20.25 live non-optimized

(debian-05) [~] mount /dev/nvme4n1 /mnt
(debian-05) [~] cd /mnt/
(debian-05) [/mnt] while true; do date | tee -a date.log; sync; sleep 1; done
...
Thu Sep 18 16:04:49 CEST 2025
Thu Sep 18 16:04:50 CEST 2025
# 10 seconds for the path failover
Thu Sep 18 16:05:00 CEST 2025
Thu Sep 18 16:05:01 CEST 2025
...

- While the first AFF A150 was offline:

(debian-05) [/mnt] nvme list-subsys /dev/nvme4n1
nvme-subsys4 - NQN=nqn.1992-08.com.netapp:sn.347a30cc947511f083a6d039ea2800fc:subsystem.subsystem
               hostnqn=nqn.2014-08.org.nvmexpress:uuid:a44b1e42-29ec-1c09-c043-741282002d9c
\
 +- nvme10 tcp traddr=10.0.10.238,trsvcid=4420,src_addr=10.0.10.25 live non-optimized
 +- nvme11 tcp traddr=10.0.20.238,trsvcid=4420,src_addr=10.0.20.25 live non-optimized
 +- nvme4 tcp traddr=10.0.10.235,trsvcid=4420 connecting optimized
 +- nvme5 tcp traddr=10.0.20.235,trsvcid=4420 connecting optimized
 +- nvme6 tcp traddr=10.0.10.236,trsvcid=4420 connecting non-optimized
 +- nvme7 tcp traddr=10.0.20.236,trsvcid=4420 connecting non-optimized
 +- nvme8 tcp traddr=10.0.10.237,trsvcid=4420,src_addr=10.0.10.25 live non-optimized
 +- nvme9 tcp traddr=10.0.20.237,trsvcid=4420,src_addr=10.0.20.25 live non-optimized

- Than I don't have the output because I ran it in watch, it said 'changing' for the failed paths.

- And than the paths disappeared completely and did not come back.

(debian-05) [/mnt] nvme list-subsys /dev/nvme4n1
nvme-subsys4 - NQN=nqn.1992-08.com.netapp:sn.347a30cc947511f083a6d039ea2800fc:subsystem.subsystem
               hostnqn=nqn.2014-08.org.nvmexpress:uuid:a44b1e42-29ec-1c09-c043-741282002d9c
\
 +- nvme10 tcp traddr=10.0.10.238,trsvcid=4420,src_addr=10.0.10.25 live non-optimized
 +- nvme11 tcp traddr=10.0.20.238,trsvcid=4420,src_addr=10.0.20.25 live non-optimized
 +- nvme8 tcp traddr=10.0.10.237,trsvcid=4420,src_addr=10.0.10.25 live non-optimized
 +- nvme9 tcp traddr=10.0.20.237,trsvcid=4420,src_addr=10.0.20.25 live non-optimized

- Than I manually reconnected:

(debian-05) [/mnt] nvme connect -t tcp -a 10.0.10.235 -s 4420 -n nqn.1992-08.com.netapp:sn.347a30cc947511f083a6d039ea2800fc:subsystem.subsystem -i 14
(debian-05) [/mnt] nvme connect -t tcp -a 10.0.20.235 -s 4420 -n nqn.1992-08.com.netapp:sn.347a30cc947511f083a6d039ea2800fc:subsystem.subsystem -i 14
(debian-05) [/mnt] nvme connect -t tcp -a 10.0.10.236 -s 4420 -n nqn.1992-08.com.netapp:sn.347a30cc947511f083a6d039ea2800fc:subsystem.subsystem -i 14
(debian-05) [/mnt] nvme connect -t tcp -a 10.0.20.236 -s 4420 -n nqn.1992-08.com.netapp:sn.347a30cc947511f083a6d039ea2800fc:subsystem.subsystem -i 14

(debian-05) [/mnt] nvme list-subsys /dev/nvme4n1
nvme-subsys4 - NQN=nqn.1992-08.com.netapp:sn.347a30cc947511f083a6d039ea2800fc:subsystem.subsystem
               hostnqn=nqn.2014-08.org.nvmexpress:uuid:a44b1e42-29ec-1c09-c043-741282002d9c
\
 +- nvme10 tcp traddr=10.0.10.238,trsvcid=4420,src_addr=10.0.10.25 live non-optimized
 +- nvme11 tcp traddr=10.0.20.238,trsvcid=4420,src_addr=10.0.20.25 live non-optimized
 +- nvme12 tcp traddr=10.0.20.236,trsvcid=4420,src_addr=10.0.20.25 live non-optimized
 +- nvme5 tcp traddr=10.0.10.235,trsvcid=4420,src_addr=10.0.10.25 live optimized
 +- nvme6 tcp traddr=10.0.20.235,trsvcid=4420,src_addr=10.0.20.25 live optimized
 +- nvme7 tcp traddr=10.0.10.236,trsvcid=4420,src_addr=10.0.10.25 live non-optimized
 +- nvme8 tcp traddr=10.0.10.237,trsvcid=4420,src_addr=10.0.10.25 live non-optimized
 +- nvme9 tcp traddr=10.0.20.237,trsvcid=4420,src_addr=10.0.20.25 live non-optimizedo

In dmesg I saw:

[90303.743154] nvme nvme6: Reconnecting in 10 seconds...
[90311.695617] nvme nvme4: Connect Invalid Data Parameter, subsysnqn "nqn.1992-08.com.netapp:sn.347a30cc947511f083a6d039ea2800fc:subsystem.subsystem"
[90311.695714] nvme nvme4: failed to connect queue: 0 ret=16770
[90311.695754] nvme nvme4: Failed reconnect attempt 1/-1
[90311.695775] nvme nvme5: Connect Invalid Data Parameter, subsysnqn "nqn.1992-08.com.netapp:sn.347a30cc947511f083a6d039ea2800fc:subsystem.subsystem"
[90311.695789] nvme nvme4: Removing controller (16770)...
[90311.695876] nvme nvme5: failed to connect queue: 0 ret=16770
[90311.695909] nvme nvme4: Removing ctrl: NQN "nqn.1992-08.com.netapp:sn.347a30cc947511f083a6d039ea2800fc:subsystem.subsystem"

full dmesg: https://tg.st/u/297bb062eca5c9d05b533691ce98c2fffc31deafe45fe399d7492381b47313bb.txt

Cheers,
	Thomas


^ permalink raw reply	[flat|nested] only message in thread

only message in thread, other threads:[~2025-09-18 16:10 UTC | newest]

Thread overview: (only message) (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-09-18 16:09 NetApp Snapmirror Active Sync NVMe target and Linux initiator: Paths disappearing after a power failure of one NetApp a150 system Thomas Glanzmann

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.