Linux-NVME Archive on lore.kernel.org
From: Hannes Reinecke <hare@suse.de>
To: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>,
	Keith Busch <keith.busch@wdc.com>,
	linux-nvme@lists.infradead.org, Keith Busch <kbusch@kernel.org>,
	Daniel Wagner <daniel.wagner@suse.de>
Subject: Re: [PATCHv3] nvme-mpath: delete disk after last connection
Date: Mon, 10 May 2021 15:01:56 +0200	[thread overview]
Message-ID: <42bf4da0-c266-5413-8bac-949bdadd4024@suse.de> (raw)
In-Reply-To: <20210510062346.GA30116@lst.de>

On 5/10/21 8:23 AM, Christoph Hellwig wrote:
> On Fri, May 07, 2021 at 07:02:52PM +0200, Hannes Reinecke wrote:
>> On 5/7/21 8:46 AM, Christoph Hellwig wrote:
>>> On Thu, May 06, 2021 at 05:54:29PM +0200, Hannes Reinecke wrote:
>>>> PCI and fabrics have different defaults; for PCI the device goes away if
>>>> the last path (ie the controller) goes away, for fabrics it doesn't if the
>>>> device is mounted.
>>>
>>> Err, no.  For fabrics we reconnect a while, but otherwise the behavior
>>> is the same right now.
>>>
>> No, that is not the case.
>>
>> When a PCI nvme device with CMIC=0 is removed (via pci hotplug, say), the 
>> nvme device is completely removed, irrespective on whether it's mounted or 
>> not.
>> When the _same_ PCI device with CMIC=1 is removed, the nvme device (ie the 
>> nsnhead) will _stay_ when mounted (as the refcount is not zero).
> 
> Yes.  But that has nothing to do with fabrics as you claimed above, but
> with the fact if the subsystem supports multiple controller (and thus
> shared namespaces) or not.
> 
It's still broken, though.

I've set up a testbed to demonstrate what I mean.

I have created a qemu instance with 3 NVMe devices, one for booting and
two for MD RAID.
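
For reference, a minimal sketch of how the third controller can be
wired up on the qemu command line (the ids match the device_add
commands further down; the root-port properties and the image path
are assumptions):

qemu-system-x86_64 ... \
    -device pcie-root-port,id=rp90,chassis=3,slot=2 \
    -device nvme-subsys,id=subsys3,nqn=subsys3 \
    -drive file=/var/tmp/nvme-3.img,if=none,id=nvme-3 \
    -device nvme,bus=rp90,id=nvme-rp90,serial=SLESNVME3,subsys=subsys3 \
    -device nvme-ns,bus=nvme-rp90,drive=nvme-3,nsid=1

Attaching the controller to a shared subsystem is what makes qemu
report CMIC=1, so the namespace shows up as a multipath head.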
After boot, MD RAID says this:

# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 nvme2n1[1] nvme1n1[0]
      4189184 blocks super 1.2 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk
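
For completeness, the array was created in the usual way beforehand,
something like:

# mdadm --create /dev/md1 --level=1 --raid-devices=2 \
    /dev/nvme1n1 /dev/nvme2n1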


Now I detach the PCI device for /dev/nvme3 (its namespace being the
multipath head /dev/nvme2n1):

(qemu) device_del nvme-rp90
(qemu) [  183.512585] pcieport 0000:00:09.0: pciehp: Slot(0-2): Attention button pressed
[  183.515462] pcieport 0000:00:09.0: pciehp: Slot(0-2): Powering off due to button press

And validate that the device is gone:

# lspci
00:00.0 Host bridge: Intel Corporation 82G33/G31/P35/P31 Express DRAM Controller
00:01.0 VGA compatible controller: Device 1234:1111 (rev 02)
00:02.0 Ethernet controller: Red Hat, Inc. Virtio network device
00:07.0 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
00:08.0 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
00:09.0 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
00:1f.0 ISA bridge: Intel Corporation 82801IB (ICH9) LPC Interface Controller (rev 02)
00:1f.2 SATA controller: Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA Controller [AHCI mode] (rev 02)
00:1f.3 SMBus: Intel Corporation 82801I (ICH9 Family) SMBus Controller (rev 02)
01:00.0 Non-Volatile memory controller: Red Hat, Inc. QEMU NVM Express Controller (rev 02)
02:00.0 Non-Volatile memory controller: Red Hat, Inc. QEMU NVM Express Controller (rev 02)

Checking MD, I still get:
# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 nvme2n1[1] nvme1n1[0]
      4189184 blocks super 1.2 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

I.e. MD hasn't registered _anything_, even though the device is
physically not present anymore.
And to make matters worse, 'nvme list' says:

# nvme list
Node             SN                   Model                                    Namespace Usage                      Format           FW Rev
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1     SLESNVME1            QEMU NVMe Ctrl                           1          17.18  GB /  17.18  GB    512   B +  0 B   1.0
/dev/nvme1n1     SLESNVME2            QEMU NVMe Ctrl                           1           4.29  GB /   4.29  GB    512   B +  0 B   1.0
/dev/nvme2n1     |U                   b��||U                                   -1          0.00   B /   0.00   B      1   B +  0 B   ���||U

which arguably is a bug in itself, as we shouldn't display weird strings
here. But no matter.
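
Incidentally, one doesn't need the (stale) Identify data to see the
dangling head; with a current nvme-cli something like

# nvme list-subsys
# ls /sys/block/ | grep nvme

should show the subsystem for the removed controller without any
live path, while nvme2n1 still lingers in /sys/block.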

Now I'm reattaching the PCI device:

(qemu) device_add nvme,bus=rp90,id=nvme-rp90,subsys=subsys3
[   49.261163] pcieport 0000:00:09.0: pciehp: Slot(0-2): Attention button pressed
[   49.263915] pcieport 0000:00:09.0: pciehp: Slot(0-2): Powering on due to button press
[   49.267188] pcieport 0000:00:09.0: pciehp: Slot(0-2): Card present
[   49.269505] pcieport 0000:00:09.0: pciehp: Slot(0-2): Link Up
[   49.406035] pci 0000:03:00.0: [1b36:0010] type 00 class 0x010802
[   49.411585] pci 0000:03:00.0: reg 0x10: [mem 0x00000000-0x00003fff 64bit]
[   49.417627] pci 0000:03:00.0: BAR 0: assigned [mem 0xc1000000-0xc1003fff 64bit]
[   49.421057] pcieport 0000:00:09.0: PCI bridge to [bus 03]
[   49.424071] pcieport 0000:00:09.0:   bridge window [io  0x6000-0x6fff]
[   49.428157] pcieport 0000:00:09.0:   bridge window [mem 0xc1000000-0xc11fffff]
[   49.431379] pcieport 0000:00:09.0:   bridge window [mem 0x804000000-0x805ffffff 64bit pref]
[   49.436591] nvme nvme3: pci function 0000:03:00.0
[   49.438303] nvme 0000:03:00.0: enabling device (0000 -> 0002)
[   49.446746] nvme nvme3: 1/0/0 default/read/poll queues
(qemu) device_add nvme-ns,bus=nvme-rp90,drive=nvme-3,nsid=1
[   64.781295] nvme nvme3: rescanning namespaces.
[   64.806720] block nvme2n1: no available path - failing I/O

And I'm ending up with _4_ namespaces:
# nvme list
Node             SN                   Model                                    Namespace Usage                      Format           FW Rev
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1     SLESNVME1            QEMU NVMe Ctrl                           1          17.18  GB /  17.18  GB    512   B +  0 B   1.0
/dev/nvme1n1     SLESNVME2            QEMU NVMe Ctrl                           1           4.29  GB /   4.29  GB    512   B +  0 B   1.0
/dev/nvme2n1     SLESNVME3            QEMU NVMe Ctrl                           -1          0.00   B /   0.00   B      1   B +  0 B   1.0
/dev/nvme2n2     SLESNVME3            QEMU NVMe Ctrl                           1           4.29  GB /   4.29  GB    512   B +  0 B   1.0

and MD is still referencing the original one:

# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 nvme2n1[1] nvme1n1[0]
      4189184 blocks super 1.2 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

When doing I/O, MD will finally figure out that something is amiss:

[  152.636007] block nvme2n1: no available path - failing I/O
[  152.641562] block nvme2n1: no available path - failing I/O
[  152.645454] md: super_written gets error=-5
[  152.648799] md/raid1:md1: Disk failure on nvme2n1, disabling device.
[  152.648799] md/raid1:md1: Operation continuing on 1 devices.

But we're left with the problem that the re-attached namespace now has
a different device name (/dev/nvme2n2 vs /dev/nvme2n1), so MD cannot
pick the device up again seamlessly; manual intervention is needed.
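
(For the record, the manual intervention boils down to something
like:

# mdadm /dev/md1 --remove detached
# mdadm /dev/md1 --add /dev/nvme2n2

where '--remove detached' drops the vanished array member before the
renamed namespace can be re-added.)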

Note: this has been tested with the latest nvme-5.13 tree, and the
situation has improved somewhat compared to previous versions, in
that MD is now able to recover after manual intervention.
But we still end up with a namespace under the wrong name.

Cheers,

Hannes
