linux-pci.vger.kernel.org archive mirror
* [PATCH v3] PCI: vmd: Create domain symlink before pci_bus_add_devices()
@ 2024-06-05 12:48 Jiwei Sun
  2024-06-05 16:57 ` Nirmal Patel
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Jiwei Sun @ 2024-06-05 12:48 UTC (permalink / raw)
  To: nirmal.patel, jonathan.derrick, paul.m.stillwell.jr
  Cc: lpieralisi, kw, robh, bhelgaas, linux-pci, linux-kernel, sunjw10,
	sjiwei, ahuang12

From: Jiwei Sun <sunjw10@lenovo.com>

During boot, the following error messages appear:

  (udev-worker)[2149]: nvme1n1: '/sbin/mdadm -I /dev/nvme1n1'(err) 'mdadm: Unable to get real path for '/sys/bus/pci/drivers/vmd/0000:c7:00.5/domain/device''
  (udev-worker)[2149]: nvme1n1: '/sbin/mdadm -I /dev/nvme1n1'(err) 'mdadm: /dev/nvme1n1 is not attached to Intel(R) RAID controller.'
  (udev-worker)[2149]: nvme1n1: '/sbin/mdadm -I /dev/nvme1n1'(err) 'mdadm: No OROM/EFI properties for /dev/nvme1n1'
  (udev-worker)[2149]: nvme1n1: '/sbin/mdadm -I /dev/nvme1n1'(err) 'mdadm: no RAID superblock on /dev/nvme1n1.'
  (udev-worker)[2149]: nvme1n1: Process '/sbin/mdadm -I /dev/nvme1n1' failed with exit code 1.

This symptom prevents the OS from booting successfully.

After an NVMe disk is probed/added by the nvme driver, udevd executes
some rule scripts that invoke the mdadm command to detect whether an
mdraid array is associated with this NVMe disk. mdadm determines whether
an NVMe device is connected to a particular VMD domain by checking the
"domain" symlink. Here is the root cause:

Thread A                   Thread B             Thread mdadm
vmd_enable_domain
  pci_bus_add_devices
    __driver_probe_device
     ...
     work_on_cpu
       schedule_work_on
       : wakeup Thread B
                           nvme_probe
                           : wakeup scan_work
                             to scan nvme disk
                             and add nvme disk
                             then wakeup udevd
                                                : udevd executes
                                                  mdadm command
       flush_work                               main
       : wait for nvme_probe done                ...
    __driver_probe_device                        find_driver_devices
    : probe next nvme device                     : 1) Detect the domain
    ...                                            symlink; 2) Find the
    ...                                            domain symlink from
    ...                                            vmd sysfs; 3) The
    ...                                            domain symlink is not
    ...                                            created yet, failed
  sysfs_create_link
  : create domain symlink

sysfs_create_link() is invoked at the end of vmd_enable_domain().
However, this implementation introduces a timing issue, where mdadm
might fail to retrieve the vmd symlink path because the symlink has not
been created yet.

Fix the issue by creating VMD domain symlinks before invoking
pci_bus_add_devices().

Signed-off-by: Jiwei Sun <sunjw10@lenovo.com>
Suggested-by: Adrian Huang <ahuang12@lenovo.com>
---
v3 changes:
 - Per Paul's comment, move sysfs_remove_link() after
   pci_stop_root_bus()

v2 changes:
 - Add "()" after function names in subject and commit log
 - Move sysfs_create_link() after vmd_attach_resources()

 drivers/pci/controller/vmd.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/pci/controller/vmd.c b/drivers/pci/controller/vmd.c
index 87b7856f375a..4e7fe2e13cac 100644
--- a/drivers/pci/controller/vmd.c
+++ b/drivers/pci/controller/vmd.c
@@ -925,6 +925,9 @@ static int vmd_enable_domain(struct vmd_dev *vmd, unsigned long features)
 		dev_set_msi_domain(&vmd->bus->dev,
 				   dev_get_msi_domain(&vmd->dev->dev));
 
+	WARN(sysfs_create_link(&vmd->dev->dev.kobj, &vmd->bus->dev.kobj,
+			       "domain"), "Can't create symlink to domain\n");
+
 	vmd_acpi_begin();
 
 	pci_scan_child_bus(vmd->bus);
@@ -964,9 +967,6 @@ static int vmd_enable_domain(struct vmd_dev *vmd, unsigned long features)
 	pci_bus_add_devices(vmd->bus);
 
 	vmd_acpi_end();
-
-	WARN(sysfs_create_link(&vmd->dev->dev.kobj, &vmd->bus->dev.kobj,
-			       "domain"), "Can't create symlink to domain\n");
 	return 0;
 }
 
@@ -1042,8 +1042,8 @@ static void vmd_remove(struct pci_dev *dev)
 {
 	struct vmd_dev *vmd = pci_get_drvdata(dev);
 
-	sysfs_remove_link(&vmd->dev->dev.kobj, "domain");
 	pci_stop_root_bus(vmd->bus);
+	sysfs_remove_link(&vmd->dev->dev.kobj, "domain");
 	pci_remove_root_bus(vmd->bus);
 	vmd_cleanup_srcu(vmd);
 	vmd_detach_resources(vmd);
-- 
2.27.0


* Re: [PATCH v3] PCI: vmd: Create domain symlink before pci_bus_add_devices()
  2024-06-05 12:48 [PATCH v3] PCI: vmd: Create domain symlink before pci_bus_add_devices() Jiwei Sun
@ 2024-06-05 16:57 ` Nirmal Patel
  2024-07-06  3:22 ` Krzysztof Wilczyński
  2024-07-09 20:59 ` Bjorn Helgaas
  2 siblings, 0 replies; 8+ messages in thread
From: Nirmal Patel @ 2024-06-05 16:57 UTC (permalink / raw)
  To: Jiwei Sun
  Cc: jonathan.derrick, paul.m.stillwell.jr, lpieralisi, kw, robh,
	bhelgaas, linux-pci, linux-kernel, sunjw10, ahuang12

On Wed,  5 Jun 2024 20:48:44 +0800
Jiwei Sun <sjiwei@163.com> wrote:

> From: Jiwei Sun <sunjw10@lenovo.com>
> 
> During booting into the kernel, the following error message appears:
> 
>   (udev-worker)[2149]: nvme1n1: '/sbin/mdadm -I /dev/nvme1n1'(err) 'mdadm: Unable to get real path for '/sys/bus/pci/drivers/vmd/0000:c7:00.5/domain/device''
>   (udev-worker)[2149]: nvme1n1: '/sbin/mdadm -I /dev/nvme1n1'(err) 'mdadm: /dev/nvme1n1 is not attached to Intel(R) RAID controller.'
>   (udev-worker)[2149]: nvme1n1: '/sbin/mdadm -I /dev/nvme1n1'(err) 'mdadm: No OROM/EFI properties for /dev/nvme1n1'
>   (udev-worker)[2149]: nvme1n1: '/sbin/mdadm -I /dev/nvme1n1'(err) 'mdadm: no RAID superblock on /dev/nvme1n1.'
>   (udev-worker)[2149]: nvme1n1: Process '/sbin/mdadm -I /dev/nvme1n1' failed with exit code 1.
> 
> This symptom prevents the OS from booting successfully.
> 
> After a NVMe disk is probed/added by the nvme driver, the udevd
> executes some rule scripts by invoking mdadm command to detect if
> there is a mdraid associated with this NVMe disk. The mdadm
> determines if one NVMe devce is connected to a particular VMD domain
> by checking the domain symlink. Here is the root cause:
> 
> Thread A                   Thread B             Thread mdadm
> vmd_enable_domain
>   pci_bus_add_devices
>     __driver_probe_device
>      ...
>      work_on_cpu
>        schedule_work_on
>        : wakeup Thread B
>                            nvme_probe
>                            : wakeup scan_work
>                              to scan nvme disk
>                              and add nvme disk
>                              then wakeup udevd
>                                                 : udevd executes
>                                                   mdadm command
>        flush_work                               main
>        : wait for nvme_probe done                ...
>     __driver_probe_device                        find_driver_devices
>     : probe next nvme device                     : 1) Detect the domain
>     ...                                            symlink; 2) Find the
>     ...                                            domain symlink from
>     ...                                            vmd sysfs; 3) The
>     ...                                            domain symlink is not
>     ...                                            created yet, failed
>   sysfs_create_link
>   : create domain symlink
> 
> sysfs_create_link() is invoked at the end of vmd_enable_domain().
> However, this implementation introduces a timing issue, where mdadm
> might fail to retrieve the vmd symlink path because the symlink has
> not been created yet.
> 
> Fix the issue by creating VMD domain symlinks before invoking
> pci_bus_add_devices().
> 
> Signed-off-by: Jiwei Sun <sunjw10@lenovo.com>
> Suggested-by: Adrian Huang <ahuang12@lenovo.com>
> ---
> v3 changes:
>  - Per Paul's comment, move sysfs_remove_link() after
>    pci_stop_root_bus()
> 
> v2 changes:
>  - Add "()" after function names in subject and commit log
>  - Move sysfs_create_link() after vmd_attach_resources()
> 
>  drivers/pci/controller/vmd.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/pci/controller/vmd.c b/drivers/pci/controller/vmd.c
> index 87b7856f375a..4e7fe2e13cac 100644
> --- a/drivers/pci/controller/vmd.c
> +++ b/drivers/pci/controller/vmd.c
> @@ -925,6 +925,9 @@ static int vmd_enable_domain(struct vmd_dev *vmd, unsigned long features)
>  		dev_set_msi_domain(&vmd->bus->dev,
>  				   dev_get_msi_domain(&vmd->dev->dev));
>  
> +	WARN(sysfs_create_link(&vmd->dev->dev.kobj, &vmd->bus->dev.kobj,
> +			       "domain"), "Can't create symlink to domain\n");
> +
>  	vmd_acpi_begin();
>  
>  	pci_scan_child_bus(vmd->bus);
> @@ -964,9 +967,6 @@ static int vmd_enable_domain(struct vmd_dev *vmd, unsigned long features)
>  	pci_bus_add_devices(vmd->bus);
>  
>  	vmd_acpi_end();
> -
> -	WARN(sysfs_create_link(&vmd->dev->dev.kobj, &vmd->bus->dev.kobj,
> -			       "domain"), "Can't create symlink to domain\n");
>  	return 0;
>  }
>  
> @@ -1042,8 +1042,8 @@ static void vmd_remove(struct pci_dev *dev)
>  {
>  	struct vmd_dev *vmd = pci_get_drvdata(dev);
>  
> -	sysfs_remove_link(&vmd->dev->dev.kobj, "domain");
>  	pci_stop_root_bus(vmd->bus);
> +	sysfs_remove_link(&vmd->dev->dev.kobj, "domain");
>  	pci_remove_root_bus(vmd->bus);
>  	vmd_cleanup_srcu(vmd);
>  	vmd_detach_resources(vmd);

Reviewed-by: Nirmal Patel <nirmal.patel@linux.intel.com>

Thanks

-nirmal

* Re: [PATCH v3] PCI: vmd: Create domain symlink before pci_bus_add_devices()
  2024-06-05 12:48 [PATCH v3] PCI: vmd: Create domain symlink before pci_bus_add_devices() Jiwei Sun
  2024-06-05 16:57 ` Nirmal Patel
@ 2024-07-06  3:22 ` Krzysztof Wilczyński
  2024-07-09 20:59 ` Bjorn Helgaas
  2 siblings, 0 replies; 8+ messages in thread
From: Krzysztof Wilczyński @ 2024-07-06  3:22 UTC (permalink / raw)
  To: Jiwei Sun
  Cc: nirmal.patel, jonathan.derrick, paul.m.stillwell.jr, lpieralisi,
	robh, bhelgaas, linux-pci, linux-kernel, sunjw10, ahuang12

Hello,

> During booting into the kernel, the following error message appears:
> 
>   (udev-worker)[2149]: nvme1n1: '/sbin/mdadm -I /dev/nvme1n1'(err) 'mdadm: Unable to get real path for '/sys/bus/pci/drivers/vmd/0000:c7:00.5/domain/device''
>   (udev-worker)[2149]: nvme1n1: '/sbin/mdadm -I /dev/nvme1n1'(err) 'mdadm: /dev/nvme1n1 is not attached to Intel(R) RAID controller.'
>   (udev-worker)[2149]: nvme1n1: '/sbin/mdadm -I /dev/nvme1n1'(err) 'mdadm: No OROM/EFI properties for /dev/nvme1n1'
>   (udev-worker)[2149]: nvme1n1: '/sbin/mdadm -I /dev/nvme1n1'(err) 'mdadm: no RAID superblock on /dev/nvme1n1.'
>   (udev-worker)[2149]: nvme1n1: Process '/sbin/mdadm -I /dev/nvme1n1' failed with exit code 1.
> 
> This symptom prevents the OS from booting successfully.
> 
> After a NVMe disk is probed/added by the nvme driver, the udevd executes
> some rule scripts by invoking mdadm command to detect if there is a
> mdraid associated with this NVMe disk. The mdadm determines if one
> NVMe devce is connected to a particular VMD domain by checking the
> domain symlink. Here is the root cause:
> 
> Thread A                   Thread B             Thread mdadm
> vmd_enable_domain
>   pci_bus_add_devices
>     __driver_probe_device
>      ...
>      work_on_cpu
>        schedule_work_on
>        : wakeup Thread B
>                            nvme_probe
>                            : wakeup scan_work
>                              to scan nvme disk
>                              and add nvme disk
>                              then wakeup udevd
>                                                 : udevd executes
>                                                   mdadm command
>        flush_work                               main
>        : wait for nvme_probe done                ...
>     __driver_probe_device                        find_driver_devices
>     : probe next nvme device                     : 1) Detect the domain
>     ...                                            symlink; 2) Find the
>     ...                                            domain symlink from
>     ...                                            vmd sysfs; 3) The
>     ...                                            domain symlink is not
>     ...                                            created yet, failed
>   sysfs_create_link
>   : create domain symlink
> 
> sysfs_create_link() is invoked at the end of vmd_enable_domain().
> However, this implementation introduces a timing issue, where mdadm
> might fail to retrieve the vmd symlink path because the symlink has not
> been created yet.
> 
> Fix the issue by creating VMD domain symlinks before invoking
> pci_bus_add_devices().

Applied to vmd, thank you!

[1/1] PCI: vmd: Create domain symlink before pci_bus_add_devices()
      https://git.kernel.org/pci/pci/c/7a13782e6150

	Krzysztof

* Re: [PATCH v3] PCI: vmd: Create domain symlink before pci_bus_add_devices()
  2024-06-05 12:48 [PATCH v3] PCI: vmd: Create domain symlink before pci_bus_add_devices() Jiwei Sun
  2024-06-05 16:57 ` Nirmal Patel
  2024-07-06  3:22 ` Krzysztof Wilczyński
@ 2024-07-09 20:59 ` Bjorn Helgaas
  2024-07-10 13:29   ` Jiwei Sun
  2 siblings, 1 reply; 8+ messages in thread
From: Bjorn Helgaas @ 2024-07-09 20:59 UTC (permalink / raw)
  To: Jiwei Sun
  Cc: nirmal.patel, jonathan.derrick, paul.m.stillwell.jr, lpieralisi,
	kw, robh, bhelgaas, linux-pci, linux-kernel, sunjw10, ahuang12,
	Pawel Baldysiak, Alexey Obitotskiy, Tomasz Majchrzak

[+cc Pawel, Alexey, Tomasz for mdadm history]

On Wed, Jun 05, 2024 at 08:48:44PM +0800, Jiwei Sun wrote:
> From: Jiwei Sun <sunjw10@lenovo.com>
> 
> During booting into the kernel, the following error message appears:
> 
>   (udev-worker)[2149]: nvme1n1: '/sbin/mdadm -I /dev/nvme1n1'(err) 'mdadm: Unable to get real path for '/sys/bus/pci/drivers/vmd/0000:c7:00.5/domain/device''
>   (udev-worker)[2149]: nvme1n1: '/sbin/mdadm -I /dev/nvme1n1'(err) 'mdadm: /dev/nvme1n1 is not attached to Intel(R) RAID controller.'
>   (udev-worker)[2149]: nvme1n1: '/sbin/mdadm -I /dev/nvme1n1'(err) 'mdadm: No OROM/EFI properties for /dev/nvme1n1'
>   (udev-worker)[2149]: nvme1n1: '/sbin/mdadm -I /dev/nvme1n1'(err) 'mdadm: no RAID superblock on /dev/nvme1n1.'
>   (udev-worker)[2149]: nvme1n1: Process '/sbin/mdadm -I /dev/nvme1n1' failed with exit code 1.
> 
> This symptom prevents the OS from booting successfully.

I guess the root filesystem must be on a RAID device, and it's the
failure to assemble that RAID device that prevents OS boot?  The
messages are just details about why the assembly failed?

> After a NVMe disk is probed/added by the nvme driver, the udevd executes
> some rule scripts by invoking mdadm command to detect if there is a
> mdraid associated with this NVMe disk. The mdadm determines if one
> NVMe devce is connected to a particular VMD domain by checking the
> domain symlink. Here is the root cause:

Can you tell us something about what makes this a vmd-specific issue?

I guess vmd is the only driver that creates a "domain" symlink, so
*that* part is vmd-specific.  But I guess there's something in mdadm
or its configuration that looks for that symlink?

I suppose it has to do with the mdadm code at [1] and the commit at
[2]?

[1] https://github.com/md-raid-utilities/mdadm/blob/96b8035a09b6449ea99f2eb91f9ba4f6912e5bd6/platform-intel.c#L199
[2] https://github.com/md-raid-utilities/mdadm/commit/60f0f54d6f5227f229e7131d34f93f76688b085f

I assume this is a race between vmd_enable_domain() and mdadm?  And
vmd_enable_domain() only loses the race sometimes?  Trying to figure
out why this hasn't been reported before or on non-VMD configurations.
Now that I found [2], the non-VMD part is obvious, but I'm still
curious about why we haven't seen it before.

The VMD device is sort of like another host bridge, and I wouldn't
think mdadm would normally care about a host bridge, but it looks like
mdadm does need to know about VMD for some reason.

> Thread A                   Thread B             Thread mdadm
> vmd_enable_domain
>   pci_bus_add_devices
>     __driver_probe_device
>      ...
>      work_on_cpu
>        schedule_work_on
>        : wakeup Thread B
>                            nvme_probe
>                            : wakeup scan_work
>                              to scan nvme disk
>                              and add nvme disk
>                              then wakeup udevd
>                                                 : udevd executes
>                                                   mdadm command
>        flush_work                               main
>        : wait for nvme_probe done                ...
>     __driver_probe_device                        find_driver_devices
>     : probe next nvme device                     : 1) Detect the domain
>     ...                                            symlink; 2) Find the
>     ...                                            domain symlink from
>     ...                                            vmd sysfs; 3) The
>     ...                                            domain symlink is not
>     ...                                            created yet, failed
>   sysfs_create_link
>   : create domain symlink
> 
> sysfs_create_link() is invoked at the end of vmd_enable_domain().
> However, this implementation introduces a timing issue, where mdadm
> might fail to retrieve the vmd symlink path because the symlink has not
> been created yet.
> 
> Fix the issue by creating VMD domain symlinks before invoking
> pci_bus_add_devices().
> 
> Signed-off-by: Jiwei Sun <sunjw10@lenovo.com>
> Suggested-by: Adrian Huang <ahuang12@lenovo.com>
> ---
> v3 changes:
>  - Per Paul's comment, move sysfs_remove_link() after
>    pci_stop_root_bus()
> 
> v2 changes:
>  - Add "()" after function names in subject and commit log
>  - Move sysfs_create_link() after vmd_attach_resources()
> 
>  drivers/pci/controller/vmd.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/pci/controller/vmd.c b/drivers/pci/controller/vmd.c
> index 87b7856f375a..4e7fe2e13cac 100644
> --- a/drivers/pci/controller/vmd.c
> +++ b/drivers/pci/controller/vmd.c
> @@ -925,6 +925,9 @@ static int vmd_enable_domain(struct vmd_dev *vmd, unsigned long features)
>  		dev_set_msi_domain(&vmd->bus->dev,
>  				   dev_get_msi_domain(&vmd->dev->dev));
>  
> +	WARN(sysfs_create_link(&vmd->dev->dev.kobj, &vmd->bus->dev.kobj,
> +			       "domain"), "Can't create symlink to domain\n");
> +
>  	vmd_acpi_begin();
>  
>  	pci_scan_child_bus(vmd->bus);
> @@ -964,9 +967,6 @@ static int vmd_enable_domain(struct vmd_dev *vmd, unsigned long features)
>  	pci_bus_add_devices(vmd->bus);
>  
>  	vmd_acpi_end();
> -
> -	WARN(sysfs_create_link(&vmd->dev->dev.kobj, &vmd->bus->dev.kobj,
> -			       "domain"), "Can't create symlink to domain\n");
>  	return 0;
>  }
>  
> @@ -1042,8 +1042,8 @@ static void vmd_remove(struct pci_dev *dev)
>  {
>  	struct vmd_dev *vmd = pci_get_drvdata(dev);
>  
> -	sysfs_remove_link(&vmd->dev->dev.kobj, "domain");
>  	pci_stop_root_bus(vmd->bus);
> +	sysfs_remove_link(&vmd->dev->dev.kobj, "domain");
>  	pci_remove_root_bus(vmd->bus);
>  	vmd_cleanup_srcu(vmd);
>  	vmd_detach_resources(vmd);
> -- 
> 2.27.0
> 

* Re: [PATCH v3] PCI: vmd: Create domain symlink before pci_bus_add_devices()
  2024-07-09 20:59 ` Bjorn Helgaas
@ 2024-07-10 13:29   ` Jiwei Sun
  2024-07-10 22:16     ` Bjorn Helgaas
  0 siblings, 1 reply; 8+ messages in thread
From: Jiwei Sun @ 2024-07-10 13:29 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: nirmal.patel, jonathan.derrick, paul.m.stillwell.jr, lpieralisi,
	kw, robh, bhelgaas, linux-pci, linux-kernel, sunjw10, ahuang12,
	Pawel Baldysiak, Alexey Obitotskiy, Tomasz Majchrzak


On 7/10/24 04:59, Bjorn Helgaas wrote:
> [+cc Pawel, Alexey, Tomasz for mdadm history]
> 
> On Wed, Jun 05, 2024 at 08:48:44PM +0800, Jiwei Sun wrote:
>> From: Jiwei Sun <sunjw10@lenovo.com>
>>
>> During booting into the kernel, the following error message appears:
>>
>>   (udev-worker)[2149]: nvme1n1: '/sbin/mdadm -I /dev/nvme1n1'(err) 'mdadm: Unable to get real path for '/sys/bus/pci/drivers/vmd/0000:c7:00.5/domain/device''
>>   (udev-worker)[2149]: nvme1n1: '/sbin/mdadm -I /dev/nvme1n1'(err) 'mdadm: /dev/nvme1n1 is not attached to Intel(R) RAID controller.'
>>   (udev-worker)[2149]: nvme1n1: '/sbin/mdadm -I /dev/nvme1n1'(err) 'mdadm: No OROM/EFI properties for /dev/nvme1n1'
>>   (udev-worker)[2149]: nvme1n1: '/sbin/mdadm -I /dev/nvme1n1'(err) 'mdadm: no RAID superblock on /dev/nvme1n1.'
>>   (udev-worker)[2149]: nvme1n1: Process '/sbin/mdadm -I /dev/nvme1n1' failed with exit code 1.
>>
>> This symptom prevents the OS from booting successfully.
> 
> I guess the root filesystem must be on a RAID device, and it's the
> failure to assemble that RAID device that prevents OS boot?  The
> messages are just details about why the assembly failed?

Yes, you are right. In our test environment, we installed SLES15 SP6
on a VROC RAID 1 device built from two NVMe drives. There is also a
hardware RAID kit on the motherboard with two other NVMe drives.

# lsblk
NAME        MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINTS
nvme0n1     259:1    0     7T  0 disk
├─nvme0n1p1 259:2    0   512M  0 part
├─nvme0n1p2 259:3    0    40G  0 part
└─nvme0n1p3 259:4    0   6.9T  0 part
nvme1n1     259:6    0   1.7T  0 disk
├─md126       9:126  0   1.7T  0 raid1
│ ├─md126p1 259:9    0   600M  0 part  /boot/efi
│ ├─md126p2 259:10   0     1G  0 part
│ ├─md126p3 259:11   0    40G  0 part  /usr/local
│ │                                    /var
│ │                                    /tmp
│ │                                    /srv
│ │                                    /root
│ │                                    /opt
│ │                                    /boot/grub2/x86_64-efi
│ │                                    /boot/grub2/i386-pc
│ │                                    /.snapshots
│ │                                    /
│ ├─md126p4 259:12   0   1.5T  0 part  /home
│ └─md126p5 259:13   0 125.5G  0 part  [SWAP]
└─md127       9:127  0     0B  0 md
nvme2n1     259:8    0   1.7T  0 disk
├─md126       9:126  0   1.7T  0 raid1
│ ├─md126p1 259:9    0   600M  0 part  /boot/efi
│ ├─md126p2 259:10   0     1G  0 part
│ ├─md126p3 259:11   0    40G  0 part  /usr/local
│ │                                    /var
│ │                                    /tmp
│ │                                    /srv
│ │                                    /root
│ │                                    /opt
│ │                                    /boot/grub2/x86_64-efi
│ │                                    /boot/grub2/i386-pc
│ │                                    /.snapshots
│ │                                    /
│ ├─md126p4 259:12   0   1.5T  0 part  /home
│ └─md126p5 259:13   0 125.5G  0 part  [SWAP]
└─md127       9:127  0     0B  0 md

nvme0n1 is the hardware RAID kit; nvme1n1 and nvme2n1 are the VROC drives.

The OS entered emergency mode after installation and reboot. According
to the emergency mode log, no RAID was detected on the first NVMe
drive, but when we tried to detect it manually with the following
command

  # /sbin/mdadm -I /dev/nvme1n1

it worked and the RAID device was found, which tells us that the RAID is
simply not detected during the boot process. After we added "rd.udev.debug"
to the kernel's cmdline, the following error logs appeared:

  (udev-worker)[2149]: nvme1n1: '/sbin/mdadm -I /dev/nvme1n1'(err) 'mdadm: Unable to get real path for '/sys/bus/pci/drivers/vmd/0000:c7:00.5/domain/device''
  (udev-worker)[2149]: nvme1n1: '/sbin/mdadm -I /dev/nvme1n1'(err) 'mdadm: /dev/nvme1n1 is not attached to Intel(R) RAID controller.'
  (udev-worker)[2149]: nvme1n1: '/sbin/mdadm -I /dev/nvme1n1'(err) 'mdadm: No OROM/EFI properties for /dev/nvme1n1'
  (udev-worker)[2149]: nvme1n1: '/sbin/mdadm -I /dev/nvme1n1'(err) 'mdadm: no RAID superblock on /dev/nvme1n1.'
  (udev-worker)[2149]: nvme1n1: Process '/sbin/mdadm -I /dev/nvme1n1' failed with exit code 1.

> 
>> After a NVMe disk is probed/added by the nvme driver, the udevd executes
>> some rule scripts by invoking mdadm command to detect if there is a
>> mdraid associated with this NVMe disk. The mdadm determines if one
>> NVMe devce is connected to a particular VMD domain by checking the
>> domain symlink. Here is the root cause:
> 
> Can you tell us something about what makes this a vmd-specific issue?
> 
> I guess vmd is the only driver that creates a "domain" symlink, so
> *that* part is vmd-specific.  But I guess there's something in mdadm
> or its configuration that looks for that symlink?

According to the following error log,

  mdadm: Unable to get real path for '/sys/bus/pci/drivers/vmd/0000:c7:00.5/domain/device

we enabled the dyndbg logs for drivers/base/* (added dyndbg="file drivers/base/* +p" to the cmdline)
and added some debug logs in find_driver_devices() in the mdadm source code [1].
We found that the "domain" symlink had not been created yet when the error log appeared.
And as you stated, the "domain" symlink is created by the vmd driver.

> 
> I suppose it has to do with the mdadm code at [1] and the commit at
> [2]?

Yes, you are right. mdadm determines which NVMe device is connected to the VMD
domain by checking "/sys/bus/%s/drivers/%s/%s/domain/device", according to the
following mdadm code [1]:
              /*
              * Each VMD device (domain) adds separate PCI bus, it is better
              * to store path as a path to that bus (easier further
              * determination which NVMe dev is connected to this particular
              * VMD domain).
              */
              if (type == SYS_DEV_VMD) {
                     sprintf(path, "/sys/bus/%s/drivers/%s/%s/domain/device",
                            bus, driver, de->d_name);
              }
              p = realpath(path, NULL);
              if (p == NULL) {
                     pr_err("Unable to get real path for '%s'\n", path);
                     continue;
              }

[1] https://github.com/md-raid-utilities/mdadm/blob/main/platform-intel.c#L208
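
For illustration, here is a minimal standalone sketch of the same check (not
the actual mdadm code; the file name and argument handling are just made up
for this example). It builds the VMD "domain" path for a given PCI address
and resolves it with realpath(), which fails in exactly this way when the
symlink has not been created yet; the PCI address 0000:c7:00.5 is the one
from the error log above.

  /*
   * vmd-domain-check.c - minimal sketch, not mdadm code.
   * Resolve the "domain" symlink of a VMD endpoint the same way the
   * quoted mdadm snippet does; realpath() fails if the vmd driver has
   * not created the symlink yet.
   *
   * Build: cc -o vmd-domain-check vmd-domain-check.c
   * Usage: ./vmd-domain-check 0000:c7:00.5
   */
  #include <limits.h>
  #include <stdio.h>
  #include <stdlib.h>

  int main(int argc, char **argv)
  {
  	char path[PATH_MAX];
  	char *real;

  	if (argc != 2) {
  		fprintf(stderr, "usage: %s <vmd-pci-address>\n", argv[0]);
  		return 1;
  	}

  	/* Same path pattern the quoted mdadm code builds for SYS_DEV_VMD */
  	snprintf(path, sizeof(path),
  		 "/sys/bus/pci/drivers/vmd/%s/domain/device", argv[1]);

  	real = realpath(path, NULL);
  	if (!real) {
  		perror(path);	/* e.g. ENOENT while the symlink is missing */
  		return 1;
  	}

  	printf("%s -> %s\n", path, real);
  	free(real);
  	return 0;
  }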

> 
> [1] https://github.com/md-raid-utilities/mdadm/blob/96b8035a09b6449ea99f2eb91f9ba4f6912e5bd6/platform-intel.c#L199
> [2] https://github.com/md-raid-utilities/mdadm/commit/60f0f54d6f5227f229e7131d34f93f76688b085f
> 
> I assume this is a race between vmd_enable_domain() and mdadm?  And
> vmd_enable_domain() only loses the race sometimes?  Trying to figure

Yes, you are right, what you said is the conclusion of our investigation.

> out why this hasn't been reported before or on non-VMD configurations.
> Now that I found [2], the non-VMD part is obvious, but I'm still
> curious about why we haven't seen it before.

According to our tests, if we remove that hardware RAID kit, the issue
cannot be reproduced. The device name of the hardware RAID kit is nvme0,
and the device names of the VROC drives are nvme1 and nvme2. It seems this
particular configuration makes the problem more apparent.

> 
> The VMD device is sort of like another host bridge, and I wouldn't
> think mdadm would normally care about a host bridge, but it looks like
> mdadm does need to know about VMD for some reason.

I think so; it would be better if the application did not have to pay too
much attention to hardware details. But as we can see, mdadm does use
the "domain" symlink.

Thanks,
Regards,
Jiwei

> 
>> Thread A                   Thread B             Thread mdadm
>> vmd_enable_domain
>>   pci_bus_add_devices
>>     __driver_probe_device
>>      ...
>>      work_on_cpu
>>        schedule_work_on
>>        : wakeup Thread B
>>                            nvme_probe
>>                            : wakeup scan_work
>>                              to scan nvme disk
>>                              and add nvme disk
>>                              then wakeup udevd
>>                                                 : udevd executes
>>                                                   mdadm command
>>        flush_work                               main
>>        : wait for nvme_probe done                ...
>>     __driver_probe_device                        find_driver_devices
>>     : probe next nvme device                     : 1) Detect the domain
>>     ...                                            symlink; 2) Find the
>>     ...                                            domain symlink from
>>     ...                                            vmd sysfs; 3) The
>>     ...                                            domain symlink is not
>>     ...                                            created yet, failed
>>   sysfs_create_link
>>   : create domain symlink
>>
>> sysfs_create_link() is invoked at the end of vmd_enable_domain().
>> However, this implementation introduces a timing issue, where mdadm
>> might fail to retrieve the vmd symlink path because the symlink has not
>> been created yet.
>>
>> Fix the issue by creating VMD domain symlinks before invoking
>> pci_bus_add_devices().
>>
>> Signed-off-by: Jiwei Sun <sunjw10@lenovo.com>
>> Suggested-by: Adrian Huang <ahuang12@lenovo.com>
>> ---
>> v3 changes:
>>  - Per Paul's comment, move sysfs_remove_link() after
>>    pci_stop_root_bus()
>>
>> v2 changes:
>>  - Add "()" after function names in subject and commit log
>>  - Move sysfs_create_link() after vmd_attach_resources()
>>
>>  drivers/pci/controller/vmd.c | 8 ++++----
>>  1 file changed, 4 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/pci/controller/vmd.c b/drivers/pci/controller/vmd.c
>> index 87b7856f375a..4e7fe2e13cac 100644
>> --- a/drivers/pci/controller/vmd.c
>> +++ b/drivers/pci/controller/vmd.c
>> @@ -925,6 +925,9 @@ static int vmd_enable_domain(struct vmd_dev *vmd, unsigned long features)
>>  		dev_set_msi_domain(&vmd->bus->dev,
>>  				   dev_get_msi_domain(&vmd->dev->dev));
>>  
>> +	WARN(sysfs_create_link(&vmd->dev->dev.kobj, &vmd->bus->dev.kobj,
>> +			       "domain"), "Can't create symlink to domain\n");
>> +
>>  	vmd_acpi_begin();
>>  
>>  	pci_scan_child_bus(vmd->bus);
>> @@ -964,9 +967,6 @@ static int vmd_enable_domain(struct vmd_dev *vmd, unsigned long features)
>>  	pci_bus_add_devices(vmd->bus);
>>  
>>  	vmd_acpi_end();
>> -
>> -	WARN(sysfs_create_link(&vmd->dev->dev.kobj, &vmd->bus->dev.kobj,
>> -			       "domain"), "Can't create symlink to domain\n");
>>  	return 0;
>>  }
>>  
>> @@ -1042,8 +1042,8 @@ static void vmd_remove(struct pci_dev *dev)
>>  {
>>  	struct vmd_dev *vmd = pci_get_drvdata(dev);
>>  
>> -	sysfs_remove_link(&vmd->dev->dev.kobj, "domain");
>>  	pci_stop_root_bus(vmd->bus);
>> +	sysfs_remove_link(&vmd->dev->dev.kobj, "domain");
>>  	pci_remove_root_bus(vmd->bus);
>>  	vmd_cleanup_srcu(vmd);
>>  	vmd_detach_resources(vmd);
>> -- 
>> 2.27.0
>>


* Re: [PATCH v3] PCI: vmd: Create domain symlink before pci_bus_add_devices()
  2024-07-10 13:29   ` Jiwei Sun
@ 2024-07-10 22:16     ` Bjorn Helgaas
  2024-07-11  1:32       ` Jiwei Sun
  0 siblings, 1 reply; 8+ messages in thread
From: Bjorn Helgaas @ 2024-07-10 22:16 UTC (permalink / raw)
  To: Jiwei Sun
  Cc: nirmal.patel, jonathan.derrick, paul.m.stillwell.jr, lpieralisi,
	kw, robh, bhelgaas, linux-pci, linux-kernel, sunjw10, ahuang12

[-cc Pawel, Alexey, Tomasz, which all bounced]

On Wed, Jul 10, 2024 at 09:29:25PM +0800, Jiwei Sun wrote:
> On 7/10/24 04:59, Bjorn Helgaas wrote:
> > [+cc Pawel, Alexey, Tomasz for mdadm history]
> > On Wed, Jun 05, 2024 at 08:48:44PM +0800, Jiwei Sun wrote:
> >> From: Jiwei Sun <sunjw10@lenovo.com>
> >>
> >> During booting into the kernel, the following error message appears:
> >>
> >>   (udev-worker)[2149]: nvme1n1: '/sbin/mdadm -I /dev/nvme1n1'(err) 'mdadm: Unable to get real path for '/sys/bus/pci/drivers/vmd/0000:c7:00.5/domain/device''
> >>   (udev-worker)[2149]: nvme1n1: '/sbin/mdadm -I /dev/nvme1n1'(err) 'mdadm: /dev/nvme1n1 is not attached to Intel(R) RAID controller.'
> >>   (udev-worker)[2149]: nvme1n1: '/sbin/mdadm -I /dev/nvme1n1'(err) 'mdadm: No OROM/EFI properties for /dev/nvme1n1'
> >>   (udev-worker)[2149]: nvme1n1: '/sbin/mdadm -I /dev/nvme1n1'(err) 'mdadm: no RAID superblock on /dev/nvme1n1.'
> >>   (udev-worker)[2149]: nvme1n1: Process '/sbin/mdadm -I /dev/nvme1n1' failed with exit code 1.
> >>
> >> This symptom prevents the OS from booting successfully.
> > 
> > I guess the root filesystem must be on a RAID device, and it's the
> > failure to assemble that RAID device that prevents OS boot?  The
> > messages are just details about why the assembly failed?
> 
> Yes, you are right, in our test environment, we installed the SLES15SP6
> on a VROC RAID 1 device which is set up by two NVME hard drivers. And
> there is also a hardware RAID kit on the motherboard with other two NVME 
> hard drivers.

OK, thanks for all the details.  What would you think of updating the
commit log like this?

  The vmd driver creates a "domain" symlink in sysfs for each VMD bridge.
  Previously this symlink was created after pci_bus_add_devices() added
  devices below the VMD bridge and emitted udev events to announce them to
  userspace.

  This led to a race between userspace consumers of the udev events and the
  kernel creation of the symlink.  One such consumer is mdadm, which
  assembles block devices into a RAID array, and for devices below a VMD
  bridge, mdadm depends on the "domain" symlink.

  If mdadm loses the race, it may be unable to assemble a RAID array, which
  may cause a boot failure or other issues, with complaints like this:

  ...

  Create the VMD "domain" symlink before invoking pci_bus_add_devices() to
  avoid this race.

* Re: [PATCH v3] PCI: vmd: Create domain symlink before pci_bus_add_devices()
  2024-07-10 22:16     ` Bjorn Helgaas
@ 2024-07-11  1:32       ` Jiwei Sun
  2024-07-11 16:12         ` Bjorn Helgaas
  0 siblings, 1 reply; 8+ messages in thread
From: Jiwei Sun @ 2024-07-11  1:32 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: nirmal.patel, jonathan.derrick, paul.m.stillwell.jr, lpieralisi,
	kw, robh, bhelgaas, linux-pci, linux-kernel, sunjw10, ahuang12


On 7/11/24 06:16, Bjorn Helgaas wrote:
> [-cc Pawel, Alexey, Tomasz, which all bounced]
> 
> On Wed, Jul 10, 2024 at 09:29:25PM +0800, Jiwei Sun wrote:
>> On 7/10/24 04:59, Bjorn Helgaas wrote:
>>> [+cc Pawel, Alexey, Tomasz for mdadm history]
>>> On Wed, Jun 05, 2024 at 08:48:44PM +0800, Jiwei Sun wrote:
>>>> From: Jiwei Sun <sunjw10@lenovo.com>
>>>>
>>>> During booting into the kernel, the following error message appears:
>>>>
>>>>   (udev-worker)[2149]: nvme1n1: '/sbin/mdadm -I /dev/nvme1n1'(err) 'mdadm: Unable to get real path for '/sys/bus/pci/drivers/vmd/0000:c7:00.5/domain/device''
>>>>   (udev-worker)[2149]: nvme1n1: '/sbin/mdadm -I /dev/nvme1n1'(err) 'mdadm: /dev/nvme1n1 is not attached to Intel(R) RAID controller.'
>>>>   (udev-worker)[2149]: nvme1n1: '/sbin/mdadm -I /dev/nvme1n1'(err) 'mdadm: No OROM/EFI properties for /dev/nvme1n1'
>>>>   (udev-worker)[2149]: nvme1n1: '/sbin/mdadm -I /dev/nvme1n1'(err) 'mdadm: no RAID superblock on /dev/nvme1n1.'
>>>>   (udev-worker)[2149]: nvme1n1: Process '/sbin/mdadm -I /dev/nvme1n1' failed with exit code 1.
>>>>
>>>> This symptom prevents the OS from booting successfully.
>>>
>>> I guess the root filesystem must be on a RAID device, and it's the
>>> failure to assemble that RAID device that prevents OS boot?  The
>>> messages are just details about why the assembly failed?
>>
>> Yes, you are right, in our test environment, we installed the SLES15SP6
>> on a VROC RAID 1 device which is set up by two NVME hard drivers. And
>> there is also a hardware RAID kit on the motherboard with other two NVME 
>> hard drivers.
> 
> OK, thanks for all the details.  What would you think of updating the
> commit log like this?

Thanks, I think this commit log is clearer than before. Do I need to 
send another v4 patch for the changes?

Thanks,
Regards,
Jiwei

> 
>   The vmd driver creates a "domain" symlink in sysfs for each VMD bridge.
>   Previously this symlink was created after pci_bus_add_devices() added
>   devices below the VMD bridge and emitted udev events to announce them to
>   userspace.
> 
>   This led to a race between userspace consumers of the udev events and the
>   kernel creation of the symlink.  One such consumer is mdadm, which
>   assembles block devices into a RAID array, and for devices below a VMD
>   bridge, mdadm depends on the "domain" symlink.
> 
>   If mdadm loses the race, it may be unable to assemble a RAID array, which
>   may cause a boot failure or other issues, with complaints like this:
> 
>   ...
> 
>   Create the VMD "domain" symlink before invoking pci_bus_add_devices() to
>   avoid this race.


* Re: [PATCH v3] PCI: vmd: Create domain symlink before pci_bus_add_devices()
  2024-07-11  1:32       ` Jiwei Sun
@ 2024-07-11 16:12         ` Bjorn Helgaas
  0 siblings, 0 replies; 8+ messages in thread
From: Bjorn Helgaas @ 2024-07-11 16:12 UTC (permalink / raw)
  To: Jiwei Sun
  Cc: nirmal.patel, jonathan.derrick, paul.m.stillwell.jr, lpieralisi,
	kw, robh, bhelgaas, linux-pci, linux-kernel, sunjw10, ahuang12

On Thu, Jul 11, 2024 at 09:32:46AM +0800, Jiwei Sun wrote:
> 
> On 7/11/24 06:16, Bjorn Helgaas wrote:
> > [-cc Pawel, Alexey, Tomasz, which all bounced]
> > 
> > On Wed, Jul 10, 2024 at 09:29:25PM +0800, Jiwei Sun wrote:
> >> On 7/10/24 04:59, Bjorn Helgaas wrote:
> >>> [+cc Pawel, Alexey, Tomasz for mdadm history]
> >>> On Wed, Jun 05, 2024 at 08:48:44PM +0800, Jiwei Sun wrote:
> >>>> From: Jiwei Sun <sunjw10@lenovo.com>
> >>>>
> >>>> During booting into the kernel, the following error message appears:
> >>>>
> >>>>   (udev-worker)[2149]: nvme1n1: '/sbin/mdadm -I /dev/nvme1n1'(err) 'mdadm: Unable to get real path for '/sys/bus/pci/drivers/vmd/0000:c7:00.5/domain/device''
> >>>>   (udev-worker)[2149]: nvme1n1: '/sbin/mdadm -I /dev/nvme1n1'(err) 'mdadm: /dev/nvme1n1 is not attached to Intel(R) RAID controller.'
> >>>>   (udev-worker)[2149]: nvme1n1: '/sbin/mdadm -I /dev/nvme1n1'(err) 'mdadm: No OROM/EFI properties for /dev/nvme1n1'
> >>>>   (udev-worker)[2149]: nvme1n1: '/sbin/mdadm -I /dev/nvme1n1'(err) 'mdadm: no RAID superblock on /dev/nvme1n1.'
> >>>>   (udev-worker)[2149]: nvme1n1: Process '/sbin/mdadm -I /dev/nvme1n1' failed with exit code 1.
> >>>>
> >>>> This symptom prevents the OS from booting successfully.
> >>>
> >>> I guess the root filesystem must be on a RAID device, and it's the
> >>> failure to assemble that RAID device that prevents OS boot?  The
> >>> messages are just details about why the assembly failed?
> >>
> >> Yes, you are right, in our test environment, we installed the SLES15SP6
> >> on a VROC RAID 1 device which is set up by two NVME hard drivers. And
> >> there is also a hardware RAID kit on the motherboard with other two NVME 
> >> hard drivers.
> > 
> > OK, thanks for all the details.  What would you think of updating the
> > commit log like this?
> 
> Thanks, I think this commit log is clearer than before. Do I need to 
> send another v4 patch for the changes?

No need, if you think it's OK, I can update the commit log locally.

> >   The vmd driver creates a "domain" symlink in sysfs for each VMD bridge.
> >   Previously this symlink was created after pci_bus_add_devices() added
> >   devices below the VMD bridge and emitted udev events to announce them to
> >   userspace.
> > 
> >   This led to a race between userspace consumers of the udev events and the
> >   kernel creation of the symlink.  One such consumer is mdadm, which
> >   assembles block devices into a RAID array, and for devices below a VMD
> >   bridge, mdadm depends on the "domain" symlink.
> > 
> >   If mdadm loses the race, it may be unable to assemble a RAID array, which
> >   may cause a boot failure or other issues, with complaints like this:
> > 
> >   ...
> > 
> >   Create the VMD "domain" symlink before invoking pci_bus_add_devices() to
> >   avoid this race.
> 
