Linux PCI subsystem development
* pci_probe called concurrently in machine with 2 identical PCI devices causing race condition
From: Jozef Matejcik (Nokia) @ 2025-06-26 10:14 UTC (permalink / raw)
  To: linux-pci@vger.kernel.org

Hi kernel community,

We have a specific problem related to the Linux PCI subsystem.

We have a device with 2 identical NPUs, i.e. 2 identical PCI devices sharing the same 3rd-party driver. Our problem is that _pci_probe of this driver is called concurrently from 2 kernel threads. It happens more frequently when kernel debug logs are enabled in GRUB, approximately on every 20th to 30th reboot of the device.

I am writing this mail because this may be a generic issue in the Linux PCI subsystem that could affect more people/companies - please correct me if I am wrong.

While digging into this in the driver's source and the Linux kernel source, I found this code in pci_call_probe:

    if (cpu < nr_cpu_ids)
        error = work_on_cpu(cpu, local_pci_probe, &ddi);
    else
        error = local_pci_probe(&ddi);

This was added in commit 0b2c2a71 in 2017. Quoting part of the commit message:

    PCI: Replace the racy recursion prevention

    pci_call_probe() can called recursively when a physcial function is probed
    and the probing creates virtual functions, which are populated via
    pci_bus_add_device() which in turn can end up calling pci_call_probe()
    again.
 <end of quote>

So the fix is specifically related to devices with multiple VFs. But does it take into account a setup with 2 separate but otherwise identical PCI devices? Could this occur on any machine with 2 identical PCI devices?

Snippet from dmesg (unfortunately, I am not sure how much I can share):

[   76.586492] linux-kernel-bde (154): DO_NOT_COMMIT: in _pci_probe at 2627
[   76.586494] linux-kernel-bde (154): DO_NOT_COMMIT: ctrl addr before: 0000000000000000, _ndevices: 0
[   76.586497] linux-kernel-bde (154): DO_NOT_COMMIT: ctrl addr after: 00000000f24dc905, _ndevices: 0
[   76.595735] linux-kernel-bde (4688): DO_NOT_COMMIT: _devices at 00000000f24dc905, sizeof(*_devices): 472
[   76.603415] linux-kernel-bde (154): DO_NOT_COMMIT: ctrl->dev_type set to 256
[   76.628884] linux-kernel-bde (4688): DO_NOT_COMMIT: dev->device: 8854
[   76.644076] linux-kernel-bde (4688): DO_NOT_COMMIT: in _pci_probe at 2627
[   76.661176] linux-kernel-bde (4688): DO_NOT_COMMIT: ctrl addr before: 0000000000000000, _ndevices: 0
[   76.679854] linux-kernel-bde (4688): DO_NOT_COMMIT: ctrl addr after: 00000000f24dc905, _ndevices: 0
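
In the log above, both probe threads (154 and 4688) read _ndevices: 0 and end up with the same ctrl address, which is what an unsynchronized "read index, take slot, increment" sequence on a shared table would produce. The following is a minimal userspace sketch of serializing that sequence, using pthreads rather than kernel primitives. It is not the driver's code: every demo_* name is hypothetical, and the table layout is an assumption made only for illustration. In a kernel driver the analogous primitive would presumably be a mutex from <linux/mutex.h> (DEFINE_MUTEX, mutex_lock, mutex_unlock).

```c
/* Userspace sketch with pthreads; NOT kernel code and NOT the real driver.
 * Every name here (demo_*) is hypothetical. */
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

#define DEMO_MAX_DEVICES 4

struct demo_ctrl {
    int in_use;     /* set once a probe has claimed this slot */
    int dev_type;   /* filled in by the probe that owns the slot */
};

static struct demo_ctrl demo_devices[DEMO_MAX_DEVICES]; /* shared table */
static int demo_ndevices;                               /* next free index */
static pthread_mutex_t demo_lock = PTHREAD_MUTEX_INITIALIZER;

/* Stand-in for a probe callback that may run on several threads at once. */
static void *demo_probe(void *arg)
{
    int dev_type = *(int *)arg;
    struct demo_ctrl *ctrl = NULL;

    /* Read the index, claim the slot and bump the counter as one step. */
    pthread_mutex_lock(&demo_lock);
    if (demo_ndevices < DEMO_MAX_DEVICES) {
        ctrl = &demo_devices[demo_ndevices++];
        assert(!ctrl->in_use);  /* a slot must never be handed out twice */
        ctrl->in_use = 1;
    }
    pthread_mutex_unlock(&demo_lock);

    if (ctrl)
        ctrl->dev_type = dev_type;  /* slot now belongs to this thread only */
    return ctrl;
}

/* Run n probes concurrently; return how many distinct slots were claimed,
 * or -1 on error. */
int demo_run_probes(int n)
{
    pthread_t threads[DEMO_MAX_DEVICES];
    void *slots[DEMO_MAX_DEVICES];
    int types[DEMO_MAX_DEVICES];
    int started = 0, distinct = 0, failed = 0;
    int i, j;

    if (n < 0 || n > DEMO_MAX_DEVICES)
        return -1;

    /* Reset shared state; no probe thread is running at this point. */
    memset(demo_devices, 0, sizeof(demo_devices));
    demo_ndevices = 0;

    for (i = 0; i < n; i++) {
        types[i] = 256;  /* arbitrary value, echoing dev_type in the log */
        if (pthread_create(&threads[i], NULL, demo_probe, &types[i]) != 0) {
            failed = 1;
            break;
        }
        started++;
    }

    /* Wait for every probe that was actually started. */
    for (i = 0; i < started; i++)
        pthread_join(threads[i], &slots[i]);

    if (failed)
        return -1;

    /* Count slots that were not already returned to an earlier probe. */
    for (i = 0; i < n; i++) {
        int seen = 0;

        if (slots[i] == NULL)
            continue;
        for (j = 0; j < i; j++)
            if (slots[j] == slots[i])
                seen = 1;
        if (!seen)
            distinct++;
    }
    return distinct;
}
```

Calling demo_run_probes(2) models the two-NPU case: with the lock held around the slot reservation, each probe gets its own entry. Without the lock, both threads could observe the same index, as the log suggests.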

I checked the sources of several drivers for various PCI devices, but none of them seems to assume that the probe callback can be called concurrently from multiple threads.
Output of uname -a:
Linux Dut-A 6.1.128-13-amd64 #1 SMP PREEMPT_DYNAMIC Thu Jun 12 07:22:21 UTC 2025 x86_64 GNU/Linux

Regards,
Jozef
