public inbox for stable@vger.kernel.org
From: Bjorn Helgaas <helgaas@kernel.org>
To: Jinhui Guo <guojinhui.liam@bytedance.com>
Cc: bhelgaas@google.com, bvanassche@acm.org,
	dan.j.williams@intel.com, alexander.h.duyck@linux.intel.com,
	gregkh@linuxfoundation.org, linux-pci@vger.kernel.org,
	linux-kernel@vger.kernel.org, stable@vger.kernel.org,
	Marco Crivellari <marco.crivellari@suse.com>,
	Tejun Heo <tj@kernel.org>
Subject: Re: [PATCH] PCI: Avoid work_on_cpu() in async probe workers
Date: Mon, 29 Dec 2025 11:20:00 -0600	[thread overview]
Message-ID: <20251229172000.GA68570@bhelgaas> (raw)
In-Reply-To: <20251227113326.964-1-guojinhui.liam@bytedance.com>

[+cc Marco, Tejun; just FYI since you have ongoing per-CPU wq work]

On Sat, Dec 27, 2025 at 07:33:26PM +0800, Jinhui Guo wrote:
> Commit ef0ff68351be ("driver core: Probe devices asynchronously instead of
> the driver") speeds up the loading of large numbers of device drivers by
> submitting asynchronous probe workers to an unbounded workqueue and binding
> each worker to the CPU near the device’s NUMA node. These workers are not
> scheduled on isolated CPUs because their cpumask is restricted to
> housekeeping_cpumask(HK_TYPE_WQ) and housekeeping_cpumask(HK_TYPE_DOMAIN).
> 
> However, when several PCI devices reside on the same NUMA node, their
> probes end up serialized on a single CPU instead of running in
> parallel, because pci_call_probe() invokes work_on_cpu(). This usage,
> introduced by commit 873392ca514f ("PCI: work_on_cpu: use in
> drivers/pci/pci-driver.c"), queues each probe on system_percpu_wq,
> binding it to the first CPU of the device’s NUMA node (chosen via
> cpumask_any_and() in pci_call_probe()).
> 
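The funneling onto one CPU can be illustrated with a userspace sketch
(an assumption-laden model, not kernel code: in current kernels
cpumask_any_and() is defined as cpumask_first_and(), so it
deterministically picks the lowest CPU set in both masks; masks are
modeled here as 64-bit words):

```c
#include <assert.h>
#include <stdint.h>

/* Userspace model of cpumask_first_and(): the lowest CPU present in
 * both the node mask and the workqueue housekeeping mask. */
static int first_cpu_and(uint64_t node_mask, uint64_t wq_domain_mask)
{
	uint64_t both = node_mask & wq_domain_mask;

	if (!both)
		return 64;		/* >= nr_cpu_ids: no eligible CPU */
	return __builtin_ctzll(both);	/* index of the lowest set bit */
}
```

With CPUs 0-7 on the node and all CPUs in the housekeeping mask, every
device on that node resolves to CPU 0, matching the "all on CPU 0"
measurements quoted further down.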
> 1. The function __driver_attach() submits an asynchronous worker with
>    callback __driver_attach_async_helper().
> 
>    __driver_attach()
>     async_schedule_dev(__driver_attach_async_helper, dev)
>      async_schedule_node(func, dev, dev_to_node(dev))
>       async_schedule_node_domain(func, data, node, &async_dfl_domain)
>        __async_schedule_node_domain(func, data, node, domain, entry)
>         queue_work_node(node, async_wq, &entry->work)
> 
> 2. The asynchronous probe worker ultimately calls work_on_cpu() in
>    pci_call_probe(), binding the worker to the same CPU within the
>    device’s NUMA node.
> 
>    __driver_attach_async_helper()
>     driver_probe_device(drv, dev)
>      __driver_probe_device(drv, dev)
>       really_probe(dev, drv)
>        call_driver_probe(dev, drv)
>         dev->bus->probe(dev)
>          pci_device_probe(dev)
>           __pci_device_probe(drv, pci_dev)
>            pci_call_probe(drv, pci_dev, id)
>             cpu = cpumask_any_and(cpumask_of_node(node), wq_domain_mask)
>             error = work_on_cpu(cpu, local_pci_probe, &ddi)
>              schedule_work_on(cpu, &wfc.work);
>               queue_work_on(cpu, system_percpu_wq, work)
> 
> To fix the issue, pci_call_probe() must not call work_on_cpu() when it
> is already running inside an asynchronous probe worker. Because a
> driver can be probed asynchronously either via its probe_type or via
> the kernel command line, we cannot rely on PROBE_PREFER_ASYNCHRONOUS
> alone. Instead, test the PF_WQ_WORKER flag in current->flags: if it is
> set, pci_call_probe() is already executing in a workqueue worker (here,
> the unbounded async probe worker) and should skip the extra
> work_on_cpu() call.
> 
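A userspace model of the patched guard, for illustration (the
PF_WQ_WORKER value is copied from include/linux/sched.h;
must_probe_locally() is a hypothetical name for the condition, not a
kernel helper):

```c
#include <assert.h>
#include <stdbool.h>

#define PF_WQ_WORKER	0x00000020	/* task is a workqueue worker */
#define MAX_NUMNODES	1024

/* Hypothetical helper mirroring the patched condition in
 * pci_call_probe(): when true, probe locally instead of queuing
 * work_on_cpu() onto the node's first CPU. */
static bool must_probe_locally(int node, bool node_online,
			       bool physfn_probed, unsigned int task_flags)
{
	return node < 0 || node >= MAX_NUMNODES || !node_online ||
	       physfn_probed || (task_flags & PF_WQ_WORKER);
}
```

An async probe worker always has PF_WQ_WORKER set, so it takes the
local-probe path; a synchronous probe from process context does not, and
still gets the NUMA-local work_on_cpu() behavior.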
> Testing three NVMe devices on the same NUMA node of an AMD EPYC 9A64
> 2.4 GHz processor shows a 35% probe-time improvement with the patch:
> 
> Before (all on CPU 0):
>   nvme 0000:01:00.0: CPU: 0, COMM: kworker/0:1, probe cost: 53372612 ns
>   nvme 0000:02:00.0: CPU: 0, COMM: kworker/0:2, probe cost: 49532941 ns
>   nvme 0000:03:00.0: CPU: 0, COMM: kworker/0:3, probe cost: 47315175 ns
> 
> After (spread across CPUs 1, 2, 5):
>   nvme 0000:01:00.0: CPU: 5, COMM: kworker/u1025:5, probe cost: 34765890 ns
>   nvme 0000:02:00.0: CPU: 1, COMM: kworker/u1025:2, probe cost: 34696433 ns
>   nvme 0000:03:00.0: CPU: 2, COMM: kworker/u1025:3, probe cost: 33233323 ns
> 
> The improvement grows with the number of PCI devices per node, since
> without the patch all of their probes contend for the same CPU.
> 
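For what it's worth, checking the quoted numbers (pairing devices by
BDF, which is an assumption about how the before/after runs correspond):
the slowest device improves by ~35% and the other two by ~30%, so the
headline figure appears to come from the slowest probe.

```c
#include <assert.h>

/* Percentage reduction in probe time, rounded to the nearest percent. */
static int pct_reduction(long long before_ns, long long after_ns)
{
	long long diff = before_ns - after_ns;

	return (int)((diff * 100 + before_ns / 2) / before_ns);
}
```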
> Fixes: ef0ff68351be ("driver core: Probe devices asynchronously instead of the driver")
> Cc: stable@vger.kernel.org
> Signed-off-by: Jinhui Guo <guojinhui.liam@bytedance.com>
> ---
>  drivers/pci/pci-driver.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
> index 7c2d9d596258..4bc47a84d330 100644
> --- a/drivers/pci/pci-driver.c
> +++ b/drivers/pci/pci-driver.c
> @@ -366,9 +366,11 @@ static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
>  	/*
>  	 * Prevent nesting work_on_cpu() for the case where a Virtual Function
>  	 * device is probed from work_on_cpu() of the Physical device.
> +	 * Check PF_WQ_WORKER to prevent invoking work_on_cpu() in an asynchronous
> +	 * probe worker when the driver allows asynchronous probing.
>  	 */
>  	if (node < 0 || node >= MAX_NUMNODES || !node_online(node) ||
> -	    pci_physfn_is_probed(dev)) {
> +	    pci_physfn_is_probed(dev) || (current->flags & PF_WQ_WORKER)) {
>  		cpu = nr_cpu_ids;
>  	} else {
>  		cpumask_var_t wq_domain_mask;
> -- 
> 2.20.1

Thread overview: 9+ messages
2025-12-27 11:33 [PATCH] PCI: Avoid work_on_cpu() in async probe workers Jinhui Guo
2025-12-29 17:20 ` Bjorn Helgaas [this message]
2025-12-29 18:08 ` Tejun Heo
2025-12-30 14:27   ` Jinhui Guo
2025-12-30 14:44     ` Jinhui Guo
2025-12-30 21:52     ` Bjorn Helgaas
2025-12-31  7:51       ` Jinhui Guo
2025-12-31 16:55         ` Bjorn Helgaas
2026-01-04 16:01           ` Danilo Krummrich
