All of lore.kernel.org
 help / color / mirror / Atom feed
From: Bjorn Helgaas <helgaas@kernel.org>
To: Jinhui Guo <guojinhui.liam@bytedance.com>
Cc: tj@kernel.org, alexander.h.duyck@linux.intel.com,
	bhelgaas@google.com, bvanassche@acm.org,
	dan.j.williams@intel.com, gregkh@linuxfoundation.org,
	linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org,
	stable@vger.kernel.org
Subject: Re: [PATCH] PCI: Avoid work_on_cpu() in async probe workers
Date: Tue, 30 Dec 2025 15:52:41 -0600	[thread overview]
Message-ID: <20251230215241.GA130710@bhelgaas> (raw)
In-Reply-To: <20251230142736.1168-1-guojinhui.liam@bytedance.com>

On Tue, Dec 30, 2025 at 10:27:36PM +0800, Jinhui Guo wrote:
> On Mon, Dec 29, 2025 at 08:08:57AM -1000, Tejun Heo wrote:
> > On Sat, Dec 27, 2025 at 07:33:26PM +0800, Jinhui Guo wrote:
> > > To fix the issue, pci_call_probe() must not call work_on_cpu() when it is
> > > already running inside an unbounded asynchronous worker. Because a driver
> > > can be probed asynchronously either by probe_type or by the kernel command
> > > line, we cannot rely on PROBE_PREFER_ASYNCHRONOUS alone. Instead, we test
> > > the PF_WQ_WORKER flag in current->flags; if it is set, pci_call_probe() is
> > > executing within an unbounded workqueue worker and should skip the extra
> > > work_on_cpu() call.
> > 
> > Why not just use queue_work_on() on system_dfl_wq (or any other unbound
> > workqueue)? Those are soft-affine to cache domain but can overflow to other
> > CPUs?
> 
> Hi, tejun,
> 
> Thank you for your time and helpful suggestions.
> I had considered replacing work_on_cpu() with queue_work_on(system_dfl_wq) +
> flush_work(), but that would be a refactor rather than a fix for the specific
> problem we hit.
> 
> Let me restate the issue:
> 
> 1. With PROBE_PREFER_ASYNCHRONOUS enabled, the driver core queues work on
>    async_wq to speed up driver probe.
> 2. The PCI core then calls work_on_cpu() to tie the probe thread to the PCI
>    device’s NUMA node, but it always picks the same CPU for every device on
>    that node, forcing the PCI probes to run serially.
> 
> Therefore I test current->flags & PF_WQ_WORKER to detect that we are already
> inside an async_wq worker and skip the extra nested work queue.
> 
> I agree with your point—using queue_work_on(system_dfl_wq) + flush_work()
> would be cleaner and would let different vendors’ drivers probe in parallel
> instead of fighting over the same CPU. I’ve prepared and tested another patch,
> but I’m still unsure it’s the better approach; any further suggestions would
> be greatly appreciated.
> 
> Test results for that patch:
>   nvme 0000:01:00.0: CPU: 2, COMM: kworker/u1025:3, probe cost: 34904955 ns
>   nvme 0000:02:00.0: CPU: 134, COMM: kworker/u1025:1, probe cost: 34774235 ns
>   nvme 0000:03:00.0: CPU: 1, COMM: kworker/u1025:4, probe cost: 34573054 ns
> 
> Key changes in the patch:
> 
> 1. Keep the current->flags & PF_WQ_WORKER test to avoid nested workers.
> 2. Replace work_on_cpu() with queue_work_node(system_dfl_wq) + flush_work()
>    to enable parallel probing when PROBE_PREFER_ASYNCHRONOUS is disabled.
> 3. Remove all cpumask operations.
> 4. Drop cpu_hotplug_disable() since both cpumask manipulation and work_on_cpu()
>    are gone.
> 
> The patch is shown below.

I love this patch because it makes pci_call_probe() so much simpler.

I *would* like a short higher-level description of the issue that
doesn't assume so much workqueue background.

I'm not an expert, but IIUC __driver_attach() schedules async workers
so driver probes can run in parallel, but the problem is that the
workers for devices on node X are currently serialized because they
all bind to the same CPU on that node.

Naive questions: It looks like async_schedule_dev() already schedules
an async worker on the device node, so why does pci_call_probe() need
to use queue_work_node() again?

pci_call_probe() dates to 2005 (d42c69972b85 ("[PATCH] PCI: Run PCI
driver initialization on local node")), but the async_schedule_dev()
looks like it was only added in 2019 (c37e20eaf4b2 ("driver core:
Attach devices on CPU local to device node")).  Maybe the
pci_call_probe() node awareness is no longer necessary?

> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
> index 7c2d9d5962586..e66a67c48f28d 100644
> --- a/drivers/pci/pci-driver.c
> +++ b/drivers/pci/pci-driver.c
> @@ -347,10 +347,24 @@ static bool pci_physfn_is_probed(struct pci_dev *dev)
>  #endif
>  }
> 
> +struct pci_probe_work {
> +    struct work_struct work;
> +    struct drv_dev_and_id ddi;
> +    int result;
> +};
> +
> +static void pci_probe_work_func(struct work_struct *work)
> +{
> +       struct pci_probe_work *pw = container_of(work, struct pci_probe_work, work);
> +
> +       pw->result = local_pci_probe(&pw->ddi);
> +}
> +
>  static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
>                           const struct pci_device_id *id)
>  {
>         int error, node, cpu;
> +       struct pci_probe_work pw;
>         struct drv_dev_and_id ddi = { drv, dev, id };
> 
>         /*
> @@ -361,38 +375,25 @@ static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
>         node = dev_to_node(&dev->dev);
>         dev->is_probed = 1;
> 
> -       cpu_hotplug_disable();
> -
>         /*
>          * Prevent nesting work_on_cpu() for the case where a Virtual Function
>          * device is probed from work_on_cpu() of the Physical device.
>          */
>         if (node < 0 || node >= MAX_NUMNODES || !node_online(node) ||
> -           pci_physfn_is_probed(dev)) {
> -               cpu = nr_cpu_ids;
> -       } else {
> -               cpumask_var_t wq_domain_mask;
> -
> -               if (!zalloc_cpumask_var(&wq_domain_mask, GFP_KERNEL)) {
> -                       error = -ENOMEM;
> -                       goto out;
> -               }
> -               cpumask_and(wq_domain_mask,
> -                           housekeeping_cpumask(HK_TYPE_WQ),
> -                           housekeeping_cpumask(HK_TYPE_DOMAIN));
> -
> -               cpu = cpumask_any_and(cpumask_of_node(node),
> -                                     wq_domain_mask);
> -               free_cpumask_var(wq_domain_mask);
> +           pci_physfn_is_probed(dev) || (current->flags & PF_WQ_WORKER)) {
> +               error = local_pci_probe(&ddi);
> +               goto out;
>         }
> 
> -       if (cpu < nr_cpu_ids)
> -               error = work_on_cpu(cpu, local_pci_probe, &ddi);
> -       else
> -               error = local_pci_probe(&ddi);
> +       INIT_WORK_ONSTACK(&pw.work, pci_probe_work_func);
> +       pw.ddi = ddi;
> +       queue_work_node(node, system_dfl_wq, &pw.work);
> +       flush_work(&pw.work);
> +       error = pw.result;
> +       destroy_work_on_stack(&pw.work);
> +
>  out:
>         dev->is_probed = 0;
> -       cpu_hotplug_enable();
>         return error;
>  }
> 
> 
> Best Regards,
> Jinhui

  parent reply	other threads:[~2025-12-30 21:52 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-12-27 11:33 [PATCH] PCI: Avoid work_on_cpu() in async probe workers Jinhui Guo
2025-12-29 17:20 ` Bjorn Helgaas
2025-12-29 18:08 ` Tejun Heo
2025-12-30 14:27   ` Jinhui Guo
2025-12-30 14:44     ` Jinhui Guo
2025-12-30 21:52     ` Bjorn Helgaas [this message]
2025-12-31  7:51       ` Jinhui Guo
2025-12-31 16:55         ` Bjorn Helgaas
2026-01-04 16:01           ` Danilo Krummrich

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20251230215241.GA130710@bhelgaas \
    --to=helgaas@kernel.org \
    --cc=alexander.h.duyck@linux.intel.com \
    --cc=bhelgaas@google.com \
    --cc=bvanassche@acm.org \
    --cc=dan.j.williams@intel.com \
    --cc=gregkh@linuxfoundation.org \
    --cc=guojinhui.liam@bytedance.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-pci@vger.kernel.org \
    --cc=stable@vger.kernel.org \
    --cc=tj@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.