public inbox for stable@vger.kernel.org
From: Bjorn Helgaas <helgaas@kernel.org>
To: Jinhui Guo <guojinhui.liam@bytedance.com>
Cc: alexander.h.duyck@linux.intel.com,
	Bjorn Helgaas <bhelgaas@google.com>,
	Bart Van Assche <bvanassche@acm.org>,
	Dan Williams <dan.j.williams@intel.com>,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org,
	stable@vger.kernel.org, Tejun Heo <tj@kernel.org>,
	Danilo Krummrich <dakr@kernel.org>,
	Alexander Duyck <alexanderduyck@fb.com>,
	"Rafael J. Wysocki" <rafael@kernel.org>
Subject: Re: [PATCH] PCI: Avoid work_on_cpu() in async probe workers
Date: Wed, 31 Dec 2025 10:55:03 -0600	[thread overview]
Message-ID: <20251231165503.GA159243@bhelgaas> (raw)
In-Reply-To: <20251231075105.1368-1-guojinhui.liam@bytedance.com>

[+cc Rafael, Danilo (driver core question), update Alexander's email]

On Wed, Dec 31, 2025 at 03:51:05PM +0800, Jinhui Guo wrote:
> On Tue, Dec 30, 2025 at 03:52:41PM -0600, Bjorn Helgaas wrote:
> > On Tue, Dec 30, 2025 at 10:27:36PM +0800, Jinhui Guo wrote:
> > > On Mon, Dec 29, 2025 at 08:08:57AM -1000, Tejun Heo wrote:
> > > > On Sat, Dec 27, 2025 at 07:33:26PM +0800, Jinhui Guo wrote:
> > > > > To fix the issue, pci_call_probe() must not call work_on_cpu() when it is
> > > > > already running inside an unbound asynchronous worker. Because a driver
> > > > > can be probed asynchronously either by probe_type or by the kernel command
> > > > > line, we cannot rely on PROBE_PREFER_ASYNCHRONOUS alone. Instead, we test
> > > > > the PF_WQ_WORKER flag in current->flags; if it is set, pci_call_probe() is
> > > > > executing within an unbound workqueue worker and should skip the extra
> > > > > work_on_cpu() call.
> > > > 
> > > > Why not just use queue_work_on() on system_dfl_wq (or any other unbound
> > > > workqueue)? Those are soft-affine to cache domain but can overflow to other
> > > > CPUs?
> > > 
> > > Hi, Tejun,
> > > 
> > > Thank you for your time and helpful suggestions.
> > > I had considered replacing work_on_cpu() with queue_work_on(system_dfl_wq) +
> > > flush_work(), but that would be a refactor rather than a fix for the specific
> > > problem we hit.
> > > 
> > > Let me restate the issue:
> > > 
> > > 1. With PROBE_PREFER_ASYNCHRONOUS enabled, the driver core queues work on
> > >    async_wq to speed up driver probe.
> > > 2. The PCI core then calls work_on_cpu() to tie the probe thread to the PCI
> > >    device’s NUMA node, but it always picks the same CPU for every device on
> > >    that node, forcing the PCI probes to run serially.
> > > 
> > > Therefore I test current->flags & PF_WQ_WORKER to detect that we are already
> > > inside an async_wq worker and skip the extra nested work queue.
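
To make that concrete, the guard being described is roughly the sketch
below (my paraphrase of the idea, not the posted diff; local_pci_probe()
and the ddi helper struct are the existing code in
drivers/pci/pci-driver.c):

```c
static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
			  const struct pci_device_id *id)
{
	struct drv_dev_and_data ddi = { drv, dev, id };
	...
	/*
	 * Async probe already runs in an unbound workqueue worker;
	 * nesting work_on_cpu() there funnels every probe for this
	 * node onto one CPU, so just probe in place instead.
	 */
	if (current->flags & PF_WQ_WORKER)
		return local_pci_probe(&ddi);

	/* Otherwise fall through to the existing work_on_cpu() path. */
	...
}
```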
> > > 
> > > I agree with your point: using queue_work_on(system_dfl_wq) + flush_work()
> > > would be cleaner and would let different vendors’ drivers probe in parallel
> > > instead of fighting over the same CPU. I’ve prepared and tested another
> > > patch, but I’m still unsure it is the better approach; any further
> > > suggestions would be greatly appreciated.
> > > 
> > > Test results for that patch:
> > >   nvme 0000:01:00.0: CPU: 2, COMM: kworker/u1025:3, probe cost: 34904955 ns
> > >   nvme 0000:02:00.0: CPU: 134, COMM: kworker/u1025:1, probe cost: 34774235 ns
> > >   nvme 0000:03:00.0: CPU: 1, COMM: kworker/u1025:4, probe cost: 34573054 ns
> > > 
> > > Key changes in the patch:
> > > 
> > > 1. Keep the current->flags & PF_WQ_WORKER test to avoid nested workers.
> > > 2. Replace work_on_cpu() with queue_work_node(system_dfl_wq) + flush_work()
> > >    to enable parallel probing when PROBE_PREFER_ASYNCHRONOUS is disabled.
> > > 3. Remove all cpumask operations.
> > > 4. Drop cpu_hotplug_disable() since both cpumask manipulation and work_on_cpu()
> > >    are gone.
> > > 
> > > The patch is shown below.
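
(The patch body itself is trimmed from the quote here. From the
description above, its shape is roughly the following sketch; the
pci_probe_work wrapper name and layout are my paraphrase, not the
actual code.)

```c
/* Wrapper so the existing local_pci_probe() can run as a work item. */
struct pci_probe_work {
	struct work_struct work;
	struct drv_dev_and_data *ddi;	/* existing helper struct */
	long ret;
};

static void pci_probe_work_fn(struct work_struct *work)
{
	struct pci_probe_work *pw =
		container_of(work, struct pci_probe_work, work);

	pw->ret = local_pci_probe(pw->ddi);
}

static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
			  const struct pci_device_id *id)
{
	struct drv_dev_and_data ddi = { drv, dev, id };
	struct pci_probe_work pw = { .ddi = &ddi };

	/* Already in an unbound worker (async probe): probe in place. */
	if (current->flags & PF_WQ_WORKER)
		return local_pci_probe(&ddi);

	/* Sync path: run the probe near the device's NUMA node. */
	INIT_WORK_ONSTACK(&pw.work, pci_probe_work_fn);
	queue_work_node(dev_to_node(&dev->dev), system_dfl_wq, &pw.work);
	flush_work(&pw.work);
	destroy_work_on_stack(&pw.work);

	return pw.ret;
}
```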
> > 
> > I love this patch because it makes pci_call_probe() so much simpler.
> > 
> > I *would* like a short higher-level description of the issue that
> > doesn't assume so much workqueue background.
> > 
> > I'm not an expert, but IIUC __driver_attach() schedules async workers
> > so driver probes can run in parallel, but the problem is that the
> > workers for devices on node X are currently serialized because they
> > all bind to the same CPU on that node.
> > 
> > Naive questions: It looks like async_schedule_dev() already schedules
> > an async worker on the device node, so why does pci_call_probe() need
> > to use queue_work_node() again?
> > 
> > pci_call_probe() dates to 2005 (d42c69972b85 ("[PATCH] PCI: Run PCI
> > driver initialization on local node")), but the async_schedule_dev()
> > looks like it was only added in 2019 (c37e20eaf4b2 ("driver core:
> > Attach devices on CPU local to device node")).  Maybe the
> > pci_call_probe() node awareness is no longer necessary?
> 
> Hi, Bjorn
> 
> Thank you for your time and kind reply.
> 
> As I see it, two scenarios should be borne in mind:
> 
> 1. Driver allowed to probe asynchronously
>    The driver core schedules async workers via async_schedule_dev(),
>    so pci_call_probe() needs no extra queue_work_node().
> 
> 2. Driver not allowed to probe asynchronously
>    The driver core (__driver_attach() or __device_attach()) calls
>    pci_call_probe() directly, without any async worker from
>    async_schedule_dev(). NUMA-node awareness in pci_call_probe()
>    is therefore still required.

Good point, we need the NUMA awareness in both sync and async probe
paths.

But node affinity is orthogonal to the sync/async question, so it
seems weird to deal with affinity in two separate places.  It also
seems sub-optimal to have node affinity in the driver core async path
but not the synchronous probe path.

Maybe driver_probe_device() should do something about NUMA affinity?
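
Concretely (hand-waving, untested), I mean something in the
neighborhood of:

```c
/* drivers/base/dd.c -- purely illustrative, not a real patch */
static int driver_probe_device(const struct device_driver *drv,
			       struct device *dev)
{
	int node = dev_to_node(dev);

	/*
	 * Hypothetically, the driver core could push every probe to a
	 * CPU near 'node' here (e.g. via queue_work_node()), covering
	 * both the sync path and the async_schedule_dev() path, so PCI
	 * would not need its own affinity handling.
	 */
	...
}
```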

Bjorn


Thread overview: 9+ messages
2025-12-27 11:33 [PATCH] PCI: Avoid work_on_cpu() in async probe workers Jinhui Guo
2025-12-29 17:20 ` Bjorn Helgaas
2025-12-29 18:08 ` Tejun Heo
2025-12-30 14:27   ` Jinhui Guo
2025-12-30 14:44     ` Jinhui Guo
2025-12-30 21:52     ` Bjorn Helgaas
2025-12-31  7:51       ` Jinhui Guo
2025-12-31 16:55         ` Bjorn Helgaas [this message]
2026-01-04 16:01           ` Danilo Krummrich
