public inbox for linuxppc-dev@ozlabs.org
 help / color / mirror / Atom feed
From: Bjorn Helgaas <helgaas@kernel.org>
To: Narayana Murty N <nnmlinux@linux.ibm.com>
Cc: mahesh@linux.ibm.com, oohall@gmail.com, maddy@linux.ibm.com,
	mpe@ellerman.id.au, npiggin@gmail.com,
	christophe.leroy@csgroup.eu, bhelgaas@google.com,
	tpearson@raptorengineering.com, linuxppc-dev@lists.ozlabs.org,
	linux-kernel@vger.kernel.org, vaibhav@linux.ibm.com,
	sbhat@linux.ibm.com, ganeshgr@linux.ibm.com
Subject: Re: [PATCH v2 1/1] powerpc/eeh: fix recursive pci_lock_rescan_remove locking in EEH event handling
Date: Wed, 10 Dec 2025 15:46:40 -0600	[thread overview]
Message-ID: <20251210214640.GA3541600@bhelgaas> (raw)
In-Reply-To: <20251210142559.8874-1-nnmlinux@linux.ibm.com>

On Wed, Dec 10, 2025 at 08:25:59AM -0600, Narayana Murty N wrote:
> The recent commit 1010b4c012b0 ("powerpc/eeh: Make EEH driver device
> hotplug safe") restructured the EEH driver to improve synchronization
> with the PCI hotplug layer.
> 
> However, it inadvertently moved pci_lock_rescan_remove() outside its
> intended scope in eeh_handle_normal_event(), leading to broken PCI
> error reporting and improper EEH event triggering. Specifically,
> eeh_handle_normal_event() acquired pci_lock_rescan_remove() before
> calling eeh_pe_bus_get(), but eeh_pe_bus_get() itself attempts to
> acquire the same lock internally, causing nested locking and disrupting
> normal EEH event handling paths.
> 
> This patch adds a boolean parameter do_lock to _eeh_pe_bus_get(),
> with two public wrappers:
>     eeh_pe_bus_get() with locking enabled.
>     eeh_pe_bus_get_nolock() that skips locking.
> 
> Callers that already hold pci_lock_rescan_remove() now use
> eeh_pe_bus_get_nolock() to avoid recursive lock acquisition.
> 
> Additionally, pci_lock_rescan_remove() calls are restored to the correct
> position—after eeh_pe_bus_get() and immediately before iterating affected
> PEs and devices. This ensures EEH-triggered PCI removes occur under proper
> bus rescan locking without recursive lock contention.
> 
> The eeh_pe_loc_get() function has been split into two functions:
>     eeh_pe_loc_get(struct eeh_pe *pe) which retrieves the loc for given PE.
>     eeh_pe_loc_get_bus(struct pci_bus *bus) which retrieves the location
>     code for given bus.
> 
> This resolves lockdep warnings such as:
> <snip>
> [   84.964298] [    T928] ============================================
> [   84.964304] [    T928] WARNING: possible recursive locking detected
> [   84.964311] [    T928] 6.18.0-rc3 #51 Not tainted
> [   84.964315] [    T928] --------------------------------------------
> [   84.964320] [    T928] eehd/928 is trying to acquire lock:
> [   84.964324] [    T928] c000000003b29d58 (pci_rescan_remove_lock){+.+.}-{3:3}, at: pci_lock_rescan_remove+0x28/0x40
> [   84.964342] [    T928]
>                        but task is already holding lock:
> [   84.964347] [    T928] c000000003b29d58 (pci_rescan_remove_lock){+.+.}-{3:3}, at: pci_lock_rescan_remove+0x28/0x40
> [   84.964357] [    T928]
>                        other info that might help us debug this:
> [   84.964363] [    T928]  Possible unsafe locking scenario:
> 
> [   84.964367] [    T928]        CPU0
> [   84.964370] [    T928]        ----
> [   84.964373] [    T928]   lock(pci_rescan_remove_lock);
> [   84.964378] [    T928]   lock(pci_rescan_remove_lock);
> [   84.964383] [    T928]
>                        *** DEADLOCK ***
> 
> [   84.964388] [    T928]  May be due to missing lock nesting notation
> 
> [   84.964393] [    T928] 1 lock held by eehd/928:
> [   84.964397] [    T928]  #0: c000000003b29d58 (pci_rescan_remove_lock){+.+.}-{3:3}, at: pci_lock_rescan_remove+0x28/0x40
> [   84.964408] [    T928]
>                        stack backtrace:
> [   84.964414] [    T928] CPU: 2 UID: 0 PID: 928 Comm: eehd Not tainted 6.18.0-rc3 #51 VOLUNTARY
> [   84.964417] [    T928] Hardware name: IBM,9080-HEX POWER10 (architected) 0x800200 0xf000006 of:IBM,FW1060.00 (NH1060_022) hv:phyp pSeries
> [   84.964419] [    T928] Call Trace:
> [   84.964420] [    T928] [c0000011a7157990] [c000000001705de4] dump_stack_lvl+0xc8/0x130 (unreliable)
> [   84.964424] [    T928] [c0000011a71579d0] [c0000000002f66e0] print_deadlock_bug+0x430/0x440
> [   84.964428] [    T928] [c0000011a7157a70] [c0000000002fd0c0] __lock_acquire+0x1530/0x2d80
> [   84.964431] [    T928] [c0000011a7157ba0] [c0000000002fea54] lock_acquire+0x144/0x410
> [   84.964433] [    T928] [c0000011a7157cb0] [c0000011a7157cb0] __mutex_lock+0xf4/0x1050
> [   84.964436] [    T928] [c0000011a7157e00] [c000000000de21d8] pci_lock_rescan_remove+0x28/0x40
> [   84.964439] [    T928] [c0000011a7157e20] [c00000000004ed98] eeh_pe_bus_get+0x48/0xc0
> [   84.964442] [    T928] [c0000011a7157e50] [c000000000050434] eeh_handle_normal_event+0x64/0xa60
> [   84.964446] [    T928] [c0000011a7157f30] [c000000000051de8] eeh_event_handler+0xf8/0x190
> [   84.964450] [    T928] [c0000011a7157f90] [c0000000002747ac] kthread+0x16c/0x180
> [   84.964453] [    T928] [c0000011a7157fe0] [c00000000000ded8] start_kernel_thread+0x14/0x18

I have no comment on the patch itself, but the timestamps above aren't
relevant and could be removed.  Maybe also the "T928" part.

Bjorn


  reply	other threads:[~2025-12-10 21:47 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-12-10 14:25 [PATCH v2 1/1] powerpc/eeh: fix recursive pci_lock_rescan_remove locking in EEH event handling Narayana Murty N
2025-12-10 21:46 ` Bjorn Helgaas [this message]
2025-12-15  6:35   ` Narayana Murty N
2025-12-11 15:45 ` Timothy Pearson
2025-12-17  5:01   ` Narayana Murty N
2025-12-22  5:35     ` Nilay Shroff
2026-01-05  9:47       ` Narayana Murty N
2026-01-07  6:22   ` Narayana Murty N
2026-01-14  3:04 ` Madhavan Srinivasan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20251210214640.GA3541600@bhelgaas \
    --to=helgaas@kernel.org \
    --cc=bhelgaas@google.com \
    --cc=christophe.leroy@csgroup.eu \
    --cc=ganeshgr@linux.ibm.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linuxppc-dev@lists.ozlabs.org \
    --cc=maddy@linux.ibm.com \
    --cc=mahesh@linux.ibm.com \
    --cc=mpe@ellerman.id.au \
    --cc=nnmlinux@linux.ibm.com \
    --cc=npiggin@gmail.com \
    --cc=oohall@gmail.com \
    --cc=sbhat@linux.ibm.com \
    --cc=tpearson@raptorengineering.com \
    --cc=vaibhav@linux.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox