All of lore.kernel.org
 help / color / mirror / Atom feed
From: Niklas Schnelle <schnelle@linux.ibm.com>
To: Benjamin Block <bblock@linux.ibm.com>,
	Alexander Gordeev <agordeev@linux.ibm.com>,
	Gerd Bayer <gbayer@linux.ibm.com>,
	Bjorn Helgaas <bhelgaas@google.com>,
	Vasily Gorbik <gor@linux.ibm.com>,
	Heiko Carstens <hca@linux.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>,
	Christian Borntraeger	 <borntraeger@linux.ibm.com>,
	Andreas Krebbel <krebbel@linux.ibm.com>,
	linux-pci	 <linux-pci@vger.kernel.org>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	Ionut Nechita <ionut_n2001@yahoo.com>,
	Tobias Schumacher <ts@linux.ibm.com>,
	Sven Schnelle <svens@linux.ibm.com>,
	Matthew Rosato <mjrosato@linux.ibm.com>,
	linux-s390 <linux-s390@vger.kernel.org>,
	Julian Ruess	 <julianr@linux.ibm.com>,
	Ionut Nechita <ionut.nechita@windriver.com>,
	Farhan Ali <alifm@linux.ibm.com>
Subject: Re: [PATCH v2 3/3] s390/pci: Fix circular/recursive deadlocks in PCI-bus and -device release
Date: Thu, 12 Mar 2026 22:02:59 +0100	[thread overview]
Message-ID: <df10e28bd86227dfd4830c34da8c68976df5e938.camel@linux.ibm.com> (raw)
In-Reply-To: <354b9e4a54ced67f3c89df198041df19434fe4c8.1773235561.git.bblock@linux.ibm.com>

On Wed, 2026-03-11 at 14:27 +0100, Benjamin Block wrote:
> When removing PCI device or PCI bus objects there are a couple of
> call-chains where it is possible that the kernel runs into a circular or
> recursive deadlock involving taking the central `pci_rescan_remove_lock`.

Now that this depends on Ionut's patch at[0] and exploits the fact that
the rescan/remove lock can be taken recursively we may want to adjust
and thereby shorten this description to take this into account dropping
the recursive locking as deadlock reason.

> 
> For example:
> 
> (A) Thread α receives a CRW event notifying the kernel that a PCI

Nit: Other PCI related code calls these "PCI events" even though
technically they use the Channel Report Word (CRW) mechanism and thus
are CRW events.

> virtual function has been moved into Reserved state, and so the PCI
> subsystem will try to remove that PCI function. The call-chain for that
> looks like this:
>   __zpci_event_availability()
>     -> zpci_zdev_put()              # will   lock(zpci_add_remove_lock),
>                                     # and    lock(zpci_list_lock)
>     -> zpci_release_device()        # will unlock(zpci_list_lock)
>     -> zpci_cleanup_bus_resources() # will   lock(pci_rescan_remove_lock)
> 
> Thread β is triggered by userspace writing 0 into the SysFS attribute
> `sriov_numvfs` of the parent PCI physical function of the same function
> we just try to remove. This will also try to release the PCI virtual
> function; but this time the call-chain looks like this:
>   sriov_numvfs_store()              # will   lock(pci_rescan_remove_lock)

With Ionut's patch a prerequisite the above taking of
pci_rescan_remove_lock now happens in sriov_del_vfs() with a bit
shorter chain but same result.

>     -> ... (deep chain)
>     -> pci_release_dev()
>     -> pcibios_release_device()
>     -> zpci_zdev_put()              # will   lock(zpci_add_remove_lock)
> 
> If thread α and β coincide, this will result in a cyclic deadlock.
> 
> (B) Consider thread β from above; if the thread was to follow the same
> outlined call-chain for thread α, and not fall into the cyclic deadlock,
> it would recursive deadlock:
>   sriov_numvfs_store()              # will   lock(pci_rescan_remove_lock)
>     -> ... (deep chain)
>     -> pci_release_dev()
>     -> pcibios_release_device()
>     -> zpci_zdev_put()              # will   lock(zpci_add_remove_lock),
>                                     # and    lock(zpci_list_lock)
>     -> zpci_release_device()        # will unlock(zpci_list_lock)
>     -> zpci_cleanup_bus_resources() # will   lock(pci_rescan_remove_lock)

Since Ionut's patch must come before this, this behavior itself would
be okay (see argument above). Maybve just drop this part?

> 
> (C) And even in case `pci_rescan_remove_lock` was removed from
> zpci_cleanup_bus_resources(), the resulting call-chain would recursive
> deadlock when it tries to release the associated PCI bus:
>   sriov_numvfs_store()              # will   lock(pci_rescan_remove_lock)

Same as above after Ionut's change the lock is taken in sriov_del_vfs()

>     -> ... (deep chain)
>     -> pci_release_dev()
>     -> pcibios_release_device()
>     -> zpci_zdev_put()              # will   lock(zpci_add_remove_lock),
>                                     # and    lock(zpci_list_lock)
>     -> zpci_release_device()        # will unlock(zpci_list_lock)
>     -> zpci_bus_device_unregister()
>     -> zpci_bus_put()               # will   lock(zbus_list_lock)
>     -> zpci_bus_release()           # will unlock(zbus_list_lock),
>                                     # will   lock(pci_rescan_remove_lock)

Same argument as above for dropping the recursive deadlock part but the
below cyclic deadlock potential remains.

> 
> It can also easily be seen that scenario (C) offers the same risk as (A)
> for a cyclic deadlock in cases where `pci_rescan_remove_lock` is first
> locked, and the PCI bus released under its protection.

I think this part should spell out which two locks create the cycle

> 
> `pci_rescan_remove_lock` has to be and is taken at a "high level" in
> most call-chains since it is intended to protect/mutual exclude all
> rescan and/or removal actions taken in the PCI subsystem. So to prevent
> the outlined deadlock scenarios above remove it instead from the "low
> level" release function for both the PCI device and PCI bus objects.
> 
> Instead, lock `pci_rescan_remove_lock` in all call-chains leading to
> those release functions:
>   * initialization of the PCI subsystem;
>   * processing of availability events (CRWs) for PCI functions;
>   * processing of error events (CRWs) for PCI functions;
>   * architecture specific release PCI device implementation.
> 
> Additionally, remove `pci_rescan_remove_lock` from zpci_bus_scan_bus()
> since it is now always already taken when called.

Nit: I'd add an explicit statement that the locking order between
pci_rescan_remove_lock and zpci_add_remove_lock is thus reversed with
pci_rescan_remove_lockk always taken first then zpci_add_remove_lock.
Still it is understandable without it too.

> 
> Lastly, document the new locking expectations after these changes. Add
> sparse and lockdep annotations to functions that previously locked
> `pci_rescan_remove_lock` explicitly, making sure the lock is now
> already held when called. Additionally also add the annotations to
> zpci_zdev_put() and zpci_bus_put() to make sure that every function that
> potentially drops the last reference already holds the lock to prevent
> surprises.
> 
> Fixes: 05bc1be6db4b2 ("s390/pci: create zPCI bus")
> Fixes: ab909509850b2 ("PCI: s390: Fix use-after-free of PCI resources with per-function hotplug")
> Signed-off-by: Benjamin Block <bblock@linux.ibm.com>
> ---
>  arch/s390/pci/pci.c       | 11 ++++++++---
>  arch/s390/pci/pci_bus.c   | 15 ++++++++-------
>  arch/s390/pci/pci_event.c | 28 +++++++++++++++++++---------
>  arch/s390/pci/pci_iov.c   |  3 +--
>  arch/s390/pci/pci_sysfs.c |  9 +++------
>  5 files changed, 39 insertions(+), 27 deletions(-)
> 

I had not previously looked at your commit message from the point of
view when it is ordered aferr Ionut's patch[0], which it has to for the
release case. So I ended up with a few comments now even though I
previously read it multiple times and was just very happy with the nice
and readable call chain and locking order explanations.

Either way the code itself looks good to me and I also did quite a bit
of testing with this and the previous versions. Trying hot(un)plug and
SR-IOV with various PCI devices and also just ran this and the previous
version on several test systems for a few days doing other testing and
development work. 

I also put quite a bit of effort into trying to modify this and/or
other s390 PCI code to not require the reentrant behavior of the
pci_rescan_remove_lock but failed. The really tricky part to me is how
e.g. a driver can hold on to a struct pci_dev even after it has been
removed while we have to keep both the struct zpci_dev and struct
zpci_bus and associated struct pci_bus exactly until the last
pci_dev_put(). And then needing to remove the PCI bus via
zpci_bus_release() we need to have the pci_rescan_remove lock but now
this can happen, in principle, at any pci_dev_put() and that of course
sometimes happens when we already have the pci_rescan_remove_lock and
sometimes without it.

So feel free to add my:

Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>
Tested-by: Niklas Schnelle <schnelle@linux.ibm.com>

Thanks,
Niklas

[0]https://lore.kernel.org/lkml/20260310074303.17480-2-ionut.nechita@windriver.com/

  reply	other threads:[~2026-03-12 21:04 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-03-11 13:27 [PATCH v2 0/3] PCI: s390/pci: Fix deadlocks on s390 when releasing zPCI-bus or -device objects Benjamin Block
2026-03-11 13:27 ` [PATCH v2 1/3] PCI: Move declaration of pci_rescan_remove_lock into public pci.h Benjamin Block
2026-03-12 19:41   ` Niklas Schnelle
2026-03-13 13:32     ` Benjamin Block
2026-03-11 13:27 ` [PATCH v2 2/3] PCI: Provide lock guard for pci_rescan_remove_lock Benjamin Block
2026-03-12 19:44   ` Niklas Schnelle
2026-03-13 13:32     ` Benjamin Block
2026-03-11 13:27 ` [PATCH v2 3/3] s390/pci: Fix circular/recursive deadlocks in PCI-bus and -device release Benjamin Block
2026-03-12 21:02   ` Niklas Schnelle [this message]
2026-03-13 13:59     ` Benjamin Block

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=df10e28bd86227dfd4830c34da8c68976df5e938.camel@linux.ibm.com \
    --to=schnelle@linux.ibm.com \
    --cc=agordeev@linux.ibm.com \
    --cc=alifm@linux.ibm.com \
    --cc=bblock@linux.ibm.com \
    --cc=bhelgaas@google.com \
    --cc=borntraeger@linux.ibm.com \
    --cc=gbayer@linux.ibm.com \
    --cc=gerald.schaefer@linux.ibm.com \
    --cc=gor@linux.ibm.com \
    --cc=hca@linux.ibm.com \
    --cc=ionut.nechita@windriver.com \
    --cc=ionut_n2001@yahoo.com \
    --cc=julianr@linux.ibm.com \
    --cc=krebbel@linux.ibm.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-pci@vger.kernel.org \
    --cc=linux-s390@vger.kernel.org \
    --cc=mjrosato@linux.ibm.com \
    --cc=svens@linux.ibm.com \
    --cc=ts@linux.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.