linux-pci.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Thomas Gleixner <tglx@linutronix.de>
To: Chris Li <chrisl@kernel.org>, Bjorn Helgaas <bhelgaas@google.com>,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	"Rafael J. Wysocki" <rafael@kernel.org>,
	Danilo Krummrich <dakr@kernel.org>, Len Brown <lenb@kernel.org>
Cc: linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org,
	linux-acpi@vger.kernel.org, David Matlack <dmatlack@google.com>,
	Pasha Tatashin <tatashin@google.com>,
	Jason Miu <jasonmiu@google.com>,
	Vipin Sharma <vipinsh@google.com>,
	Saeed Mahameed <saeedm@nvidia.com>,
	Adithya Jayachandran <ajayachandra@nvidia.com>,
	Parav Pandit <parav@nvidia.com>, William Tu <witu@nvidia.com>,
	Mike Rapoport <rppt@kernel.org>, Chris Li <chrisl@kernel.org>,
	Jason Gunthorpe <jgg@ziepe.ca>, Leon Romanovsky <leon@kernel.org>
Subject: Re: [PATCH RFC 20/25] PCI/LUO: Avoid write to liveupdate devices at boot
Date: Mon, 28 Jul 2025 19:23:03 +0200	[thread overview]
Message-ID: <87zfconsaw.ffs@tglx> (raw)
In-Reply-To: <20250728-luo-pci-v1-20-955b078dd653@kernel.org>

On Mon, Jul 28 2025 at 01:24, Chris Li wrote:
> The liveupdate devices are already initialized by the kernel before the
> kexec. During the kexec the device is still running. Avoid write to the
> liveupdate devices during the new kernel boot up.

This change log is way too meager for this kind of change.

 1) You want to explain in detail how this works.

    "initialized by the kernel before the kexec" is as vague as it gets.

 2) Avoid write ....

    Again this lacks any information how this is supposed to work correctly.

>  drivers/pci/ats.c            |  7 ++--
>  drivers/pci/iov.c            | 58 ++++++++++++++++++------------
>  drivers/pci/msi/msi.c        | 32 ++++++++++++-----
>  drivers/pci/msi/pcidev_msi.c |  4 +--
>  drivers/pci/pci-acpi.c       |  3 ++
>  drivers/pci/pci.c            | 85 +++++++++++++++++++++++++++++---------------
>  drivers/pci/pci.h            |  9 ++++-
>  drivers/pci/pcie/aspm.c      |  7 ++--
>  drivers/pci/pcie/pme.c       | 11 ++++--
>  drivers/pci/probe.c          | 43 +++++++++++++++-------
>  drivers/pci/setup-bus.c      | 10 +++++-

Then you sprinkle this stuff into files, which have completely different
purposes, without any explanation for the particular instances why they
are supposed to be correct and how this works.

I'm just looking at the MSI parts, as I have no expertise with the rest.

> diff --git a/drivers/pci/msi/msi.c b/drivers/pci/msi/msi.c
> index 6ede55a7c5e652c80b51b10e58f0290eb6556430..7c40fde1ba0f89ad1d72064ac9e80696faeab426 100644
> --- a/drivers/pci/msi/msi.c
> +++ b/drivers/pci/msi/msi.c
> @@ -113,7 +113,8 @@ static int pci_setup_msi_context(struct pci_dev *dev)
>  
>  void pci_msi_update_mask(struct msi_desc *desc, u32 clear, u32 set)
>  {
> -	raw_spinlock_t *lock = &to_pci_dev(desc->dev)->msi_lock;
> +	struct pci_dev *pci_dev = to_pci_dev(desc->dev);
> +	raw_spinlock_t *lock = &pci_dev->msi_lock;
>  	unsigned long flags;
>  
>  	if (!desc->pci.msi_attrib.can_mask)
> @@ -122,8 +123,9 @@ void pci_msi_update_mask(struct msi_desc *desc, u32 clear, u32 set)
>  	raw_spin_lock_irqsave(lock, flags);
>  	desc->pci.msi_mask &= ~clear;
>  	desc->pci.msi_mask |= set;
> -	pci_write_config_dword(msi_desc_to_pci_dev(desc), desc->pci.mask_pos,
> -			       desc->pci.msi_mask);
> +	if (!pci_lu_adopt(pci_dev))
> +		pci_write_config_dword(pci_dev, desc->pci.mask_pos,
> +				       desc->pci.msi_mask);

This results in inconsistent state, which is a bad idea to begin
with. How is cached software state and hardware state going to be
brought in sync at some point?

If you analyzed all places, which actually depend on hardware state and
make decisions based on it, for correctness, then you failed to provide
that analysis. If not, no comment.

>  	raw_spin_unlock_irqrestore(lock, flags);
>  }
>  
> @@ -190,6 +192,9 @@ static inline void pci_write_msg_msi(struct pci_dev *dev, struct msi_desc *desc,
>  	int pos = dev->msi_cap;
>  	u16 msgctl;
>  
> +	if (pci_lu_adopt(dev))
> +		return;
> +
>  	pci_read_config_word(dev, pos + PCI_MSI_FLAGS, &msgctl);
>  	msgctl &= ~PCI_MSI_FLAGS_QSIZE;
>  	msgctl |= FIELD_PREP(PCI_MSI_FLAGS_QSIZE, desc->pci.msi_attrib.multiple);
> @@ -214,6 +219,8 @@ static inline void pci_write_msg_msix(struct msi_desc *desc, struct msi_msg *msg
>  
>  	if (desc->pci.msi_attrib.is_virtual)
>  		return;
> +	if (pci_lu_adopt(to_pci_dev(desc->dev)))
> +		return;

So you don't allow the new kernel to write the MSI message, but the
interrupt subsystem has this new message and there are places which
utilize that cached message. How is this supposed to work?

>  	/*
>  	 * The specification mandates that the entry is masked
>  	 * when the message is modified:
> @@ -279,7 +286,8 @@ static void pci_msi_set_enable(struct pci_dev *dev, int enable)
>  	control &= ~PCI_MSI_FLAGS_ENABLE;
>  	if (enable)
>  		control |= PCI_MSI_FLAGS_ENABLE;
> -	pci_write_config_word(dev, dev->msi_cap + PCI_MSI_FLAGS, control);
> +	if (!pci_lu_adopt(dev))
> +		pci_write_config_word(dev, dev->msi_cap + PCI_MSI_FLAGS, control);

The placement of these conditionals is arbitrary. Some are the begin of
a function, others just block the write. Is that based on some logic or
were the places selected by shabby AI queries?

>  static int msi_setup_msi_desc(struct pci_dev *dev, int nvec,
> @@ -553,6 +561,7 @@ static void pci_msix_clear_and_set_ctrl(struct pci_dev *dev, u16 clear, u16 set)
>  {
>  	u16 ctrl;
>  
> +	BUG_ON(pci_lu_adopt(dev));

Not going to happen. BUG() is only appropriate when there is absolutely
no way to handle a situation. This is as undocumented as everything else
here.

>  	pci_read_config_word(dev, dev->msix_cap + PCI_MSIX_FLAGS, &ctrl);
>  	ctrl &= ~clear;
>  	ctrl |= set;
> @@ -720,8 +729,9 @@ static int msix_capability_init(struct pci_dev *dev, struct msix_entry *entries,
>  	 * registers can be accessed.  Mask all the vectors to prevent
>  	 * interrupts coming in before they're fully set up.
>  	 */
> -	pci_msix_clear_and_set_ctrl(dev, 0, PCI_MSIX_FLAGS_MASKALL |
> -				    PCI_MSIX_FLAGS_ENABLE);
> +	if (!pci_lu_adopt(dev))
> +		pci_msix_clear_and_set_ctrl(dev, 0, PCI_MSIX_FLAGS_MASKALL |
> +					    PCI_MSIX_FLAGS_ENABLE);

And for enhanced annoyance you sprinkle this condition everywhere into
the code and then BUG() when you missed an instance. Because putting it
into the function which is invoked a gazillion of times would be too
obvious, right? That would at least be tasteful, but that's not the
primary problem of all this.

Sprinkling these conditionals all over the place is absolutely
unmaintainable, error prone and burdens everyone with this insanity and
the related hard to chase bugs.

Especially as there is no concept behind this and zero documentation how
any of this should work or even be remotely correct.

Before you start the next hackery, please sit down and write up coherent
explanations:

  What is the general concept of this?

  What is the exact state in which a device is left when the old kernel
  jumps into the new kernel?

  What is the state of the MSI[-X] or legacy PCI interrupts at this
  point?

  Can the device raise interrupts during the transition from the old to
  the new kernel?

  How is the "live" state of the device reflected and restored
  throughout the interrupt subsystem?

  How is the device driver supposed to attach to the same interrupt
  state as before?

  How are the potentially different Linux interrupt numbers mapped to
  the previous state?

Before this materializes and is agreed on, this is not going anywhere.

Thanks,

        tglx

  reply	other threads:[~2025-07-28 17:23 UTC|newest]

Thread overview: 34+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-07-28  8:24 [RFC PATCH 00/25] Live Update Orchestrator: PCI subsystem Chris Li
2025-07-28  8:24 ` [PATCH RFC 01/25] PCI/LUO: Register with Liveupdate Orchestrator Chris Li
2025-07-28  8:24 ` [PATCH RFC 02/25] PCI/LUO: Add struct dev_liveupdate Chris Li
2025-07-28  8:24 ` [PATCH RFC 03/25] PCI/LUO: Create requested liveupdate device list Chris Li
2025-07-28  8:24 ` [PATCH RFC 04/25] PCI/LUO: Forward prepare()/freeze()/cancel() callbacks to driver Chris Li
2025-07-28  8:24 ` [PATCH RFC 05/25] PCI/LUO: Restore state at PCI enumeration Chris Li
2025-07-28  8:24 ` [PATCH RFC 06/25] PCI/LUO: Forward finish callbacks to drivers Chris Li
2025-07-28  8:24 ` [PATCH RFC 07/25] PCI/LUO: Save and restore driver name Chris Li
2025-07-28  8:24 ` [PATCH RFC 08/25] PCI/LUO: Add liveupdate to pcieport driver Chris Li
2025-07-28  8:24 ` [PATCH RFC 09/25] PCI/LUO: Save SR-IOV number of VF Chris Li
2025-07-28  8:24 ` [PATCH RFC 10/25] PCI/LUO: Add pci_liveupdate_get_driver_data() Chris Li
2025-07-28  8:24 ` [PATCH RFC 11/25] PCI: pci-lu-stub: Add a stub driver for Live Update testing Chris Li
2025-07-28  8:24 ` [PATCH RFC 12/25] PCI/LUO: Save struct pci_dev info during prepare phase chrisl
2025-07-28  8:24 ` [PATCH RFC 13/25] PCI/LUO: Check the device function numbers in restoration chrisl
2025-07-28  8:24 ` [PATCH RFC 14/25] PCI/LUO: Restore power state of a PCI device chrisl
2025-07-28  8:24 ` [PATCH RFC 15/25] PCI/LUO: Restore PM related fields chrisl
2025-07-28  8:24 ` [PATCH RFC 16/25] PCI/LUO: Restore the pme_poll flag chrisl
2025-07-28  8:24 ` [PATCH RFC 17/25] PCI/LUO: Restore the no_d3cold flag chrisl
2025-07-28  8:24 ` [PATCH RFC 18/25] PCI/LUO: Restore pci_dev fields during probe chrisl
2025-07-28  8:24 ` [PATCH RFC 19/25] PCI/LUO: Track liveupdate buses Chris Li
2025-07-28  8:24 ` [PATCH RFC 20/25] PCI/LUO: Avoid write to liveupdate devices at boot Chris Li
2025-07-28 17:23   ` Thomas Gleixner [this message]
2025-07-28 23:50     ` Jason Gunthorpe
2025-07-30  4:13       ` Chris Li
2025-07-30  1:51     ` Chris Li
2025-07-31 15:01       ` Jason Gunthorpe
2025-08-01 23:04         ` Chris Li
2025-08-02 13:50           ` Jason Gunthorpe
2025-08-07  0:50             ` Chris Li
2025-07-28  8:24 ` [PATCH RFC 21/25] PCI/LUO: Save and restore the PCI resource chrisl
2025-07-28  8:24 ` [PATCH RFC 22/25] PCI/LUO: Save PCI bus and host bridge states chrisl
2025-07-28  8:24 ` [PATCH RFC 23/25] PCI/LUO: Check the PCI bus state after restoration chrisl
2025-07-28  8:24 ` [PATCH RFC 24/25] PCI: pci-lu-pf-stub: Add a PF stub driver for Live Update testing Chris Li
2025-07-28  8:24 ` [PATCH RFC 25/25] PCI/LUO: Clean up PCI_SER_GET() chrisl

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87zfconsaw.ffs@tglx \
    --to=tglx@linutronix.de \
    --cc=ajayachandra@nvidia.com \
    --cc=bhelgaas@google.com \
    --cc=chrisl@kernel.org \
    --cc=dakr@kernel.org \
    --cc=dmatlack@google.com \
    --cc=gregkh@linuxfoundation.org \
    --cc=jasonmiu@google.com \
    --cc=jgg@ziepe.ca \
    --cc=lenb@kernel.org \
    --cc=leon@kernel.org \
    --cc=linux-acpi@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-pci@vger.kernel.org \
    --cc=parav@nvidia.com \
    --cc=rafael@kernel.org \
    --cc=rppt@kernel.org \
    --cc=saeedm@nvidia.com \
    --cc=tatashin@google.com \
    --cc=vipinsh@google.com \
    --cc=witu@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).