linux-pci.vger.kernel.org archive mirror
From: Chris Li <chrisl@kernel.org>
To: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>,
	Jason Gunthorpe <jgg@ziepe.ca>,
	 Bjorn Helgaas <bhelgaas@google.com>,
	"Rafael J. Wysocki" <rafael@kernel.org>,
	 Danilo Krummrich <dakr@kernel.org>, Len Brown <lenb@kernel.org>,
	linux-kernel@vger.kernel.org,  linux-pci@vger.kernel.org,
	linux-acpi@vger.kernel.org,  David Matlack <dmatlack@google.com>,
	Pasha Tatashin <tatashin@google.com>,
	 Jason Miu <jasonmiu@google.com>,
	Vipin Sharma <vipinsh@google.com>,
	 Saeed Mahameed <saeedm@nvidia.com>,
	Adithya Jayachandran <ajayachandra@nvidia.com>,
	 Parav Pandit <parav@nvidia.com>, William Tu <witu@nvidia.com>,
	Mike Rapoport <rppt@kernel.org>,
	 Leon Romanovsky <leon@kernel.org>
Subject: Re: [PATCH v2 06/10] PCI/LUO: Save and restore driver name
Date: Thu, 2 Oct 2025 15:30:24 -0700
Message-ID: <CACePvbWw9G=y_cycWFMXxRbmuAE8yFCM0Z3y=Ojw30ENDkDL-g@mail.gmail.com>
In-Reply-To: <2025100225-abridge-shifty-3d50@gregkh>

On Wed, Oct 1, 2025 at 11:09 PM Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
> Just keeping a device "alive" while rebooting into the same exact kernel
> image seems odd to me given that this is almost never what people
> actually do.  They update their kernel with the weekly stable release to
> get the new bugfixes (remember we fix 13 CVEs a day), and away you go.
> You are saying that this workload would not actually be supported, so
> why do you want live update at all?  Who needs this?

I saw Pasha reply to a lot of your questions. I can take a stab at who
needs it. Others, feel free to add to or correct me. The major cloud
vendors (you know the usual suspects) that provide GPUs to VMs will
want it.

The use case is that the VM is controlled by the customer. The cloud
provider has a contract that limits the maintenance downtime for the
VM, let's say to X seconds of maintenance downtime per year. When
upgrading the host kernel, the VM can typically be migrated to another
host without much interruption, so it does not take much from the
downtime budget. However, when a GPU is attached to the VM and is
running ML jobs, there is no good way to migrate that GPU context to
another machine.

Instead, we can do a liveupdate of the host kernel. During the
liveupdate, the old kernel saves the liveupdate state. The VM is paused
in memory while the GPU, as a PCI device, is kept running, so the ML
jobs are still up. The old kernel then kexecs into the new kernel
version, which restores and reconstructs the software side of the
device state. The VM re-attaches to the file descriptor to get its
previous context back. In the end, the VM resumes running on the new
kernel while the GPU keeps running the ML job.

From the VM's point of view, there are Y seconds during the kexec in
which the VM does not respond. The GPU did not lose its context and
the VM did not reboot. The benefit is that Y seconds is much less than
the time needed to reboot the VM and restart the GPU ML jobs, so Y can
fit into the X seconds of maintenance downtime per year in the service
contract.

Hope that explanation makes sense to you.

Chris
