* [Hypervisor Live Update] Notes from June 1, 2026
@ 2026-06-07 16:06 Pasha Tatashin
From: Pasha Tatashin @ 2026-06-07 16:06 UTC (permalink / raw)
  To: Alexander Graf, Andersen, Tycho, Anthony Yznaga, Baolu Lu,
	David Hildenbrand, David Matlack, Heyne, Maximilian,
	James Gowans, Jason Gunthorpe, Mike Rapoport, Pankaj Gupta,
	Pasha Tatashin, Pratyush Yadav, Praveen Kumar, Vipin Sharma,
	Vishal Annapurve, Woodhouse, David, Luca Boccassi,
	Samiullah Khawaja, Jork Loeser
  Cc: linux-mm, kexec, linux-kernel

Hi everybody,

Here are the notes from the Hypervisor Live Update call that happened on 
Monday, June 1. Thanks to everybody who was involved!

These notes are intended to bring people up to speed who could not 
attend the call as well as keep the conversation going in between 
meetings.

----->o-----
LPC 2026 Call for Proposals

The Call for Proposals for the Live Update Microconference at LPC 2026 
is officially open. Please submit your topics and proposals before the 
deadline on July 24th.

https://lore.kernel.org/all/ahcc3Qyuy7Oy03Iq@plex

----->o-----
KHO Xarray Implementation & Core Data Structures

Pratyush is collaborating with Mike on a KHO fallback allocation 
strategy for memblock. Alongside this, Pratyush is designing a 
serialized, sparse "KHO Xarray" data structure to lift current mapping 
restrictions across all three memfd types (shared, hugeTLB, and 
guest_memfd). By allowing runtime page faults and allocation tracking 
post-preservation, this avoids flat vmalloc array scalability 
limitations.
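
To make the "serialized, sparse" idea concrete, here is a minimal userspace sketch. It is not the KHO Xarray design: every name, the two-level layout, and the fixed arena size are assumptions made for illustration. The one property it tries to show is that nodes refer to each other by number rather than by pointer, so the used part of the arena can be copied or handed over as raw bytes and still be walked afterwards, while unpopulated index ranges cost no leaf nodes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch only; none of these names come from the KHO series. */
#define SLOT_BITS 6
#define SLOTS     (1u << SLOT_BITS)   /* 64 slots per node */
#define MAX_NODES 32                  /* fixed arena, to keep the sketch small */

struct node {
	uint64_t slot[SLOTS];  /* root: child node number; leaf: stored value */
};

struct sparse_table {
	uint32_t nr_nodes;     /* nodes in use; node 0 is always the root */
	struct node nodes[MAX_NODES];
};

static void table_init(struct sparse_table *t)
{
	memset(t, 0, sizeof(*t));
	t->nr_nodes = 1;       /* reserve the root */
}

/* Store a non-zero value at index; 0 on success, -1 if out of range or full. */
static int table_store(struct sparse_table *t, uint32_t index, uint64_t value)
{
	uint32_t hi = index >> SLOT_BITS;
	uint32_t lo = index & (SLOTS - 1);
	uint64_t child;

	if (hi >= SLOTS || value == 0)
		return -1;

	child = t->nodes[0].slot[hi];
	if (child == 0) {      /* leaf not populated yet: take the next free node */
		if (t->nr_nodes == MAX_NODES)
			return -1;
		child = t->nr_nodes++;
		t->nodes[0].slot[hi] = child;
	}
	t->nodes[child].slot[lo] = value;
	return 0;
}

/* Return the value at index, or 0 if nothing was stored there. */
static uint64_t table_load(const struct sparse_table *t, uint32_t index)
{
	uint32_t hi = index >> SLOT_BITS;
	uint64_t child;

	if (hi >= SLOTS)
		return 0;
	child = t->nodes[0].slot[hi];
	if (child == 0)
		return 0;
	assert(child < t->nr_nodes);  /* sanity check on handed-over data */
	return t->nodes[child].slot[index & (SLOTS - 1)];
}

/* Bytes that actually need to be preserved: header plus nodes in use. */
static size_t table_used_bytes(const struct sparse_table *t)
{
	return offsetof(struct sparse_table, nodes) +
	       (size_t)t->nr_nodes * sizeof(struct node);
}
```

Storing at indices 5 and 4000 populates only two leaf nodes, and copying table_used_bytes() bytes into another buffer leaves lookups working because no address is stored anywhere in the structure.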

Potential wider use cases for the KHO Xarray were discussed:
- MSHV sparse bitmap tracking.
- IOMMU page table tracking (Samiullah will evaluate domain/device tree 
  association fit).
- PCI/VFIO sparse tracking via Bus/Device/Function (BDF) key spaces.
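
As an illustration of what a BDF key space could look like, the sketch below packs a Bus/Device/Function triple into a 16-bit key using the standard PCI field widths (8-bit bus, 5-bit device, 3-bit function). The struct and helper names are hypothetical, the PCI segment (domain) is left out for brevity, and whether the eventual series keys on this value is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper types; not taken from any posted series. */
struct bdf {
	uint8_t bus;  /* 0..255 */
	uint8_t dev;  /* 0..31  */
	uint8_t fn;   /* 0..7   */
};

/* Pack bus[15:8] | device[7:3] | function[2:0] into one sparse-index key. */
static uint16_t bdf_to_key(struct bdf a)
{
	assert(a.dev < 32 && a.fn < 8);
	return (uint16_t)((a.bus << 8) | (a.dev << 3) | a.fn);
}

/* Inverse of bdf_to_key(). */
static struct bdf key_to_bdf(uint16_t key)
{
	struct bdf a = {
		.bus = (uint8_t)(key >> 8),
		.dev = (uint8_t)((key >> 3) & 0x1f),
		.fn  = (uint8_t)(key & 0x7),
	};
	return a;
}
```

With this layout all functions of one device are adjacent keys and all devices on one bus share the upper byte, so a system with a handful of devices touches only a few regions of the 64K key space.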

Slab/Cache Preservation vs. Linked Blocks:
David Matlack noted that using an Xarray page per PCI device would be 
too expensive given their small struct sizes. Pratyush suggested 
preserving slab caches via dedicated kmem_cache flags to manage small, 
arbitrary allocations. As an immediate alternative, Pasha's ongoing LUO 
limits refactor series introduces a highly compact block-linked list 
structure optimized for runtime file/session tracking. David Matlack 
will review if this fits the PCI core tracking requirements.
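
For readers unfamiliar with the pattern, a block-linked list can be sketched roughly as follows. This is a generic userspace illustration, not the structure from the LUO series: the names and the block capacity are made up, and a structure that has to survive kexec would presumably link blocks by physical address rather than by pointer. The point is only that many small entries share one block, so the per-entry overhead is far below one page per device.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define BLOCK_ENTRIES 7   /* arbitrary; a real block would likely fill a page */

/* Hypothetical layout: a small header followed by a dense array of entries. */
struct block {
	struct block *next;
	uint32_t count;                 /* entries used in this block */
	uint64_t token[BLOCK_ENTRIES];
};

struct block_list {
	struct block *head, *tail;
	size_t nr_entries, nr_blocks;
};

/* Append a token, allocating a new block only when the tail is full. */
static int block_list_add(struct block_list *l, uint64_t token)
{
	struct block *b = l->tail;

	if (!b || b->count == BLOCK_ENTRIES) {
		b = calloc(1, sizeof(*b));
		if (!b)
			return -1;
		if (l->tail)
			l->tail->next = b;
		else
			l->head = b;
		l->tail = b;
		l->nr_blocks++;
	}
	b->token[b->count++] = token;
	l->nr_entries++;
	return 0;
}

/* Linear scan over all blocks; returns 1 if the token is present. */
static int block_list_contains(const struct block_list *l, uint64_t token)
{
	for (const struct block *b = l->head; b; b = b->next)
		for (uint32_t i = 0; i < b->count; i++)
			if (b->token[i] == token)
				return 1;
	return 0;
}

/* Release every block and reset the list to empty. */
static void block_list_free(struct block_list *l)
{
	struct block *b = l->head;

	while (b) {
		struct block *next = b->next;

		free(b);
		b = next;
	}
	l->head = l->tail = NULL;
	l->nr_entries = l->nr_blocks = 0;
}
```

Adding 15 tokens with a capacity of 7 per block allocates three blocks (7 + 7 + 1), which is the trade-off being weighed against dedicating a full page to each small per-device struct.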

----->o-----
LUO Limit Removal & PCI Core Status

LUO Refactor: Pasha is updating the LUO series to address Pratyush's 
comments (primarily renaming iterator functions) and plans to send out 
v2 shortly. Given that LUO is not yet in fleet production, the group 
agreed to fast-track this into the upcoming merge window to align with 
systemd's fdstore integration.

PCI Core v6: David Matlack sent out v6 incorporating two critical fixes 
spotted by Sachiko regarding get/put semantics and double-retrieval 
failures. Review tags from the live update team are needed to help 
secure Bjorn's Ack once he returns from vacation next week.

----->o-----
IOMMU Persistence & Process Memory

IOMMU v3: Samiullah is addressing recent review feedback on the IOMMU 
persistence series and intends to post v3 by the end of this week. The 
associated development roadmap document has received positive 
stakeholder attention.

CRIU & vmsplice: Maximilian's investigation into optimizing vmsplice
for copy-less data preservation is deferred for now but stays in the
pipeline, with potential future collaboration with Google's tmpfs splice
efforts.

----->o-----
guest_memfd Enlightenment & VMM Documentation

Tarun debriefed the community on his upstream presentation regarding the 
initial guest_memfd preservation patch series (currently covering fully 
shared mappings with page-sized folios).

Key design and architecture alignments include:
- VM File Association: guest_memfd requires an active 'struct kvm' 
  context to be retrieved. VMMs must preserve the parent VM file 
  alongside guest_memfd, using LUO tokens to re-link them on the 
  incoming kernel path. This sets the stage for future private 
  mapping/secure EPT table tracking.
- Relaxed Fault Logic: The group agreed to drop strict upfront pre-fault 
  checks. Instead, standard runtime page-fault semantics will apply. If 
  a guest page fault occurs post-preservation, it will bubble up via 
  standard KVM_RUN ioctl exits to the VMM, which can safely pause vCPUs 
  and retry the fault post-kexec.
- Centralized VMM Documentation: Pasha and David Matlack proposed 
  creating a centralized guide under live_update/vmm detailing the 
  overall live update flow, timing constraints, and subsystem 
  requirements to assist external QEMU and VMM developers.

----->o-----
Next meeting will be on Monday, June 15 at 8am PDT (UTC-7), everybody is
welcome: https://meet.google.com/rjn-dmzu-hgq

Note: I am going to be traveling on June 15th, so David Matlack is going
to be hosting it.

Topics for the next meeting:
- Presentation of VFIO roadmap (Vipin and David Matlack)
- Status of KHO Xarray development and slab preservation feasibility
- Review of PCI core changes v7 and upstream merge coordination
- IOMMU persistence v3 review feedback
- Detailed review of guest_memfd v2 and VMM interaction documentation
- Review and coordination of LPC 2026 Microconference topic submissions
- later: KHO support for Confidential VMs including page table
  preservation and pinning
- later: versioning support for luod to negotiate
- later: KHO enlightenment for ASI
- later: update on PCI preservation series and next steps
- later: testing methodology to allow downstream consumers to qualify
  that live update works from one version to another
- later: reducing blackout window during live update, including deferred
  struct page initialization

Please let me know if you'd like to propose additional topics for
discussion, thank you!



* Re: [Hypervisor Live Update] Notes from June 1, 2026
@ 2026-07-03  9:02     ` Chenghao Duan
From: Chenghao Duan @ 2026-07-03  9:02 UTC (permalink / raw)
  To: Maximilian Heyne
  Cc: Pasha Tatashin, Alexander Graf, Andersen, Tycho, Anthony Yznaga,
	Baolu Lu, David Hildenbrand, David Matlack, James Gowans,
	Jason Gunthorpe, Mike Rapoport, Pankaj Gupta, Pratyush Yadav,
	Praveen Kumar, Vipin Sharma, Vishal Annapurve, Woodhouse, David,
	Luca Boccassi, Samiullah Khawaja, Jork Loeser, linux-mm, kexec,
	linux-kernel, jianghaoran

On Thu, Jul 02, 2026 at 12:00:16PM +0000, Maximilian Heyne wrote:
> On Thu, Jul 02, 2026 at 02:02:02PM +0800, Chenghao Duan wrote:
> > On Sun, Jun 07, 2026 at 12:06:01PM -0400, Pasha Tatashin wrote:
> > > Hi everybody,
> > > 
> > > Here are the notes from the Hypervisor Live Update call that happened on 
> > > Monday, June 1. Thanks to everybody who was involved!
> > > 
> > > These notes are intended to bring people up to speed who could not 
> > > attend the call as well as keep the conversation going in between 
> > > meetings.
> > > 
> > > ----->o-----
> > > LPC 2026 Call for Proposals
> > > 
> > > The Call for Proposals for the Live Update Microconference at LPC 2026 
> > > is officially open. Please submit your topics and proposals before the 
> > > deadline on July 24th.
> > > 
> > > https://lore.kernel.org/all/ahcc3Qyuy7Oy03Iq@plex
> > > 
> > > ----->o-----
> > > KHO Xarray Implementation & Core Data Structures
> > > 
> > > Pratyush is collaborating with Mike on a KHO fallback allocation 
> > > strategy for memblock. Alongside this, Pratyush is designing a 
> > > serialized, sparse "KHO Xarray" data structure to lift current mapping 
> > > restrictions across all three memfd types (shared, hugeTLB, and 
> > > guest_memfd). By allowing runtime page faults and allocation tracking 
> > > post-preservation, this avoids flat vmalloc array scalability 
> > > limitations.
> > > 
> > > Potential wider use cases for the KHO Xarray were discussed:
> > > - MSHV sparse bitmap tracking.
> > > - IOMMU page table tracking (Samiullah will evaluate domain/device tree 
> > >   association fit).
> > > - PCI/VFIO sparse tracking via Bus/Device/Function (BDF) key spaces.
> > > 
> > > Slab/Cache Preservation vs. Linked Blocks:
> > > David Matlack noted that using an Xarray page per PCI device would be 
> > > too expensive given their small struct sizes. Pratyush suggested 
> > > preserving slab caches via dedicated kmem_cache flags to manage small, 
> > > arbitrary allocations. As an immediate alternative, Pasha's ongoing LUO 
> > > limits refactor series introduces a highly compact block-linked list 
> > > structure optimized for runtime file/session tracking. David Matlack 
> > > will review if this fits the PCI core tracking requirements.
> > > 
> > > ----->o-----
> > > LUO Limit Removal & PCI Core Status
> > > 
> > > LUO Refactor: Pasha is updating the LUO series to address Pratyush's 
> > > comments (primarily renaming iterator functions) and plans to send out 
> > > v2 shortly. Given that LUO is not yet in fleet production, the group 
> > > agreed to fast-track this into the upcoming merge window to align with 
> > > systemd's fdstore integration.
> > > 
> > > PCI Core v6: David Matlack sent out v6 incorporating two critical fixes 
> > > spotted by Sachiko regarding get/put semantics and double-retrieval 
> > > failures. Review tags from the live update team are needed to help 
> > > secure Bjorn's Ack once he returns from vacation next week.
> > > 
> > > ----->o-----
> > > IOMMU Persistence & Process Memory
> > > 
> > > IOMMU v3: Samiullah is addressing recent review feedback on the IOMMU 
> > > persistence series and intends to post v3 by the end of this week. The 
> > > associated development roadmap document has received positive 
> > > stakeholder attention.
> > > 
> > > CRIU & vmsplice: Maximilian's investigation into optimizing vmsplice
> > > for copy-less data preservation is deferred for now but stays in the
> > > pipeline, with potential future collaboration with Google's tmpfs splice
> > > efforts.
> > 
> > I’ve also been researching a combination solution integrating CRIU and
> > KHO. My approach stores all image data dumped by CRIU into memfd, then
> > persists those memfd objects via KHO/LUO.
> 
> I've experimented with exactly the same approach, plus one addition:
> if a process already has memfds, don't dump them (to yet another
> memfd) but preserve them directly via KHO.
> 
> > 
> > I’ve reviewed the historical meeting notes and would like to clarify:
> > does the CRIU solution discussed in the meetings aim to save the full
> > set of a process’s metadata and data, or only the anonymous memory and
> > shared memory allocated during the process runtime?
> 
> I've been the only one bringing this up in the meetings and my idea is
> to enable a reboot with negligible process downtime. So save the state of
> a process, kexec and resume the process. Currently, preservation and
> restorations are quite slow when processes have a lot of anonymous
> memory as this needs to be moved to a preservable memfd first. So what
> I'm researching is how I can convert anonymous memory efficiently into
> something that can be preserved (currently memfd).
> 
> And to answer your question, I'd say save data and metadata in memfds,
> as the alternative would be to save the metadata on disk, which would
> be slow.
> 

Thank you very much for your explanation.

I know vmsplice enables zero-copy operations on user memory pages. When
used together with splice, this syscall can establish references between
user pages and file descriptors. Could this be the optimal solution to
our current performance bottlenecks?

Or are there any other better alternatives already available?

Chenghao
> > 
> > Chenghao
> > > 
> > > ----->o-----
> > > guest_memfd Enlightenment & VMM Documentation
> > > 
> > > Tarun debriefed the community on his upstream presentation regarding the 
> > > initial guest_memfd preservation patch series (currently covering fully 
> > > shared mappings with page-sized folios).
> > > 
> > > Key design and architecture alignments include:
> > > - VM File Association: guest_memfd requires an active 'struct kvm' 
> > >   context to be retrieved. VMMs must preserve the parent VM file 
> > >   alongside guest_memfd, using LUO tokens to re-link them on the 
> > >   incoming kernel path. This sets the stage for future private 
> > >   mapping/secure EPT table tracking.
> > > - Relaxed Fault Logic: The group agreed to drop strict upfront pre-fault 
> > >   checks. Instead, standard runtime page-fault semantics will apply. If 
> > >   a guest page fault occurs post-preservation, it will bubble up via 
> > >   standard KVM_RUN ioctl exits to the VMM, which can safely pause vCPUs 
> > >   and retry the fault post-kexec.
> > > - Centralized VMM Documentation: Pasha and David Matlack proposed 
> > >   creating a centralized guide under live_update/vmm detailing the 
> > >   overall live update flow, timing constraints, and subsystem 
> > >   requirements to assist external QEMU and VMM developers.
> > > 
> > > ----->o-----
> > > Next meeting will be on Monday, June 15 at 8am PDT (UTC-7), everybody is
> > > welcome: https://meet.google.com/rjn-dmzu-hgq
> > > 
> > > Note: I am going to be traveling on June 15th, David Matlack is going to
> > > be hosting it.
> > > 
> > > Topics for the next meeting:
> > > - Presentation of VFIO roadmap (Vipin and David Matlack)
> > > - Status of KHO Xarray development and slab preservation feasibility
> > > - Review of PCI core changes v7 and upstream merge coordination
> > > - IOMMU persistence v3 review feedback
> > > - Detailed review of guest_memfd v2 and VMM interaction documentation
> > > - Review and coordination of LPC 2026 Microconference topic submissions
> > > - later: KHO support for Confidential VMs including page table
> > >   preservation and pinning
> > > - later: versioning support for luod to negotiate
> > > - later: KHO enlightenment for ASI
> > > - later: update on PCI preservation series and next steps
> > > - later: testing methodology to allow downstream consumers to qualify
> > >   that live update works from one version to another
> > > - later: reducing blackout window during live update, including deferred
> > >   struct page initialization
> > > 
> > > Please let me know if you'd like to propose additional topics for
> > > discussion, thank you!

