Kernel KVM virtualization development
From: Samiullah Khawaja <skhawaja@google.com>
To: Pranjal Shrivastava <praan@google.com>
Cc: Baolu Lu <baolu.lu@linux.intel.com>,
	 David Woodhouse <dwmw2@infradead.org>,
	Joerg Roedel <joro@8bytes.org>, Will Deacon <will@kernel.org>,
	 Jason Gunthorpe <jgg@ziepe.ca>,
	Robin Murphy <robin.murphy@arm.com>,
	 Kevin Tian <kevin.tian@intel.com>,
	Alex Williamson <alex@shazbot.org>,
	 Shuah Khan <shuah@kernel.org>,
	iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
	 kvm@vger.kernel.org, Saeed Mahameed <saeedm@nvidia.com>,
	 Adithya Jayachandran <ajayachandra@nvidia.com>,
	Parav Pandit <parav@nvidia.com>,
	 Leon Romanovsky <leonro@nvidia.com>,
	William Tu <witu@nvidia.com>,
	 Pratyush Yadav <pratyush@kernel.org>,
	Pasha Tatashin <pasha.tatashin@soleen.com>,
	 David Matlack <dmatlack@google.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	 Chris Li <chrisl@kernel.org>, Vipin Sharma <vipinsh@google.com>,
	 YiFei Zhu <zhuyifei@google.com>
Subject: Re: [PATCH v2 07/16] iommu/vt-d: Implement device and iommu preserve/unpreserve ops
Date: Tue, 19 May 2026 18:26:03 +0000
Message-ID: <agyMsywb8KqBNu-S@google.com>
In-Reply-To: <agx2RW_jujXbsiea@google.com>

On Tue, May 19, 2026 at 02:40:05PM +0000, Pranjal Shrivastava wrote:
>On Mon, May 18, 2026 at 08:32:42PM +0000, Samiullah Khawaja wrote:
>> On Fri, May 08, 2026 at 02:36:56AM +0000, Samiullah Khawaja wrote:
>> > On Thu, May 07, 2026 at 02:25:14PM +0800, Baolu Lu wrote:
>> > > On 4/28/26 01:56, Samiullah Khawaja wrote:
>> > > > Add implementation of the device and iommu preservation in a separate
>> > > > file. Also set the device and iommu preserve/unpreserve ops in the
>> > > > struct iommu_ops.
>> > > >
>> > > > During normal shutdown the iommu translation is disabled. Since the root
>> > > > table is preserved during live update, it needs to be cleaned up and the
>> > > > context entries of the unpreserved devices need to be cleared.
>> > >
>> > > This is not related to preserve/unpreserve ops and could be made in a
>> > > separated patch?
>> >
>> > Agreed. I will move this stuff to a separate patch.
>> > >
>> > > >
>> > > > Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
>> > > > ---
>> > > > MAINTAINERS                      |   1 +
>> > > > drivers/iommu/intel/Makefile     |   1 +
>> > > > drivers/iommu/intel/iommu.c      |  52 +++++++++++-
>> > > > drivers/iommu/intel/iommu.h      |  28 +++++++
>> > > > drivers/iommu/intel/liveupdate.c | 139 +++++++++++++++++++++++++++++++
>> > > > drivers/iommu/iommu.c            |  18 ++++
>> > > > include/linux/iommu-liveupdate.h |  10 +++
>> > > > include/linux/iommu.h            |  14 ++++
>> > > > include/linux/kho/abi/iommu.h    |  18 ++++
>> > > > 9 files changed, 277 insertions(+), 4 deletions(-)
>> > > > create mode 100644 drivers/iommu/intel/liveupdate.c
>> > > >
>>
>> [snip]
>> > >
>> > > > +{
>> > > > +	struct context_entry *context;
>> > > > +	int ret;
>> > > > +	int i;
>> > > > +
>> > > > +	for (i = 0; i < ROOT_ENTRY_NR; i++) {
>> > > > +		/*
>> > > > +		 * Alloc the context tables now to make sure the iommu unit is
>> > > > +		 * properly preserved. These might stay unused and wastes around
>> > > > +		 * 32MB max in scalable mode.
>> > > > +		 */
>> > >
>> > > Instead of allocating and preserving context tables for all root entries
>> > > (as noted, can waste up to 32MB), could we restrict this only to the
>> > > entries possibly in use by active PCI devices?
>> >
>> > I think the hotplug devices or VFs created through SR-IOV will be missed
>> > that way. Let's say device A is preserved and the associated iommu is
>> > also preserved. And then a new device B is hotplugged and preserved,
>> > then the context table for that will be missed.
>>
>> Ok I thought about it a little more and basically we have following
>> things to consider when we preserve context tables,
>>
>> - The devices can be hotplugged and preserved, so the context tables of
>>   those need to be preserved if we don't allocate all of them first time
>>   we preserve iommu, as done here.
>> - New context tables can be added (after hotplug) for unpreserved
>>   devices. And if we don't get another iommu preserve call after these
>>   are added, those remain unpreserved, so during shutdown those entries
>>   need to be removed from root table or preserved for simplicity.
>>
>> To solve this we can,
>>
>> 1. Either preserve the new context table when it is added for a preserved
>>   iommu. This can be done in iommu_context_addr(). This is simpler and
>>   no tracking needed.
>>
>> 2. Or track the preserved context tables using a bitmap and then preserve
>>   them incrementally whenever a device is preserved. On shutdown during
>>   cleanup, we can clear the entries for unpreserved context tables from
>>   root table.
>>
>> I am inclined towards second option. WDYT?
>
>Thinking out loud here, I agree that shifting away from the 32MB

Wait, it should be 2MB (256 buses * 8K (lower + upper context tables));
I miswrote it earlier.
>pre-allocation is the right direction. I'm wondering if we can avoid the
>overhead of introducing a new tracking bitmap (Option 2) altogether?

ROOT_ENTRY_NR is 256 and each root entry has 2 context tables in
scalable mode, so the bitmap is at most 512 bits, i.e. eight u64. It is
per IOMMU, but still reasonable.
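
To make the arithmetic above concrete, here is a small standalone
sketch (plain userspace C, not kernel code). The SKETCH_ names are made
up for illustration, and the 4K context table size is an assumption
derived from the "8K (lower + upper)" figure:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative constants only; the SKETCH_ names are hypothetical. */
#define SKETCH_ROOT_ENTRY_NR      256u  /* one root entry per bus */
#define SKETCH_CTX_TABLES_PER_BUS 2u    /* lower + upper table, scalable mode */
#define SKETCH_CTX_TABLE_SIZE     4096u /* assumed: 4K per context table */
#define SKETCH_BITS_PER_U64       64u

/* The tracking bitmap should pack into whole u64 words. */
static_assert((SKETCH_ROOT_ENTRY_NR * SKETCH_CTX_TABLES_PER_BUS) %
	      SKETCH_BITS_PER_U64 == 0, "bitmap must fill whole u64 words");

/* Worst-case memory if every context table is allocated up front. */
static inline size_t sketch_max_ctx_table_bytes(void)
{
	return (size_t)SKETCH_ROOT_ENTRY_NR * SKETCH_CTX_TABLES_PER_BUS *
	       SKETCH_CTX_TABLE_SIZE;
}

/* One bit per possible context table. */
static inline size_t sketch_bitmap_bits(void)
{
	return (size_t)SKETCH_ROOT_ENTRY_NR * SKETCH_CTX_TABLES_PER_BUS;
}

/* Number of u64 words needed to hold the bitmap, per IOMMU. */
static inline size_t sketch_bitmap_words(void)
{
	return sketch_bitmap_bits() / SKETCH_BITS_PER_U64;
}
```

So the worst case is 2MB of context tables versus 64 bytes of bitmap
per IOMMU.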
>
>Since the IOMMU serialization is a strict dependency for device tracking,
>could we move the context table preservation directly into the device
>level op: intel_iommu_preserve_device()?
>
>Whenever a specific device is preserved on-demand:
>
>1. It queries the parent IOMMU to fetch the allocated context table
>   backing its info->bus.
>
>2. It calls iommu_preserve_page(context) for that table. Because KHO's
>   tracking handles duplicates, this should be fine if multiple devices
>   reside on the same bus...
>
>Regarding Scalable Mode, we could just need a simple check in that path:
>
>
>/* intel_iommu_preserve_device */
>/* Preserve the primary/lower context table backing this bus */
>context = iommu_context_addr(info->iommu, info->bus, 0, 0);
>if (context)
>	iommu_preserve_page(context);
>
>/* If scalable mode is active, preserve the upper context table as well */
>if (sm_supported(info->iommu)) {
>	context = iommu_context_addr(info->iommu, info->bus, 0x80, 0);
>	if (context)
>		iommu_preserve_page(context);
>}
>
>WDYT?

I was thinking along the same lines and hinted at it in my earlier
reply to Baolu. But I didn't go for it, because it introduces either a
bunch of complexity or lost state, depending on which variant we take.

As I mentioned, the unpreserved context tables need to be removed from
the root table during shutdown. We can find out which ones are
unpreserved by walking the preserved device list, yes. But that
complicates the overall preservation lifecycle of these context tables
by introducing multiple walks of the preserved device list. On every
device unpreserve we would have to decide whether to unpreserve the
context table or whether some other preserved device is still using it.
Or, if we choose to keep it preserved even after the device is
unpreserved, then during cleanup we have no way to tell whether a
context table was unpreserved and needs clearing in the root table.

The bitmap approach is much simpler and only preserves the context
tables for the active devices, as suggested by Baolu earlier. Note that
with this approach we preserve the context tables of all active
devices, not only the preserved ones. But since these are recreated
after kexec anyway, it is not really a waste compared to preserving the
maximum possible.
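
Roughly, the bitmap tracking could look like the following standalone
sketch. This is a userspace model only, not the actual patch: struct
sketch_iommu, the table_present[] array standing in for root table
entries, and all sketch_ helpers are hypothetical, and the real code
would call the KHO/iommu page preservation APIs where the bit is set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_ROOT_ENTRY_NR 256u
#define SKETCH_NR_CTX_TABLES (SKETCH_ROOT_ENTRY_NR * 2u) /* lower + upper */
#define SKETCH_BITMAP_WORDS  (SKETCH_NR_CTX_TABLES / 64u)

/* Hypothetical per-IOMMU state; table_present[] models the root table. */
struct sketch_iommu {
	bool table_present[SKETCH_NR_CTX_TABLES];
	uint64_t preserved[SKETCH_BITMAP_WORDS]; /* one bit per context table */
};

static inline unsigned int sketch_table_index(uint8_t bus, bool upper)
{
	return (unsigned int)bus * 2u + (upper ? 1u : 0u);
}

static inline bool sketch_test_bit(const uint64_t *map, unsigned int idx)
{
	assert(idx < SKETCH_NR_CTX_TABLES);
	return (map[idx / 64u] >> (idx % 64u)) & 1u;
}

static inline void sketch_set_bit(uint64_t *map, unsigned int idx)
{
	assert(idx < SKETCH_NR_CTX_TABLES);
	map[idx / 64u] |= (uint64_t)1 << (idx % 64u);
}

/*
 * Called whenever a device is preserved: preserve every allocated
 * context table that is not yet marked. Returns how many were newly
 * preserved, so repeated calls are cheap and idempotent.
 */
static inline unsigned int sketch_preserve_new_tables(struct sketch_iommu *iommu)
{
	unsigned int idx, count = 0;

	for (idx = 0; idx < SKETCH_NR_CTX_TABLES; idx++) {
		if (!iommu->table_present[idx] ||
		    sketch_test_bit(iommu->preserved, idx))
			continue;
		/* Real code would preserve the backing page here. */
		sketch_set_bit(iommu->preserved, idx);
		count++;
	}
	return count;
}

/*
 * Shutdown cleanup: drop root table entries whose context table was
 * never preserved. Returns how many entries were cleared.
 */
static inline unsigned int sketch_cleanup_unpreserved(struct sketch_iommu *iommu)
{
	unsigned int idx, count = 0;

	for (idx = 0; idx < SKETCH_NR_CTX_TABLES; idx++) {
		if (iommu->table_present[idx] &&
		    !sketch_test_bit(iommu->preserved, idx)) {
			iommu->table_present[idx] = false;
			count++;
		}
	}
	return count;
}
```

The point is that neither path needs to walk the preserved device
list: preserve only looks at what is allocated and not yet marked, and
cleanup only looks at what is allocated and unmarked.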
>
>>
>> I think we will have to do similar stuff for PASID also down the road to
>> preserve pasid_tables in PASID directory.
>> >
>> > Since we don't track the context_tables that are preserved, there is no
>> > way to incrementally preserve the new ones. Let me look into the behaviour
>> > of KHO, maybe we can make the preserve call idempotent and do these
>> > incrementally.
>> > >
>> > > > +		spin_lock(&iommu->lock);
>> > > > +		context = iommu_context_addr(iommu, i, 0, 1);
>> > > > +		spin_unlock(&iommu->lock);
>> > > > +		if (!context) {
>> > > > +			ret = -ENOMEM;
>> > > > +			goto error;
>> > > > +		}
>>
>[snip]
>
>Thanks,
>Praan

Sami
