From: Yi Liu <yi.l.liu@intel.com>
To: "Duan, Zhenzhong" <zhenzhong.duan@intel.com>,
	"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>
Cc: "alex.williamson@redhat.com" <alex.williamson@redhat.com>,
	"clg@redhat.com" <clg@redhat.com>,
	"mst@redhat.com" <mst@redhat.com>,
	"jasowang@redhat.com" <jasowang@redhat.com>,
	"clement.mathieu--drif@eviden.com"
	<clement.mathieu--drif@eviden.com>,
	"eric.auger@redhat.com" <eric.auger@redhat.com>,
	"joao.m.martins@oracle.com" <joao.m.martins@oracle.com>,
	"avihaih@nvidia.com" <avihaih@nvidia.com>,
	"Hao,  Xudong" <xudong.hao@intel.com>,
	"Cabiddu, Giovanni" <giovanni.cabiddu@intel.com>,
	"Gross, Mark" <mark.gross@intel.com>,
	"Van De Ven, Arjan" <arjan.van.de.ven@intel.com>
Subject: Re: [PATCH 4/5] intel_iommu: Optimize unmap_bitmap during migration
Date: Wed, 15 Oct 2025 20:57:31 +0800
Message-ID: <9fabdf4a-a781-491a-bbd2-40e51462b8e7@intel.com>
In-Reply-To: <IA3PR11MB9136AEE8F0C3A989C964E35092E8A@IA3PR11MB9136.namprd11.prod.outlook.com>

On 2025/10/15 15:48, Duan, Zhenzhong wrote:
> 
> 
>> -----Original Message-----
>> From: Liu, Yi L <yi.l.liu@intel.com>
>> Subject: Re: [PATCH 4/5] intel_iommu: Optimize unmap_bitmap during
>> migration
>>
>> On 2025/10/14 10:31, Duan, Zhenzhong wrote:
>>>
>>>
>>>> -----Original Message-----
>>>> From: Liu, Yi L <yi.l.liu@intel.com>
>>>> Subject: Re: [PATCH 4/5] intel_iommu: Optimize unmap_bitmap during
>>>> migration
>>>>
>>>> On 2025/10/13 10:50, Duan, Zhenzhong wrote:
>>>>>
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Liu, Yi L <yi.l.liu@intel.com>
>>>>>> Subject: Re: [PATCH 4/5] intel_iommu: Optimize unmap_bitmap during
>>>>>> migration
>>>>>>
>>>>>> On 2025/9/10 10:37, Zhenzhong Duan wrote:
>>>>>>> If a VFIO device in guest switches from IOMMU domain to block
>>>>>>> domain, vtd_address_space_unmap() is called to unmap whole
>>>>>>> address space.
>>>>>>>
>>>>>>> If that happens during migration, migration fails with legacy VFIO
>>>>>>> backend as below:
>>>>>>>
>>>>>>> Status: failed (vfio_container_dma_unmap(0x561bbbd92d90,
>>>>>>> 0x100000000000, 0x100000000000) = -7 (Argument list too long))
>>>>>>
>>>>>> this should be a giant and busy VM. right? Is a fix tag needed by
>>>>>> the way?
>>>>>
>>>>> VM size is unrelated, it's not a bug, it's just that the current
>>>>> code doesn't work well with migration.
>>>>>
>>>>> When the device switches from IOMMU domain to block domain, the
>>>>> whole iommu memory region is disabled; this triggers the unmap of
>>>>> the whole iommu memory region,
>>>>
>>>> I got this part.
>>>>
>>>>> no matter how many or how large the mappings are in the iommu MR.
>>>>
>>>> hmmm. A more explicit question: does this error happen with 4G VM
>>>> memory as well?
>>>
>>> Coincidentally, I remember the QAT team reported this issue just
>>> with 4G VM memory.
>>
>> ok. this might happen with legacy vIOMMU as guest triggers map/unmap.
>> It can be a large range. But it's still not clear to me how the guest
>> can map a range of more than 4G if the VM only has 4G memory.
> 
> It happens when the guest switches from DMA domain to block domain,
> the below sequence is triggered:
> 
> vtd_context_device_invalidate
> 	vtd_address_space_sync
> 		vtd_address_space_unmap
> 
> You can see the whole iommu address space is unmapped; it's unrelated
> to the actual mappings in the guest.

got it.
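
To put numbers on it (a standalone sanity check, not QEMU code): the
chunk in the failing unmap above is 2^44 bytes, i.e. 16 TiB, which
reflects the vIOMMU address space rather than guest RAM:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* iova/size reported by the failing vfio_container_dma_unmap() */
    uint64_t iova = 0x100000000000ULL;
    uint64_t size = 0x100000000000ULL;

    printf("iova = 0x%llx (%llu TiB), size = %llu TiB\n",
           (unsigned long long)iova,
           (unsigned long long)(iova >> 40),
           (unsigned long long)(size >> 40));
    return 0;
}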

>>
>>>
>>>>
>>>>>>
>>>>>>>
>>>>>>> Because legacy VFIO limits maximum bitmap size to 256MB which maps to
>>>>>>> 8TB on 4K page system, when 16TB sized UNMAP notification is sent,
>>>>>>> unmap_bitmap ioctl fails.
>>>>>>>
>>>>>>> There is no such limitation with iommufd backend, but it's still not
>>>>>>> optimal to allocate large bitmap.
>>>>>>>
>>>>>>> Optimize it by iterating over DMAMap list to unmap each range with
>>>>>>> active mapping when migration is active. If migration is not active,
>>>>>>> unmapping the whole address space in one go is optimal.
>>>>>>>
>>>>>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>>>>>>> Tested-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com>
>>>>>>> ---
>>>>>>>      hw/i386/intel_iommu.c | 42 ++++++++++++++++++++++++++++++++++++++++++
>>>>>>>      1 file changed, 42 insertions(+)
>>>>>>>
>>>>>>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>>>>>>> index 83c5e44413..6876dae727 100644
>>>>>>> --- a/hw/i386/intel_iommu.c
>>>>>>> +++ b/hw/i386/intel_iommu.c
>>>>>>> @@ -37,6 +37,7 @@
>>>>>>>      #include "system/system.h"
>>>>>>>      #include "hw/i386/apic_internal.h"
>>>>>>>      #include "kvm/kvm_i386.h"
>>>>>>> +#include "migration/misc.h"
>>>>>>>      #include "migration/vmstate.h"
>>>>>>>      #include "trace.h"
>>>>>>>
>>>>>>> @@ -4423,6 +4424,42 @@ static void vtd_dev_unset_iommu_device(PCIBus *bus, void *opaque, int devfn)
>>>>>>>          vtd_iommu_unlock(s);
>>>>>>>      }
>>>>>>>
>>>>>>> +/*
>>>>>>> + * Unmapping a large range in one go is not optimal during migration because
>>>>>>> + * a large dirty bitmap needs to be allocated while there may be only small
>>>>>>> + * mappings, iterate over DMAMap list to unmap each range with active mapping.
>>>>>>> + */
>>>>>>> +static void vtd_address_space_unmap_in_migration(VTDAddressSpace *as,
>>>>>>> +                                                 IOMMUNotifier *n)
>>>>>>> +{
>>>>>>> +    const DMAMap *map;
>>>>>>> +    const DMAMap target = {
>>>>>>> +        .iova = n->start,
>>>>>>> +        .size = n->end,
>>>>>>> +    };
>>>>>>> +    IOVATree *tree = as->iova_tree;
>>>>>>> +
>>>>>>> +    /*
>>>>>>> +     * DMAMap is created during IOMMU page table sync, it's either 4KB or huge
>>>>>>> +     * page size and always a power of 2 in size. So the range of DMAMap could
>>>>>>> +     * be used for UNMAP notification directly.
>>>>>>> +     */
>>>>>>> +    while ((map = iova_tree_find(tree, &target))) {
>>>>>>
>>>>>> how about an empty iova_tree? If the guest has not mapped anything
>>>>>> for the device, the tree is empty. And it is fine to not unmap
>>>>>> anything. While, if the device is attached to an identity domain,
>>>>>> the iova_tree is empty as well. Are we sure that we need not unmap
>>>>>> anything here? It looks like the answer is yes. But I'm suspecting
>>>>>> the unmap failure will happen on the vfio side? If yes, we need to
>>>>>> consider a complete fix. :)
>>>>>
>>>>> I don't get what failure will happen, could you elaborate?
>>>>> In case of an identity domain, the IOMMU memory region is disabled
>>>>> and no iommu notifier will ever be triggered. vfio_listener monitors
>>>>> the memory address space; if any memory region is disabled,
>>>>> vfio_listener will catch it and do dirty tracking.
>>>>
>>>> My question comes from the reason why the DMA unmap fails. It is
>>>> because a big range is given to the kernel while the kernel does not
>>>> support it. So if VFIO gives a big range as well, it should fail as
>>>> well. And this is possible when the guest (a VM with a large memory
>>>> size) switches from an identity domain to a paging domain. In this
>>>> case, vfio_listener will unmap all the system MRs. And it can be a
>>>> big range if the VM size is big enough.
>>>
>>> Got your point. Yes, currently the vfio_type1 driver limits
>>> unmap_bitmap to 8TB size. If guest memory is large enough to lead to
>>> a memory region of more than 8TB size, unmap_bitmap will fail. It's a
>>> rare case to live migrate a VM with more than 8TB memory; instead of
>>> fixing it in qemu with a complex change, I'd suggest bumping the below
>>> MACRO value to enlarge the limit in the kernel, or switching to
>>> iommufd which doesn't have such a limit.
>>
>> This limit shall not affect the usage of device dirty tracking. right?
>> If yes, add something to tell the user that using the iommufd backend
>> is better. e.g. if memory size is bigger than vfio iommu type1's dirty
>> bitmap limit (query cap_mig.max_dirty_bitmap_size), then fail the user
>> if the user wants migration capability.
> 
> Do you mean just dirty tracking instead of migration, like dirtyrate?
> In that case, there is an error print as above; I think that's enough
> as a hint?

it's not related to dirtyrate.

> I guess you mean to add a migration blocker if the limit is reached?
> It's hard because the limit is only helpful for the identity domain;
> a DMA domain in the guest doesn't have such a limit, and we can't know
> the guest's choice of domain type for each attached VFIO device.

I meant a blocker to boot QEMU if there is a limit. Something like below:

	if (VM memory > 8TB && legacy_container_backend && migration_enabled)
		fail the VM boot.
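
As a rough sketch of that check (the function and its inputs below are
hypothetical placeholders for values QEMU would query at machine init;
only error_setg() is the usual QEMU error API):

/* Hypothetical boot-time blocker; all inputs are placeholders. */
#define VFIO_TYPE1_DIRTY_TRACK_MAX (8ULL << 40) /* 8 TiB, per the macro below */

static bool vtd_check_dirty_track_limit(uint64_t ram_size,
                                        bool legacy_container,
                                        bool migration_enabled,
                                        Error **errp)
{
    if (ram_size > VFIO_TYPE1_DIRTY_TRACK_MAX &&
        legacy_container && migration_enabled) {
        error_setg(errp, "RAM size exceeds the vfio type1 dirty bitmap "
                   "limit; use the iommufd backend or disable migration");
        return false; /* caller fails the VM boot */
    }
    return true;
}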

>>
>>> /*
>>>  * Input argument of number of bits to bitmap_set() is unsigned integer, which
>>>  * further casts to signed integer for unaligned multi-bit operation,
>>>  * __bitmap_set().
>>>  * Then maximum bitmap size supported is 2^31 bits divided by 2^3 bits/byte,
>>>  * that is 2^28 (256 MB) which maps to 2^31 * 2^12 = 2^43 (8TB) on 4K page
>>>  * system.
>>>  */
>>> #define DIRTY_BITMAP_PAGES_MAX	 ((u64)INT_MAX)
>>> #define DIRTY_BITMAP_SIZE_MAX	 DIRTY_BITMAP_BYTES(DIRTY_BITMAP_PAGES_MAX)
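
i.e., rounding INT_MAX up to 2^31 for round numbers, the arithmetic in
that comment checks out (again a standalone check, not kernel code):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t max_bits  = 1ULL << 31;     /* ~DIRTY_BITMAP_PAGES_MAX  */
    uint64_t max_bytes = max_bits >> 3;  /* 2^28 bytes = 256 MB      */
    uint64_t coverage  = max_bits << 12; /* 4K pages -> 2^43 = 8 TB  */

    printf("bitmap: %llu MB, covers: %llu TB\n",
           (unsigned long long)(max_bytes >> 20),
           (unsigned long long)(coverage >> 40));
    return 0;
}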
>>>
>>>>
>>>>>>
>>>>>>> +        IOMMUTLBEvent event;
>>>>>>> +
>>>>>>> +        event.type = IOMMU_NOTIFIER_UNMAP;
>>>>>>> +        event.entry.iova = map->iova;
>>>>>>> +        event.entry.addr_mask = map->size;
>>>>>>> +        event.entry.target_as = &address_space_memory;
>>>>>>> +        event.entry.perm = IOMMU_NONE;
>>>>>>> +        /* This field is meaningless for unmap */
>>>>>>> +        event.entry.translated_addr = 0;
>>>>>>> +        memory_region_notify_iommu_one(n, &event);
>>>>>>> +
>>>>>>> +        iova_tree_remove(tree, *map);
>>>>>>> +    }
>>>>>>> +}
>>>>>>> +
>>>>>>>      /* Unmap the whole range in the notifier's scope. */
>>>>>>>      static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n)
>>>>>>>      {
>>>>>>> @@ -4432,6 +4469,11 @@ static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n)
>>>>>>>          IntelIOMMUState *s = as->iommu_state;
>>>>>>>          DMAMap map;
>>>>>>>
>>>>>>> +    if (migration_is_running()) {
>>>>>>
>>>>>> If the range is not big enough, it is still better to unmap in one
>>>>>> go. right? If so, might add a check on the range here to go to the
>>>>>> iova_tree iteration conditionally.
>>>>>
>>>>> We don't want to dirty track IOVA holes between IOVA ranges because
>>>>> it's time-consuming and useless work. The hole may be large
>>>>> depending on guest behavior. Meanwhile, the time spent iterating
>>>>> over the iova_tree is trivial. So we prefer tracking the exact iova
>>>>> ranges that may actually be dirty.
>>>>
>>>> I see. So this is the optimization. And it also works around (WA)
>>>> the above DMA unmap issue as well. right? If so, you may want to
>>>> call it out in the commit message.
>>>
>>> Yes, the main purpose of this patch is to fix the unmap_bitmap issue,
>>> then the optimization.
>>> I'll rephrase the description and subject.
>>
>> yes. The commit message gives me the impression this is a bug fix,
>> while the subject says optimization. BTW, perhaps calling it an
>> optimization is clearer since this smells more like an optimization.
>> For a fix, I guess you may need to consider the vfio_listener as well.
> 
> Do you have any idea to fix it with vfio_listener?

no good idea.

> My understanding is that it's hard to fix this issue from the vfio
> core, because vfio_listener doesn't know the mapping details in the
> guest; only the vIOMMU caches them through DMAMap.

yes. that's why I suggested the above fix/WA to avoid it.

Regards,
Yi Liu



Thread overview: 36+ messages
2025-09-10  2:36 [PATCH 0/5] vfio: relax the vIOMMU check Zhenzhong Duan
2025-09-10  2:36 ` [PATCH 1/5] vfio/iommufd: Add framework code to support getting dirty bitmap before unmap Zhenzhong Duan
2025-09-19  9:30   ` Cédric Le Goater
2025-09-22  3:17     ` Duan, Zhenzhong
2025-09-22  8:27       ` Cédric Le Goater
2025-09-23  2:45         ` Duan, Zhenzhong
2025-09-23  7:06           ` Cédric Le Goater
2025-09-23  9:50             ` Duan, Zhenzhong
2025-09-10  2:36 ` [PATCH 2/5] vfio/iommufd: Query dirty bitmap before DMA unmap Zhenzhong Duan
2025-09-10  2:36 ` [PATCH 3/5] vfio/iommufd: Add IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR flag support Zhenzhong Duan
2025-09-19 10:27   ` Joao Martins
2025-09-22  5:49     ` Duan, Zhenzhong
2025-09-22 16:02       ` Cédric Le Goater
2025-09-22 16:06         ` Joao Martins
2025-09-22 17:01           ` Joao Martins
2025-09-23  2:50             ` Duan, Zhenzhong
2025-09-23  9:17               ` Joao Martins
2025-09-23  9:55                 ` Duan, Zhenzhong
2025-09-23 10:06                   ` Duan, Zhenzhong
2025-09-23  2:47         ` Duan, Zhenzhong
2025-10-09 10:20         ` Duan, Zhenzhong
2025-10-09 12:43           ` Cédric Le Goater
2025-10-10  4:09             ` Duan, Zhenzhong
2025-09-10  2:37 ` [PATCH 4/5] intel_iommu: Optimize unmap_bitmap during migration Zhenzhong Duan
2025-10-12 10:31   ` Yi Liu
2025-10-13  2:50     ` Duan, Zhenzhong
2025-10-13 12:56       ` Yi Liu
2025-10-14  2:31         ` Duan, Zhenzhong
2025-10-14  6:46           ` Yi Liu
2025-10-15  7:48             ` Duan, Zhenzhong
2025-10-15 12:57               ` Yi Liu [this message]
2025-10-16  8:48                 ` Duan, Zhenzhong
2025-10-16  9:53                   ` Yi Liu
2025-10-16 16:24                     ` Cédric Le Goater
2025-10-17  2:31                       ` Duan, Zhenzhong
2025-09-10  2:37 ` [PATCH 5/5] vfio/migration: Allow live migration with vIOMMU without VFs using device dirty tracking Zhenzhong Duan
