* [PATCH v2 01/20] memory: Adjust event ranges to fit within notifier boundaries
2025-05-02 2:15 [PATCH v2 00/20] AMD vIOMMU: DMA remapping support for VFIO devices Alejandro Jimenez
@ 2025-05-02 2:15 ` Alejandro Jimenez
2025-05-11 18:31 ` Michael S. Tsirkin
` (2 more replies)
2025-05-02 2:15 ` [PATCH v2 02/20] amd_iommu: Document '-device amd-iommu' common options Alejandro Jimenez
` (21 subsequent siblings)
22 siblings, 3 replies; 54+ messages in thread
From: Alejandro Jimenez @ 2025-05-02 2:15 UTC (permalink / raw)
To: qemu-devel
Cc: pbonzini, richard.henderson, eduardo, peterx, david, philmd, mst,
marcel.apfelbaum, alex.williamson, vasant.hegde,
suravee.suthikulpanit, santosh.shukla, sarunkod, Wei.Huang2,
clement.mathieu--drif, ethan.milon, joao.m.martins,
boris.ostrovsky, alejandro.j.jimenez
Invalidating the entire address space (i.e. range of [0, ~0ULL]) is a
valid and required operation by vIOMMU implementations. However, such
invalidations currently trigger an assertion unless they originate from
device IOTLB invalidations.
Although in recent Linux guests this case is not exercised by the VTD
implementation due to various optimizations, the assertion will be hit
by upcoming AMD vIOMMU changes to support DMA address translation. More
specifically, when running a Linux guest with a VFIO passthrough device,
and a kernel that does not contain commit 3f2571fed2fa ("iommu/amd:
Remove redundant domain flush from attach_device()").
Remove the assertion altogether and adjust the range to ensure it does
not cross notifier boundaries.
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
---
system/memory.c | 10 +++-------
1 file changed, 3 insertions(+), 7 deletions(-)
diff --git a/system/memory.c b/system/memory.c
index 71434e7ad02c..7ad2fc098341 100644
--- a/system/memory.c
+++ b/system/memory.c
@@ -2021,13 +2021,9 @@ void memory_region_notify_iommu_one(IOMMUNotifier *notifier,
return;
}
- if (notifier->notifier_flags & IOMMU_NOTIFIER_DEVIOTLB_UNMAP) {
- /* Crop (iova, addr_mask) to range */
- tmp.iova = MAX(tmp.iova, notifier->start);
- tmp.addr_mask = MIN(entry_end, notifier->end) - tmp.iova;
- } else {
- assert(entry->iova >= notifier->start && entry_end <= notifier->end);
- }
+ /* Crop (iova, addr_mask) to range */
+ tmp.iova = MAX(tmp.iova, notifier->start);
+ tmp.addr_mask = MIN(entry_end, notifier->end) - tmp.iova;
if (event->type & notifier->notifier_flags) {
notifier->notify(notifier, &tmp);
--
2.43.5
^ permalink raw reply related [flat|nested] 54+ messages in thread
* Re: [PATCH v2 01/20] memory: Adjust event ranges to fit within notifier boundaries
2025-05-02 2:15 ` [PATCH v2 01/20] memory: Adjust event ranges to fit within notifier boundaries Alejandro Jimenez
@ 2025-05-11 18:31 ` Michael S. Tsirkin
2025-05-12 8:02 ` David Hildenbrand
2025-06-12 6:54 ` Vasant Hegde
2 siblings, 0 replies; 54+ messages in thread
From: Michael S. Tsirkin @ 2025-05-11 18:31 UTC (permalink / raw)
To: Alejandro Jimenez
Cc: qemu-devel, pbonzini, richard.henderson, eduardo, peterx, david,
philmd, marcel.apfelbaum, alex.williamson, vasant.hegde,
suravee.suthikulpanit, santosh.shukla, sarunkod, Wei.Huang2,
clement.mathieu--drif, ethan.milon, joao.m.martins,
boris.ostrovsky
On Fri, May 02, 2025 at 02:15:46AM +0000, Alejandro Jimenez wrote:
> Invalidating the entire address space (i.e. range of [0, ~0ULL]) is a
> valid and required operation by vIOMMU implementations. However, such
> invalidations currently trigger an assertion unless they originate from
> device IOTLB invalidations.
>
> Although in recent Linux guests this case is not exercised by the VTD
> implementation due to various optimizations, the assertion will be hit
> by upcoming AMD vIOMMU changes to support DMA address translation. More
> specifically, when running a Linux guest with a VFIO passthrough device,
> and a kernel that does not contain commit 3f2571fed2fa ("iommu/amd:
> Remove redundant domain flush from attach_device()").
>
> Remove the assertion altogether and adjust the range to ensure it does
> not cross notifier boundaries.
>
> Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Looks legit:
Acked-by: Michael S. Tsirkin <mst@redhat.com>
can we get an ack from memory API supporters?
> ---
> system/memory.c | 10 +++-------
> 1 file changed, 3 insertions(+), 7 deletions(-)
>
> diff --git a/system/memory.c b/system/memory.c
> index 71434e7ad02c..7ad2fc098341 100644
> --- a/system/memory.c
> +++ b/system/memory.c
> @@ -2021,13 +2021,9 @@ void memory_region_notify_iommu_one(IOMMUNotifier *notifier,
> return;
> }
>
> - if (notifier->notifier_flags & IOMMU_NOTIFIER_DEVIOTLB_UNMAP) {
> - /* Crop (iova, addr_mask) to range */
> - tmp.iova = MAX(tmp.iova, notifier->start);
> - tmp.addr_mask = MIN(entry_end, notifier->end) - tmp.iova;
> - } else {
> - assert(entry->iova >= notifier->start && entry_end <= notifier->end);
> - }
> + /* Crop (iova, addr_mask) to range */
> + tmp.iova = MAX(tmp.iova, notifier->start);
> + tmp.addr_mask = MIN(entry_end, notifier->end) - tmp.iova;
>
> if (event->type & notifier->notifier_flags) {
> notifier->notify(notifier, &tmp);
> --
> 2.43.5
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH v2 01/20] memory: Adjust event ranges to fit within notifier boundaries
2025-05-02 2:15 ` [PATCH v2 01/20] memory: Adjust event ranges to fit within notifier boundaries Alejandro Jimenez
2025-05-11 18:31 ` Michael S. Tsirkin
@ 2025-05-12 8:02 ` David Hildenbrand
2025-05-12 17:29 ` Peter Xu
2025-06-12 6:54 ` Vasant Hegde
2 siblings, 1 reply; 54+ messages in thread
From: David Hildenbrand @ 2025-05-12 8:02 UTC (permalink / raw)
To: Alejandro Jimenez, qemu-devel, peterx
Cc: pbonzini, richard.henderson, eduardo, philmd, mst,
marcel.apfelbaum, alex.williamson, vasant.hegde,
suravee.suthikulpanit, santosh.shukla, sarunkod, Wei.Huang2,
clement.mathieu--drif, ethan.milon, joao.m.martins,
boris.ostrovsky
On 02.05.25 04:15, Alejandro Jimenez wrote:
> Invalidating the entire address space (i.e. range of [0, ~0ULL]) is a
> valid and required operation by vIOMMU implementations. However, such
> invalidations currently trigger an assertion unless they originate from
> device IOTLB invalidations.
>
> Although in recent Linux guests this case is not exercised by the VTD
> implementation due to various optimizations, the assertion will be hit
> by upcoming AMD vIOMMU changes to support DMA address translation. More
> specifically, when running a Linux guest with a VFIO passthrough device,
> and a kernel that does not contain commit 3f2571fed2fa ("iommu/amd:
> Remove redundant domain flush from attach_device()").
>
> Remove the assertion altogether and adjust the range to ensure it does
> not cross notifier boundaries.
Looking at the history, we used to have the assert unconditionally, and
made it specific to IOMMU_NOTIFIER_DEVIOTLB_UNMAP in
commit 1804857f19f612f6907832e35599cdb51d4ec764
Author: Eugenio Pérez <eperezma@redhat.com>
Date: Mon Nov 16 17:55:06 2020 +0100
memory: Skip bad range assertion if notifier is DEVIOTLB_UNMAP type
Device IOTLB invalidations can unmap arbitrary ranges, either outside of
the memory region or even [0, ~0ULL] for all the space. The assertion
could be hit by a guest, and rhel7 guest effectively hit it.
Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Message-Id: <20201116165506.31315-6-eperezma@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
I think this change here is fine, but it would be good to get Peter Xu's opinion.
Acked-by: David Hildenbrand <david@redhat.com>
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH v2 01/20] memory: Adjust event ranges to fit within notifier boundaries
2025-05-12 8:02 ` David Hildenbrand
@ 2025-05-12 17:29 ` Peter Xu
0 siblings, 0 replies; 54+ messages in thread
From: Peter Xu @ 2025-05-12 17:29 UTC (permalink / raw)
To: David Hildenbrand
Cc: Alejandro Jimenez, qemu-devel, pbonzini, richard.henderson,
eduardo, philmd, mst, marcel.apfelbaum, alex.williamson,
vasant.hegde, suravee.suthikulpanit, santosh.shukla, sarunkod,
Wei.Huang2, clement.mathieu--drif, ethan.milon, joao.m.martins,
boris.ostrovsky
On Mon, May 12, 2025 at 10:02:16AM +0200, David Hildenbrand wrote:
> On 02.05.25 04:15, Alejandro Jimenez wrote:
> > Invalidating the entire address space (i.e. range of [0, ~0ULL]) is a
> > valid and required operation by vIOMMU implementations. However, such
> > invalidations currently trigger an assertion unless they originate from
> > device IOTLB invalidations.
> >
> > Although in recent Linux guests this case is not exercised by the VTD
> > implementation due to various optimizations, the assertion will be hit
> > by upcoming AMD vIOMMU changes to support DMA address translation. More
> > specifically, when running a Linux guest with a VFIO passthrough device,
> > and a kernel that does not contain commit 3f2571fed2fa ("iommu/amd:
> > Remove redundant domain flush from attach_device()").
> >
> > Remove the assertion altogether and adjust the range to ensure it does
> > not cross notifier boundaries.
>
> Looking at the history, we used to have the assert unconditionally, and
> made it specific to IOMMU_NOTIFIER_DEVIOTLB_UNMAP in
>
> commit 1804857f19f612f6907832e35599cdb51d4ec764
> Author: Eugenio Pérez <eperezma@redhat.com>
> Date: Mon Nov 16 17:55:06 2020 +0100
>
> memory: Skip bad range assertion if notifier is DEVIOTLB_UNMAP type
> Device IOTLB invalidations can unmap arbitrary ranges, either outside of
> the memory region or even [0, ~0ULL] for all the space. The assertion
> could be hit by a guest, and rhel7 guest effectively hit it.
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> Reviewed-by: Peter Xu <peterx@redhat.com>
> Reviewed-by: Juan Quintela <quintela@redhat.com>
> Acked-by: Jason Wang <jasowang@redhat.com>
> Message-Id: <20201116165506.31315-6-eperezma@redhat.com>
> Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
>
>
> I think this change here is fine, but it would be good getting Pete Xu's opinion.
Agree, that could be an old sanity check from when the range used to be guaranteed.
>
> Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Peter Xu <peterx@redhat.com>
--
Peter Xu
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH v2 01/20] memory: Adjust event ranges to fit within notifier boundaries
2025-05-02 2:15 ` [PATCH v2 01/20] memory: Adjust event ranges to fit within notifier boundaries Alejandro Jimenez
2025-05-11 18:31 ` Michael S. Tsirkin
2025-05-12 8:02 ` David Hildenbrand
@ 2025-06-12 6:54 ` Vasant Hegde
2025-06-12 21:49 ` Alejandro Jimenez
2 siblings, 1 reply; 54+ messages in thread
From: Vasant Hegde @ 2025-06-12 6:54 UTC (permalink / raw)
To: Alejandro Jimenez, qemu-devel
Cc: pbonzini, richard.henderson, eduardo, peterx, david, philmd, mst,
marcel.apfelbaum, alex.williamson, suravee.suthikulpanit,
santosh.shukla, sarunkod, Wei.Huang2, clement.mathieu--drif,
ethan.milon, joao.m.martins, boris.ostrovsky
Alejandro,
On 5/2/2025 7:45 AM, Alejandro Jimenez wrote:
> Invalidating the entire address space (i.e. range of [0, ~0ULL]) is a
> valid and required operation by vIOMMU implementations. However, such
> invalidations currently trigger an assertion unless they originate from
> device IOTLB invalidations.
>
> Although in recent Linux guests this case is not exercised by the VTD
> implementation due to various optimizations, the assertion will be hit
> by upcoming AMD vIOMMU changes to support DMA address translation. More
> specifically, when running a Linux guest with a VFIO passthrough device,
> and a kernel that does not contain commit 3f2571fed2fa ("iommu/amd:
> Remove redundant domain flush from attach_device()").
FYI. It's easy to send an invalidate-all without the above commit (as it does
one on every attach), and there are other paths where the kernel will still
send an invalidate-all, like detaching/attaching a device, etc.
-Vasant
>
> Remove the assertion altogether and adjust the range to ensure it does
> not cross notifier boundaries.
>
> Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
> ---
> system/memory.c | 10 +++-------
> 1 file changed, 3 insertions(+), 7 deletions(-)
>
> diff --git a/system/memory.c b/system/memory.c
> index 71434e7ad02c..7ad2fc098341 100644
> --- a/system/memory.c
> +++ b/system/memory.c
> @@ -2021,13 +2021,9 @@ void memory_region_notify_iommu_one(IOMMUNotifier *notifier,
> return;
> }
>
> - if (notifier->notifier_flags & IOMMU_NOTIFIER_DEVIOTLB_UNMAP) {
> - /* Crop (iova, addr_mask) to range */
> - tmp.iova = MAX(tmp.iova, notifier->start);
> - tmp.addr_mask = MIN(entry_end, notifier->end) - tmp.iova;
> - } else {
> - assert(entry->iova >= notifier->start && entry_end <= notifier->end);
> - }
> + /* Crop (iova, addr_mask) to range */
> + tmp.iova = MAX(tmp.iova, notifier->start);
> + tmp.addr_mask = MIN(entry_end, notifier->end) - tmp.iova;
>
> if (event->type & notifier->notifier_flags) {
> notifier->notify(notifier, &tmp);
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH v2 01/20] memory: Adjust event ranges to fit within notifier boundaries
2025-06-12 6:54 ` Vasant Hegde
@ 2025-06-12 21:49 ` Alejandro Jimenez
0 siblings, 0 replies; 54+ messages in thread
From: Alejandro Jimenez @ 2025-06-12 21:49 UTC (permalink / raw)
To: Vasant Hegde, qemu-devel
Cc: pbonzini, richard.henderson, eduardo, peterx, david, philmd, mst,
marcel.apfelbaum, alex.williamson, suravee.suthikulpanit,
santosh.shukla, sarunkod, Wei.Huang2, clement.mathieu--drif,
ethan.milon, joao.m.martins, boris.ostrovsky
On 6/12/25 2:54 AM, Vasant Hegde wrote:
> Alejandro,
>
>
> On 5/2/2025 7:45 AM, Alejandro Jimenez wrote:
>> Invalidating the entire address space (i.e. range of [0, ~0ULL]) is a
>> valid and required operation by vIOMMU implementations. However, such
>> invalidations currently trigger an assertion unless they originate from
>> device IOTLB invalidations.
>>
>> Although in recent Linux guests this case is not exercised by the VTD
>> implementation due to various optimizations, the assertion will be hit
>> by upcoming AMD vIOMMU changes to support DMA address translation. More
>> specifically, when running a Linux guest with a VFIO passthrough device,
>> and a kernel that does not contain commit 3f2571fed2fa ("iommu/amd:
>> Remove redundant domain flush from attach_device()").
>
> FYI. Its easy to send invalidate all without above commit (as it does it in
> every attach), there are other paths where kernel will still send invalidate
> all.. Like detaching/attaching device, etc.
>
ACK, thank you for the comment/confirmation. I just mentioned an easily
reproducible case where it causes a crash during guest boot by simply
passing a VF to the guest on the QEMU cmdline, with no hotplug or other
more complex scenarios required. Your argument above is part of why I
chose to remove the assertion completely instead of special-casing and
breaking down this range in the amd_iommu.c code handling the invalidation.
Thank you,
Alejandro
>
> -Vasant
>
>
>
>>
>> Remove the assertion altogether and adjust the range to ensure it does
>> not cross notifier boundaries.
>>
>> Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
>> ---
>> system/memory.c | 10 +++-------
>> 1 file changed, 3 insertions(+), 7 deletions(-)
>>
>> diff --git a/system/memory.c b/system/memory.c
>> index 71434e7ad02c..7ad2fc098341 100644
>> --- a/system/memory.c
>> +++ b/system/memory.c
>> @@ -2021,13 +2021,9 @@ void memory_region_notify_iommu_one(IOMMUNotifier *notifier,
>> return;
>> }
>>
>> - if (notifier->notifier_flags & IOMMU_NOTIFIER_DEVIOTLB_UNMAP) {
>> - /* Crop (iova, addr_mask) to range */
>> - tmp.iova = MAX(tmp.iova, notifier->start);
>> - tmp.addr_mask = MIN(entry_end, notifier->end) - tmp.iova;
>> - } else {
>> - assert(entry->iova >= notifier->start && entry_end <= notifier->end);
>> - }
>> + /* Crop (iova, addr_mask) to range */
>> + tmp.iova = MAX(tmp.iova, notifier->start);
>> + tmp.addr_mask = MIN(entry_end, notifier->end) - tmp.iova;
>>
>> if (event->type & notifier->notifier_flags) {
>> notifier->notify(notifier, &tmp);
>
^ permalink raw reply [flat|nested] 54+ messages in thread
* [PATCH v2 02/20] amd_iommu: Document '-device amd-iommu' common options
2025-05-02 2:15 [PATCH v2 00/20] AMD vIOMMU: DMA remapping support for VFIO devices Alejandro Jimenez
2025-05-02 2:15 ` [PATCH v2 01/20] memory: Adjust event ranges to fit within notifier boundaries Alejandro Jimenez
@ 2025-05-02 2:15 ` Alejandro Jimenez
2025-05-02 2:15 ` [PATCH v2 03/20] amd_iommu: Reorder device and page table helpers Alejandro Jimenez
` (20 subsequent siblings)
22 siblings, 0 replies; 54+ messages in thread
From: Alejandro Jimenez @ 2025-05-02 2:15 UTC (permalink / raw)
To: qemu-devel
Cc: pbonzini, richard.henderson, eduardo, peterx, david, philmd, mst,
marcel.apfelbaum, alex.williamson, vasant.hegde,
suravee.suthikulpanit, santosh.shukla, sarunkod, Wei.Huang2,
clement.mathieu--drif, ethan.milon, joao.m.martins,
boris.ostrovsky, alejandro.j.jimenez
Document the common parameters used when emulating AMD vIOMMU.
Besides the two amd-iommu-specific options, 'xtsup' and 'dma-remap', the
generic x86 IOMMU option 'intremap' is also included, since it is
typically specified in QEMU command line examples and mailing list threads.
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
---
qemu-options.hx | 23 +++++++++++++++++++++++
1 file changed, 23 insertions(+)
diff --git a/qemu-options.hx b/qemu-options.hx
index dc694a99a30a..198acab48e8e 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -1226,6 +1226,29 @@ SRST
``aw-bits=val`` (val between 32 and 64, default depends on machine)
This decides the address width of the IOVA address space.
+``-device amd-iommu[,option=...]``
+ Enables emulation of an AMD-Vi I/O Memory Management Unit (IOMMU).
+ Only available with ``-machine q35``, it supports the following options:
+
+ ``dma-remap=on|off`` (default: off)
+ Support for DMA address translation and access permission checking for
+ guests attaching passthrough devices to paging domains, using the AMD v1
+ I/O Page Table format. This enables ``-device vfio-pci,...`` to work
+ correctly with a guest using the DMA remapping feature of the vIOMMU.
+
+ ``intremap=on|off`` (default: auto)
+ Generic x86 IOMMU functionality implemented by the ``amd-iommu`` device.
+ Enables the interrupt remapping feature in guests, which is also required
+ to enable x2apic support.
+ Currently only available with ``kernel-irqchip=off|split``, it is
+ automatically enabled when either of those modes is in use, and disabled
+ with ``kernel-irqchip=on``.
+
+ ``xtsup=on|off`` (default: off)
+ Interrupt remapping table supports x2apic mode, enabling the use of
+ 128-bit IRTE format with 32-bit destination field by the guest. Required
+ to support routing interrupts to vCPUs with APIC IDs larger than 0xff.
+
ERST
DEF("name", HAS_ARG, QEMU_OPTION_name,
--
2.43.5
^ permalink raw reply related [flat|nested] 54+ messages in thread
* [PATCH v2 03/20] amd_iommu: Reorder device and page table helpers
2025-05-02 2:15 [PATCH v2 00/20] AMD vIOMMU: DMA remapping support for VFIO devices Alejandro Jimenez
2025-05-02 2:15 ` [PATCH v2 01/20] memory: Adjust event ranges to fit within notifier boundaries Alejandro Jimenez
2025-05-02 2:15 ` [PATCH v2 02/20] amd_iommu: Document '-device amd-iommu' common options Alejandro Jimenez
@ 2025-05-02 2:15 ` Alejandro Jimenez
2025-05-02 2:15 ` [PATCH v2 04/20] amd_iommu: Helper to decode size of page invalidation command Alejandro Jimenez
` (19 subsequent siblings)
22 siblings, 0 replies; 54+ messages in thread
From: Alejandro Jimenez @ 2025-05-02 2:15 UTC (permalink / raw)
To: qemu-devel
Cc: pbonzini, richard.henderson, eduardo, peterx, david, philmd, mst,
marcel.apfelbaum, alex.williamson, vasant.hegde,
suravee.suthikulpanit, santosh.shukla, sarunkod, Wei.Huang2,
clement.mathieu--drif, ethan.milon, joao.m.martins,
boris.ostrovsky, alejandro.j.jimenez
Move code related to Device Table and Page Table to an earlier location in
the file, where it does not require forward declarations to be used by the
various invalidation functions that will need to query the DTE and walk the
page table in upcoming changes.
This change consists of code movement only; no functional change is intended.
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
---
hw/i386/amd_iommu.c | 170 ++++++++++++++++++++++----------------------
1 file changed, 85 insertions(+), 85 deletions(-)
diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
index 2cf7e24a21d8..9e500121f6e8 100644
--- a/hw/i386/amd_iommu.c
+++ b/hw/i386/amd_iommu.c
@@ -407,6 +407,91 @@ static void amdvi_completion_wait(AMDVIState *s, uint64_t *cmd)
trace_amdvi_completion_wait(addr, data);
}
+static inline uint64_t amdvi_get_perms(uint64_t entry)
+{
+ return (entry & (AMDVI_DEV_PERM_READ | AMDVI_DEV_PERM_WRITE)) >>
+ AMDVI_DEV_PERM_SHIFT;
+}
+
+/* validate that reserved bits are honoured */
+static bool amdvi_validate_dte(AMDVIState *s, uint16_t devid,
+ uint64_t *dte)
+{
+ if ((dte[0] & AMDVI_DTE_LOWER_QUAD_RESERVED)
+ || (dte[1] & AMDVI_DTE_MIDDLE_QUAD_RESERVED)
+ || (dte[2] & AMDVI_DTE_UPPER_QUAD_RESERVED) || dte[3]) {
+ amdvi_log_illegaldevtab_error(s, devid,
+ s->devtab +
+ devid * AMDVI_DEVTAB_ENTRY_SIZE, 0);
+ return false;
+ }
+
+ return true;
+}
+
+/* get a device table entry given the devid */
+static bool amdvi_get_dte(AMDVIState *s, int devid, uint64_t *entry)
+{
+ uint32_t offset = devid * AMDVI_DEVTAB_ENTRY_SIZE;
+
+ if (dma_memory_read(&address_space_memory, s->devtab + offset, entry,
+ AMDVI_DEVTAB_ENTRY_SIZE, MEMTXATTRS_UNSPECIFIED)) {
+ trace_amdvi_dte_get_fail(s->devtab, offset);
+ /* log error accessing dte */
+ amdvi_log_devtab_error(s, devid, s->devtab + offset, 0);
+ return false;
+ }
+
+ *entry = le64_to_cpu(*entry);
+ if (!amdvi_validate_dte(s, devid, entry)) {
+ trace_amdvi_invalid_dte(entry[0]);
+ return false;
+ }
+
+ return true;
+}
+
+/* get pte translation mode */
+static inline uint8_t get_pte_translation_mode(uint64_t pte)
+{
+ return (pte >> AMDVI_DEV_MODE_RSHIFT) & AMDVI_DEV_MODE_MASK;
+}
+
+static inline uint64_t pte_override_page_mask(uint64_t pte)
+{
+ uint8_t page_mask = 13;
+ uint64_t addr = (pte & AMDVI_DEV_PT_ROOT_MASK) >> 12;
+ /* find the first zero bit */
+ while (addr & 1) {
+ page_mask++;
+ addr = addr >> 1;
+ }
+
+ return ~((1ULL << page_mask) - 1);
+}
+
+static inline uint64_t pte_get_page_mask(uint64_t oldlevel)
+{
+ return ~((1UL << ((oldlevel * 9) + 3)) - 1);
+}
+
+static inline uint64_t amdvi_get_pte_entry(AMDVIState *s, uint64_t pte_addr,
+ uint16_t devid)
+{
+ uint64_t pte;
+
+ if (dma_memory_read(&address_space_memory, pte_addr,
+ &pte, sizeof(pte), MEMTXATTRS_UNSPECIFIED)) {
+ trace_amdvi_get_pte_hwerror(pte_addr);
+ amdvi_log_pagetab_error(s, devid, pte_addr, 0);
+ pte = 0;
+ return pte;
+ }
+
+ pte = le64_to_cpu(pte);
+ return pte;
+}
+
/* log error without aborting since linux seems to be using reserved bits */
static void amdvi_inval_devtab_entry(AMDVIState *s, uint64_t *cmd)
{
@@ -838,91 +923,6 @@ static void amdvi_mmio_write(void *opaque, hwaddr addr, uint64_t val,
}
}
-static inline uint64_t amdvi_get_perms(uint64_t entry)
-{
- return (entry & (AMDVI_DEV_PERM_READ | AMDVI_DEV_PERM_WRITE)) >>
- AMDVI_DEV_PERM_SHIFT;
-}
-
-/* validate that reserved bits are honoured */
-static bool amdvi_validate_dte(AMDVIState *s, uint16_t devid,
- uint64_t *dte)
-{
- if ((dte[0] & AMDVI_DTE_LOWER_QUAD_RESERVED)
- || (dte[1] & AMDVI_DTE_MIDDLE_QUAD_RESERVED)
- || (dte[2] & AMDVI_DTE_UPPER_QUAD_RESERVED) || dte[3]) {
- amdvi_log_illegaldevtab_error(s, devid,
- s->devtab +
- devid * AMDVI_DEVTAB_ENTRY_SIZE, 0);
- return false;
- }
-
- return true;
-}
-
-/* get a device table entry given the devid */
-static bool amdvi_get_dte(AMDVIState *s, int devid, uint64_t *entry)
-{
- uint32_t offset = devid * AMDVI_DEVTAB_ENTRY_SIZE;
-
- if (dma_memory_read(&address_space_memory, s->devtab + offset, entry,
- AMDVI_DEVTAB_ENTRY_SIZE, MEMTXATTRS_UNSPECIFIED)) {
- trace_amdvi_dte_get_fail(s->devtab, offset);
- /* log error accessing dte */
- amdvi_log_devtab_error(s, devid, s->devtab + offset, 0);
- return false;
- }
-
- *entry = le64_to_cpu(*entry);
- if (!amdvi_validate_dte(s, devid, entry)) {
- trace_amdvi_invalid_dte(entry[0]);
- return false;
- }
-
- return true;
-}
-
-/* get pte translation mode */
-static inline uint8_t get_pte_translation_mode(uint64_t pte)
-{
- return (pte >> AMDVI_DEV_MODE_RSHIFT) & AMDVI_DEV_MODE_MASK;
-}
-
-static inline uint64_t pte_override_page_mask(uint64_t pte)
-{
- uint8_t page_mask = 13;
- uint64_t addr = (pte & AMDVI_DEV_PT_ROOT_MASK) >> 12;
- /* find the first zero bit */
- while (addr & 1) {
- page_mask++;
- addr = addr >> 1;
- }
-
- return ~((1ULL << page_mask) - 1);
-}
-
-static inline uint64_t pte_get_page_mask(uint64_t oldlevel)
-{
- return ~((1UL << ((oldlevel * 9) + 3)) - 1);
-}
-
-static inline uint64_t amdvi_get_pte_entry(AMDVIState *s, uint64_t pte_addr,
- uint16_t devid)
-{
- uint64_t pte;
-
- if (dma_memory_read(&address_space_memory, pte_addr,
- &pte, sizeof(pte), MEMTXATTRS_UNSPECIFIED)) {
- trace_amdvi_get_pte_hwerror(pte_addr);
- amdvi_log_pagetab_error(s, devid, pte_addr, 0);
- pte = 0;
- return pte;
- }
-
- pte = le64_to_cpu(pte);
- return pte;
-}
-
static void amdvi_page_walk(AMDVIAddressSpace *as, uint64_t *dte,
IOMMUTLBEntry *ret, unsigned perms,
hwaddr addr)
--
2.43.5
^ permalink raw reply related [flat|nested] 54+ messages in thread
* [PATCH v2 04/20] amd_iommu: Helper to decode size of page invalidation command
2025-05-02 2:15 [PATCH v2 00/20] AMD vIOMMU: DMA remapping support for VFIO devices Alejandro Jimenez
` (2 preceding siblings ...)
2025-05-02 2:15 ` [PATCH v2 03/20] amd_iommu: Reorder device and page table helpers Alejandro Jimenez
@ 2025-05-02 2:15 ` Alejandro Jimenez
2025-05-02 2:15 ` [PATCH v2 05/20] amd_iommu: Add helper function to extract the DTE Alejandro Jimenez
` (18 subsequent siblings)
22 siblings, 0 replies; 54+ messages in thread
From: Alejandro Jimenez @ 2025-05-02 2:15 UTC (permalink / raw)
To: qemu-devel
Cc: pbonzini, richard.henderson, eduardo, peterx, david, philmd, mst,
marcel.apfelbaum, alex.williamson, vasant.hegde,
suravee.suthikulpanit, santosh.shukla, sarunkod, Wei.Huang2,
clement.mathieu--drif, ethan.milon, joao.m.martins,
boris.ostrovsky, alejandro.j.jimenez
The size of the region to invalidate depends on the S bit and address
encoded in the command. Add a helper to extract this information, which
will be used to sync shadow page tables in upcoming changes.
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
---
hw/i386/amd_iommu.c | 34 ++++++++++++++++++++++++++++++++++
hw/i386/amd_iommu.h | 4 ++++
2 files changed, 38 insertions(+)
diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
index 9e500121f6e8..dff6f04c8651 100644
--- a/hw/i386/amd_iommu.c
+++ b/hw/i386/amd_iommu.c
@@ -545,6 +545,40 @@ static gboolean amdvi_iotlb_remove_by_domid(gpointer key, gpointer value,
return entry->domid == domid;
}
+/*
+ * Helper to decode the size of the range to invalidate encoded in the
+ * INVALIDATE_IOMMU_PAGES Command format.
+ * The size of the region to invalidate depends on the S bit and address.
+ * S bit value:
+ * 0 : Invalidation size is 4 Kbytes.
+ * 1 : Invalidation size is determined by first zero bit in the address
+ * starting from Address[12].
+ *
+ * In the AMD IOMMU Linux driver, an invalidation command with address
+ * ((1 << 63) - 1) is sent when intending to clear the entire cache.
+ * However, Table 14: Example Page Size Encodings shows that an address of
+ * ((1ULL << 51) - 1) encodes the entire cache, so effectively any address with
+ * first zero at bit 51 or larger is a request to invalidate the entire address
+ * space.
+ */
+static uint64_t __attribute__((unused))
+amdvi_decode_invalidation_size(hwaddr addr, uint16_t flags)
+{
+ uint64_t size = AMDVI_PAGE_SIZE;
+ uint8_t fzbit = 0;
+
+ if (flags & AMDVI_CMD_INVAL_IOMMU_PAGES_S) {
+ fzbit = cto64(addr | 0xFFF);
+
+ if (fzbit >= 51) {
+ size = AMDVI_INV_ALL_PAGES;
+ } else {
+ size = 1ULL << (fzbit + 1);
+ }
+ }
+ return size;
+}
+
/* we don't have devid - we can't remove pages by address */
static void amdvi_inval_pages(AMDVIState *s, uint64_t *cmd)
{
diff --git a/hw/i386/amd_iommu.h b/hw/i386/amd_iommu.h
index 28125130c6fc..6f35b0595054 100644
--- a/hw/i386/amd_iommu.h
+++ b/hw/i386/amd_iommu.h
@@ -123,6 +123,10 @@
#define AMDVI_CMD_COMPLETE_PPR_REQUEST 0x07
#define AMDVI_CMD_INVAL_AMDVI_ALL 0x08
+
+#define AMDVI_CMD_INVAL_IOMMU_PAGES_S (1ULL << 0)
+#define AMDVI_INV_ALL_PAGES (1ULL << 52)
+
#define AMDVI_DEVTAB_ENTRY_SIZE 32
/* Device table entry bits 0:63 */
--
2.43.5
^ permalink raw reply related [flat|nested] 54+ messages in thread
* [PATCH v2 05/20] amd_iommu: Add helper function to extract the DTE
2025-05-02 2:15 [PATCH v2 00/20] AMD vIOMMU: DMA remapping support for VFIO devices Alejandro Jimenez
` (3 preceding siblings ...)
2025-05-02 2:15 ` [PATCH v2 04/20] amd_iommu: Helper to decode size of page invalidation command Alejandro Jimenez
@ 2025-05-02 2:15 ` Alejandro Jimenez
2025-05-12 6:45 ` Sairaj Kodilkar
2025-05-20 10:18 ` Ethan MILON
2025-05-02 2:15 ` [PATCH v2 06/20] amd_iommu: Return an error when unable to read PTE from guest memory Alejandro Jimenez
` (17 subsequent siblings)
22 siblings, 2 replies; 54+ messages in thread
From: Alejandro Jimenez @ 2025-05-02 2:15 UTC (permalink / raw)
To: qemu-devel
Cc: pbonzini, richard.henderson, eduardo, peterx, david, philmd, mst,
marcel.apfelbaum, alex.williamson, vasant.hegde,
suravee.suthikulpanit, santosh.shukla, sarunkod, Wei.Huang2,
clement.mathieu--drif, ethan.milon, joao.m.martins,
boris.ostrovsky, alejandro.j.jimenez
Extracting the DTE from a given AMDVIAddressSpace pointer structure is a
common operation required for syncing the shadow page tables. Implement a
helper to do it and check for common error conditions.
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
---
hw/i386/amd_iommu.c | 45 ++++++++++++++++++++++++++++++++++++++++-----
1 file changed, 40 insertions(+), 5 deletions(-)
diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
index dff6f04c8651..5322a614f5d6 100644
--- a/hw/i386/amd_iommu.c
+++ b/hw/i386/amd_iommu.c
@@ -77,6 +77,18 @@ typedef struct AMDVIIOTLBEntry {
uint64_t page_mask; /* physical page size */
} AMDVIIOTLBEntry;
+/*
+ * These 'fault' reasons have an overloaded meaning since they are not only
+ * intended for describing reasons that generate an IO_PAGE_FAULT as per the AMD
+ * IOMMU specification, but are also used to signal internal errors in the
+ * emulation code.
+ */
+typedef enum AMDVIFaultReason {
+ AMDVI_FR_DTE_RTR_ERR = 1, /* Failure to retrieve DTE */
+ AMDVI_FR_DTE_V, /* DTE[V] = 0 */
+ AMDVI_FR_DTE_TV, /* DTE[TV] = 0 */
+} AMDVIFaultReason;
+
uint64_t amdvi_extended_feature_register(AMDVIState *s)
{
uint64_t feature = AMDVI_DEFAULT_EXT_FEATURES;
@@ -492,6 +504,28 @@ static inline uint64_t amdvi_get_pte_entry(AMDVIState *s, uint64_t pte_addr,
return pte;
}
+static int amdvi_as_to_dte(AMDVIAddressSpace *as, uint64_t *dte)
+{
+ uint16_t devid = PCI_BUILD_BDF(as->bus_num, as->devfn);
+ AMDVIState *s = as->iommu_state;
+
+ if (!amdvi_get_dte(s, devid, dte)) {
+ /* Unable to retrieve DTE for devid */
+ return -AMDVI_FR_DTE_RTR_ERR;
+ }
+
+ if (!(dte[0] & AMDVI_DEV_VALID)) {
+ /* DTE[V] not set, address is passed untranslated for devid */
+ return -AMDVI_FR_DTE_V;
+ }
+
+ if (!(dte[0] & AMDVI_DEV_TRANSLATION_VALID)) {
+ /* DTE[TV] not set, host page table not valid for devid */
+ return -AMDVI_FR_DTE_TV;
+ }
+ return 0;
+}
+
/* log error without aborting since linux seems to be using reserved bits */
static void amdvi_inval_devtab_entry(AMDVIState *s, uint64_t *cmd)
{
@@ -1024,6 +1058,7 @@ static void amdvi_do_translate(AMDVIAddressSpace *as, hwaddr addr,
uint16_t devid = PCI_BUILD_BDF(as->bus_num, as->devfn);
AMDVIIOTLBEntry *iotlb_entry = amdvi_iotlb_lookup(s, addr, devid);
uint64_t entry[4];
+ int dte_ret;
if (iotlb_entry) {
trace_amdvi_iotlb_hit(PCI_BUS_NUM(devid), PCI_SLOT(devid),
@@ -1035,13 +1070,13 @@ static void amdvi_do_translate(AMDVIAddressSpace *as, hwaddr addr,
return;
}
- if (!amdvi_get_dte(s, devid, entry)) {
- return;
- }
+ dte_ret = amdvi_as_to_dte(as, entry);
- /* devices with V = 0 are not translated */
- if (!(entry[0] & AMDVI_DEV_VALID)) {
+ if (dte_ret == -AMDVI_FR_DTE_V) {
+ /* DTE[V]=0, address is passed untranslated */
goto out;
+ } else if (dte_ret == -AMDVI_FR_DTE_TV) {
+ return;
}
amdvi_page_walk(as, entry, ret,
--
2.43.5
^ permalink raw reply related [flat|nested] 54+ messages in thread
* Re: [PATCH v2 05/20] amd_iommu: Add helper function to extract the DTE
2025-05-02 2:15 ` [PATCH v2 05/20] amd_iommu: Add helper function to extract the DTE Alejandro Jimenez
@ 2025-05-12 6:45 ` Sairaj Kodilkar
2025-05-14 20:23 ` Alejandro Jimenez
2025-05-20 10:18 ` Ethan MILON
1 sibling, 1 reply; 54+ messages in thread
From: Sairaj Kodilkar @ 2025-05-12 6:45 UTC (permalink / raw)
To: Alejandro Jimenez, qemu-devel
Cc: pbonzini, richard.henderson, eduardo, peterx, david, philmd, mst,
marcel.apfelbaum, alex.williamson, vasant.hegde,
suravee.suthikulpanit, santosh.shukla, Wei.Huang2,
clement.mathieu--drif, ethan.milon, joao.m.martins,
boris.ostrovsky
On 5/2/2025 7:45 AM, Alejandro Jimenez wrote:
> Extracting the DTE from a given AMDVIAddressSpace pointer structure is a
> common operation required for syncing the shadow page tables. Implement a
> helper to do it and check for common error conditions.
>
> Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
> ---
> hw/i386/amd_iommu.c | 45 ++++++++++++++++++++++++++++++++++++++++-----
> 1 file changed, 40 insertions(+), 5 deletions(-)
>
> diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
> index dff6f04c8651..5322a614f5d6 100644
> --- a/hw/i386/amd_iommu.c
> +++ b/hw/i386/amd_iommu.c
> @@ -77,6 +77,18 @@ typedef struct AMDVIIOTLBEntry {
> uint64_t page_mask; /* physical page size */
> } AMDVIIOTLBEntry;
>
> +/*
> + * These 'fault' reasons have an overloaded meaning since they are not only
> + * intended for describing reasons that generate an IO_PAGE_FAULT as per the AMD
> + * IOMMU specification, but are also used to signal internal errors in the
> + * emulation code.
> + */
> +typedef enum AMDVIFaultReason {
> + AMDVI_FR_DTE_RTR_ERR = 1, /* Failure to retrieve DTE */
> + AMDVI_FR_DTE_V, /* DTE[V] = 0 */
> + AMDVI_FR_DTE_TV, /* DTE[TV] = 0 */
> +} AMDVIFaultReason;
> +
> uint64_t amdvi_extended_feature_register(AMDVIState *s)
> {
> uint64_t feature = AMDVI_DEFAULT_EXT_FEATURES;
> @@ -492,6 +504,28 @@ static inline uint64_t amdvi_get_pte_entry(AMDVIState *s, uint64_t pte_addr,
> return pte;
> }
>
> +static int amdvi_as_to_dte(AMDVIAddressSpace *as, uint64_t *dte)
> +{
> + uint16_t devid = PCI_BUILD_BDF(as->bus_num, as->devfn);
> + AMDVIState *s = as->iommu_state;
> +
> + if (!amdvi_get_dte(s, devid, dte)) {
> + /* Unable to retrieve DTE for devid */
> + return -AMDVI_FR_DTE_RTR_ERR;
> + }
> +
> + if (!(dte[0] & AMDVI_DEV_VALID)) {
> + /* DTE[V] not set, address is passed untranslated for devid */
> + return -AMDVI_FR_DTE_V;
> + }
> +
> + if (!(dte[0] & AMDVI_DEV_TRANSLATION_VALID)) {
> + /* DTE[TV] not set, host page table not valid for devid */
> + return -AMDVI_FR_DTE_TV;
> + }
> + return 0;
> +}
> +
> /* log error without aborting since linux seems to be using reserved bits */
> static void amdvi_inval_devtab_entry(AMDVIState *s, uint64_t *cmd)
> {
> @@ -1024,6 +1058,7 @@ static void amdvi_do_translate(AMDVIAddressSpace *as, hwaddr addr,
> uint16_t devid = PCI_BUILD_BDF(as->bus_num, as->devfn);
> AMDVIIOTLBEntry *iotlb_entry = amdvi_iotlb_lookup(s, addr, devid);
> uint64_t entry[4];
> + int dte_ret;
>
> if (iotlb_entry) {
> trace_amdvi_iotlb_hit(PCI_BUS_NUM(devid), PCI_SLOT(devid),
> @@ -1035,13 +1070,13 @@ static void amdvi_do_translate(AMDVIAddressSpace *as, hwaddr addr,
> return;
> }
>
> - if (!amdvi_get_dte(s, devid, entry)) {
> - return;
> - }
> + dte_ret = amdvi_as_to_dte(as, entry);
>
> - /* devices with V = 0 are not translated */
> - if (!(entry[0] & AMDVI_DEV_VALID)) {
> + if (dte_ret == -AMDVI_FR_DTE_V) {
> + /* DTE[V]=0, address is passed untranslated */
> goto out;
> + } else if (dte_ret == -AMDVI_FR_DTE_TV) {
> + return;
> }
>
Hi Alejandro,
You missed handling -AMDVI_FR_DTE_RTR_ERR.
Regards
Sairaj Kodilkar
> amdvi_page_walk(as, entry, ret,
* Re: [PATCH v2 05/20] amd_iommu: Add helper function to extract the DTE
2025-05-12 6:45 ` Sairaj Kodilkar
@ 2025-05-14 20:23 ` Alejandro Jimenez
0 siblings, 0 replies; 54+ messages in thread
From: Alejandro Jimenez @ 2025-05-14 20:23 UTC (permalink / raw)
To: Sairaj Kodilkar, qemu-devel
Cc: pbonzini, richard.henderson, eduardo, peterx, david, philmd, mst,
marcel.apfelbaum, alex.williamson, vasant.hegde,
suravee.suthikulpanit, santosh.shukla, Wei.Huang2,
clement.mathieu--drif, ethan.milon, joao.m.martins,
boris.ostrovsky
On 5/12/25 2:45 AM, Sairaj Kodilkar wrote:
>
>
> On 5/2/2025 7:45 AM, Alejandro Jimenez wrote:
>> @@ -1035,13 +1070,13 @@ static void
>> amdvi_do_translate(AMDVIAddressSpace *as, hwaddr addr,
>> return;
>> }
>> - if (!amdvi_get_dte(s, devid, entry)) {
>> - return;
>> - }
>> + dte_ret = amdvi_as_to_dte(as, entry);
>> - /* devices with V = 0 are not translated */
>> - if (!(entry[0] & AMDVI_DEV_VALID)) {
>> + if (dte_ret == -AMDVI_FR_DTE_V) {
>> + /* DTE[V]=0, address is passed untranslated */
>> goto out;
>> + } else if (dte_ret == -AMDVI_FR_DTE_TV) {
>> + return;
>> }
>>
>
> Hi Alejandro,
>
> You missed handling -AMDVI_FR_DTE_RTR_ERR.
Good catch. I'll replace that block with:
+ dte_ret = amdvi_as_to_dte(as, entry);
- /* devices with V = 0 are not translated */
- if (!(entry[0] & AMDVI_DEV_VALID)) {
- goto out;
+ if (dte_ret < 0) {
+ if (dte_ret == -AMDVI_FR_DTE_V) {
+ /* DTE[V]=0, address is passed untranslated */
+ goto out;
+ }
+ return;
}
Alejandro
>
> Regards
> Sairaj Kodilkar
>
>> amdvi_page_walk(as, entry, ret,
>
* Re: [PATCH v2 05/20] amd_iommu: Add helper function to extract the DTE
2025-05-02 2:15 ` [PATCH v2 05/20] amd_iommu: Add helper function to extract the DTE Alejandro Jimenez
2025-05-12 6:45 ` Sairaj Kodilkar
@ 2025-05-20 10:18 ` Ethan MILON
2025-05-21 14:49 ` Alejandro Jimenez
1 sibling, 1 reply; 54+ messages in thread
From: Ethan MILON @ 2025-05-20 10:18 UTC (permalink / raw)
To: Alejandro Jimenez, qemu-devel@nongnu.org
Hi,
On 5/2/25 4:15 AM, Alejandro Jimenez wrote:
> Extracting the DTE from a given AMDVIAddressSpace pointer structure is a
> common operation required for syncing the shadow page tables. Implement a
> helper to do it and check for common error conditions.
>
> Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
> ---
> hw/i386/amd_iommu.c | 45 ++++++++++++++++++++++++++++++++++++++++-----
> 1 file changed, 40 insertions(+), 5 deletions(-)
>
> diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
> index dff6f04c8651..5322a614f5d6 100644
> --- a/hw/i386/amd_iommu.c
> +++ b/hw/i386/amd_iommu.c
> @@ -77,6 +77,18 @@ typedef struct AMDVIIOTLBEntry {
> uint64_t page_mask; /* physical page size */
> } AMDVIIOTLBEntry;
>
> +/*
> + * These 'fault' reasons have an overloaded meaning since they are not only
> + * intended for describing reasons that generate an IO_PAGE_FAULT as per the AMD
> + * IOMMU specification, but are also used to signal internal errors in the
> + * emulation code.
> + */
> +typedef enum AMDVIFaultReason {
> + AMDVI_FR_DTE_RTR_ERR = 1, /* Failure to retrieve DTE */
> + AMDVI_FR_DTE_V, /* DTE[V] = 0 */
> + AMDVI_FR_DTE_TV, /* DTE[TV] = 0 */
> +} AMDVIFaultReason;
> +
> uint64_t amdvi_extended_feature_register(AMDVIState *s)
> {
> uint64_t feature = AMDVI_DEFAULT_EXT_FEATURES;
> @@ -492,6 +504,28 @@ static inline uint64_t amdvi_get_pte_entry(AMDVIState *s, uint64_t pte_addr,
> return pte;
> }
>
> +static int amdvi_as_to_dte(AMDVIAddressSpace *as, uint64_t *dte)
> +{
> + uint16_t devid = PCI_BUILD_BDF(as->bus_num, as->devfn);
> + AMDVIState *s = as->iommu_state;
> +
> + if (!amdvi_get_dte(s, devid, dte)) {
> + /* Unable to retrieve DTE for devid */
> + return -AMDVI_FR_DTE_RTR_ERR;
> + }
> +
> + if (!(dte[0] & AMDVI_DEV_VALID)) {
> + /* DTE[V] not set, address is passed untranslated for devid */
> + return -AMDVI_FR_DTE_V;
> + }
> +
> + if (!(dte[0] & AMDVI_DEV_TRANSLATION_VALID)) {
> + /* DTE[TV] not set, host page table not valid for devid */
> + return -AMDVI_FR_DTE_TV;
> + }
> + return 0;
> +}
> +
I'm not sure the new amdvi_as_to_dte() helper adds much. It just wraps a
few checks and makes it harder to report faults properly in the future.
Is there a reason this couldn't be handled inline?
> /* log error without aborting since linux seems to be using reserved bits */
> static void amdvi_inval_devtab_entry(AMDVIState *s, uint64_t *cmd)
> {
> @@ -1024,6 +1058,7 @@ static void amdvi_do_translate(AMDVIAddressSpace *as, hwaddr addr,
> uint16_t devid = PCI_BUILD_BDF(as->bus_num, as->devfn);
> AMDVIIOTLBEntry *iotlb_entry = amdvi_iotlb_lookup(s, addr, devid);
> uint64_t entry[4];
> + int dte_ret;
>
> if (iotlb_entry) {
> trace_amdvi_iotlb_hit(PCI_BUS_NUM(devid), PCI_SLOT(devid),
> @@ -1035,13 +1070,13 @@ static void amdvi_do_translate(AMDVIAddressSpace *as, hwaddr addr,
> return;
> }
>
> - if (!amdvi_get_dte(s, devid, entry)) {
> - return;
> - }
> + dte_ret = amdvi_as_to_dte(as, entry);
>
> - /* devices with V = 0 are not translated */
> - if (!(entry[0] & AMDVI_DEV_VALID)) {
> + if (dte_ret == -AMDVI_FR_DTE_V) {
> + /* DTE[V]=0, address is passed untranslated */
> goto out;
> + } else if (dte_ret == -AMDVI_FR_DTE_TV) {
> + return;
> }
>
> amdvi_page_walk(as, entry, ret,
> --
> 2.43.5
>
* Re: [PATCH v2 05/20] amd_iommu: Add helper function to extract the DTE
2025-05-20 10:18 ` Ethan MILON
@ 2025-05-21 14:49 ` Alejandro Jimenez
2025-06-12 8:31 ` Ethan MILON
0 siblings, 1 reply; 54+ messages in thread
From: Alejandro Jimenez @ 2025-05-21 14:49 UTC (permalink / raw)
To: Ethan MILON, qemu-devel@nongnu.org
Hi Ethan,
On 5/20/25 6:18 AM, Ethan MILON wrote:
> Hi,
>
> On 5/2/25 4:15 AM, Alejandro Jimenez wrote:
>> Extracting the DTE from a given AMDVIAddressSpace pointer structure is a
>> common operation required for syncing the shadow page tables. Implement a
>> helper to do it and check for common error conditions.
>>
>> +/*
>> + * These 'fault' reasons have an overloaded meaning since they are not only
>> + * intended for describing reasons that generate an IO_PAGE_FAULT as per the AMD
>> + * IOMMU specification, but are also used to signal internal errors in the
>> + * emulation code.
>> + */
>> +typedef enum AMDVIFaultReason {
>> + AMDVI_FR_DTE_RTR_ERR = 1, /* Failure to retrieve DTE */
>> + AMDVI_FR_DTE_V, /* DTE[V] = 0 */
>> + AMDVI_FR_DTE_TV, /* DTE[TV] = 0 */
>> +} AMDVIFaultReason;
>> +
>>
>> +static int amdvi_as_to_dte(AMDVIAddressSpace *as, uint64_t *dte)
>> +{
>> + uint16_t devid = PCI_BUILD_BDF(as->bus_num, as->devfn);
>> + AMDVIState *s = as->iommu_state;
>> +
>> + if (!amdvi_get_dte(s, devid, dte)) {
>> + /* Unable to retrieve DTE for devid */
>> + return -AMDVI_FR_DTE_RTR_ERR;
>> + }
>> +
>> + if (!(dte[0] & AMDVI_DEV_VALID)) {
>> + /* DTE[V] not set, address is passed untranslated for devid */
>> + return -AMDVI_FR_DTE_V;
>> + }
>> +
>> + if (!(dte[0] & AMDVI_DEV_TRANSLATION_VALID)) {
>> + /* DTE[TV] not set, host page table not valid for devid */
>> + return -AMDVI_FR_DTE_TV;
>> + }
>> + return 0;
>> +}
>> +
>
> I'm not sure the new amdvi_as_to_dte() helper adds much. It just wraps a
> few checks and makes it harder to report faults properly in the future.
I am afraid I don't understand this argument. How does it make it
harder? It returns 0 on success, and a negative value in case of error,
and the error type can be checked and handled as needed by the caller.
The current implementation checks for 3 basic possible failures and maps
them to errors:
AMDVI_FR_DTE_RTR_ERR --> This generally means something is very wrong
AMDVI_FR_DTE_V and/or AMDVI_FR_DTE_TV i.e. V=1, TV=1 --> This is a basic
condition that multiple DTE fields require to be considered valid.
Every time we need to retrieve a DTE (for current and future features)
we need to minimally check for those conditions.
> Is there a reason this couldn't be handled inline?
Discounting future uses, just by the end of the series you have 5
different callers of this function, that is a lot of code duplication...
Thank you,
Alejandro
* Re: [PATCH v2 05/20] amd_iommu: Add helper function to extract the DTE
2025-05-21 14:49 ` Alejandro Jimenez
@ 2025-06-12 8:31 ` Ethan MILON
0 siblings, 0 replies; 54+ messages in thread
From: Ethan MILON @ 2025-06-12 8:31 UTC (permalink / raw)
To: Alejandro Jimenez, qemu-devel@nongnu.org
Hi,
On 5/21/25 4:49 PM, Alejandro Jimenez wrote:
> Hi Ethan,
>
> On 5/20/25 6:18 AM, Ethan MILON wrote:
>> Hi,
>>
>> On 5/2/25 4:15 AM, Alejandro Jimenez wrote:
>
>>> Extracting the DTE from a given AMDVIAddressSpace pointer structure is a
>>> common operation required for syncing the shadow page tables. Implement a
>>> helper to do it and check for common error conditions.
>>>
>
>>> +/*
>>> + * These 'fault' reasons have an overloaded meaning since they are not only
>>> + * intended for describing reasons that generate an IO_PAGE_FAULT as per the AMD
>>> + * IOMMU specification, but are also used to signal internal errors in the
>>> + * emulation code.
>>> + */
>>> +typedef enum AMDVIFaultReason {
>>> + AMDVI_FR_DTE_RTR_ERR = 1, /* Failure to retrieve DTE */
>>> + AMDVI_FR_DTE_V, /* DTE[V] = 0 */
>>> + AMDVI_FR_DTE_TV, /* DTE[TV] = 0 */
>>> +} AMDVIFaultReason;
>>> +
>
>>>
>>> +static int amdvi_as_to_dte(AMDVIAddressSpace *as, uint64_t *dte)
>>> +{
>>> + uint16_t devid = PCI_BUILD_BDF(as->bus_num, as->devfn);
>>> + AMDVIState *s = as->iommu_state;
>>> +
>>> + if (!amdvi_get_dte(s, devid, dte)) {
>>> + /* Unable to retrieve DTE for devid */
>>> + return -AMDVI_FR_DTE_RTR_ERR;
>>> + }
>>> +
>>> + if (!(dte[0] & AMDVI_DEV_VALID)) {
>>> + /* DTE[V] not set, address is passed untranslated for devid */
>>> + return -AMDVI_FR_DTE_V;
>>> + }
>>> +
>>> + if (!(dte[0] & AMDVI_DEV_TRANSLATION_VALID)) {
>>> + /* DTE[TV] not set, host page table not valid for devid */
>>> + return -AMDVI_FR_DTE_TV;
>>> + }
>>> + return 0;
>>> +}
>>> +
>>
>> I'm not sure the new amdvi_as_to_dte() helper adds much. It just wraps a
>> few checks and makes it harder to report faults properly in the future.
>
> I am afraid I don't understand this argument. How does it make it
> harder? It returns 0 on success, and a negative value in case of error,
> and the error type can be checked and handled as needed by the caller.
>
You're right, I initially thought the amdvi_as_to_dte() helper might
make proper fault reporting harder, but on second look, it shouldn't be
a problem.
> The current implementation checks for 3 basic possible failures and maps
> them to errors:
>
> AMDVI_FR_DTE_RTR_ERR --> This generally means something is very wrong
>
For AMDVI_FR_DTE_RTR_ERR, would it make sense to be more specific? Right
now, it could mean either a hardware issue or a malformed DTE. Or should
it be split up in a future patch with proper fault reporting implemented?
> AMDVI_FR_DTE_V and/or AMDVI_FR_DTE_TV i.e. V=1, TV=1 --> This is a basic
> condition that multiple DTE fields require to be considered valid.
>
> Every time we need to retrieve a DTE (for current and future features)
> we need to minimally check for those conditions.
>
>> Is there a reason this couldn't be handled inline?
>
> Discounting future uses, just by the end of the series you have 5
> different callers of this function, that is a lot of code duplication...
>
Ah, my mistake, I didn’t notice all the callers.
Thanks,
Ethan
> Thank you,
> Alejandro
* [PATCH v2 06/20] amd_iommu: Return an error when unable to read PTE from guest memory
2025-05-02 2:15 [PATCH v2 00/20] AMD vIOMMU: DMA remapping support for VFIO devices Alejandro Jimenez
` (4 preceding siblings ...)
2025-05-02 2:15 ` [PATCH v2 05/20] amd_iommu: Add helper function to extract the DTE Alejandro Jimenez
@ 2025-05-02 2:15 ` Alejandro Jimenez
2025-06-12 10:37 ` Vasant Hegde
2025-05-02 2:15 ` [PATCH v2 07/20] amd_iommu: Add helpers to walk AMD v1 Page Table format Alejandro Jimenez
` (16 subsequent siblings)
22 siblings, 1 reply; 54+ messages in thread
From: Alejandro Jimenez @ 2025-05-02 2:15 UTC (permalink / raw)
To: qemu-devel
Cc: pbonzini, richard.henderson, eduardo, peterx, david, philmd, mst,
marcel.apfelbaum, alex.williamson, vasant.hegde,
suravee.suthikulpanit, santosh.shukla, sarunkod, Wei.Huang2,
clement.mathieu--drif, ethan.milon, joao.m.martins,
boris.ostrovsky, alejandro.j.jimenez
Make amdvi_get_pte_entry() return an error value (-1) in cases where the
memory read fails, versus the current return of 0 to indicate failure.
The reason is that 0 is also a valid PTE value, and it is useful to know
when a PTE points to memory that is zero i.e. the guest unmapped the
page.
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
---
hw/i386/amd_iommu.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
index 5322a614f5d6..698967cc1a88 100644
--- a/hw/i386/amd_iommu.c
+++ b/hw/i386/amd_iommu.c
@@ -496,7 +496,7 @@ static inline uint64_t amdvi_get_pte_entry(AMDVIState *s, uint64_t pte_addr,
&pte, sizeof(pte), MEMTXATTRS_UNSPECIFIED)) {
trace_amdvi_get_pte_hwerror(pte_addr);
amdvi_log_pagetab_error(s, devid, pte_addr, 0);
- pte = 0;
+ pte = (uint64_t)-1;
return pte;
}
@@ -1024,7 +1024,7 @@ static void amdvi_page_walk(AMDVIAddressSpace *as, uint64_t *dte,
/* add offset and load pte */
pte_addr += ((addr >> (3 + 9 * level)) & 0x1FF) << 3;
pte = amdvi_get_pte_entry(as->iommu_state, pte_addr, as->devfn);
- if (!pte) {
+ if (!pte || (pte == (uint64_t)-1)) {
return;
}
oldlevel = level;
--
2.43.5
* Re: [PATCH v2 06/20] amd_iommu: Return an error when unable to read PTE from guest memory
2025-05-02 2:15 ` [PATCH v2 06/20] amd_iommu: Return an error when unable to read PTE from guest memory Alejandro Jimenez
@ 2025-06-12 10:37 ` Vasant Hegde
2025-06-13 17:44 ` Alejandro Jimenez
0 siblings, 1 reply; 54+ messages in thread
From: Vasant Hegde @ 2025-06-12 10:37 UTC (permalink / raw)
To: Alejandro Jimenez, qemu-devel
Cc: pbonzini, richard.henderson, eduardo, peterx, david, philmd, mst,
marcel.apfelbaum, alex.williamson, suravee.suthikulpanit,
santosh.shukla, sarunkod, Wei.Huang2, clement.mathieu--drif,
ethan.milon, joao.m.martins, boris.ostrovsky
Alejandro,
On 5/2/2025 7:45 AM, Alejandro Jimenez wrote:
> Make amdvi_get_pte_entry() return an error value (-1) in cases where the
> memory read fails, versus the current return of 0 to indicate failure.
> The reason is that 0 is also a valid PTE value, and it is useful to know
If the PTE is valid then at least the PR bit will be set, so it will not be zero, right?
-Vasant
> when a PTE points to memory that is zero i.e. the guest unmapped the
> page.
>
> Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
> ---
> hw/i386/amd_iommu.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
> index 5322a614f5d6..698967cc1a88 100644
> --- a/hw/i386/amd_iommu.c
> +++ b/hw/i386/amd_iommu.c
> @@ -496,7 +496,7 @@ static inline uint64_t amdvi_get_pte_entry(AMDVIState *s, uint64_t pte_addr,
> &pte, sizeof(pte), MEMTXATTRS_UNSPECIFIED)) {
> trace_amdvi_get_pte_hwerror(pte_addr);
> amdvi_log_pagetab_error(s, devid, pte_addr, 0);
> - pte = 0;
> + pte = (uint64_t)-1;
> return pte;
> }
>
> @@ -1024,7 +1024,7 @@ static void amdvi_page_walk(AMDVIAddressSpace *as, uint64_t *dte,
> /* add offset and load pte */
> pte_addr += ((addr >> (3 + 9 * level)) & 0x1FF) << 3;
> pte = amdvi_get_pte_entry(as->iommu_state, pte_addr, as->devfn);
> - if (!pte) {
> + if (!pte || (pte == (uint64_t)-1)) {
> return;
> }
> oldlevel = level;
* Re: [PATCH v2 06/20] amd_iommu: Return an error when unable to read PTE from guest memory
2025-06-12 10:37 ` Vasant Hegde
@ 2025-06-13 17:44 ` Alejandro Jimenez
0 siblings, 0 replies; 54+ messages in thread
From: Alejandro Jimenez @ 2025-06-13 17:44 UTC (permalink / raw)
To: Vasant Hegde, qemu-devel
Cc: pbonzini, richard.henderson, eduardo, peterx, david, philmd, mst,
marcel.apfelbaum, alex.williamson, suravee.suthikulpanit,
santosh.shukla, sarunkod, Wei.Huang2, clement.mathieu--drif,
ethan.milon, joao.m.martins, boris.ostrovsky
On 6/12/25 6:37 AM, Vasant Hegde wrote:
> Alejandro,
>
>
> On 5/2/2025 7:45 AM, Alejandro Jimenez wrote:
>> Make amdvi_get_pte_entry() return an error value (-1) in cases where the
>> memory read fails, versus the current return of 0 to indicate failure.
>> The reason is that 0 is also a valid PTE value, and it is useful to know
>
>
> If PTE is valid then at least PR bit will be set. So it will not be zero right?
>
I can change the wording in the last sentence to something like:
"The reason is that 0 is also a valid value to have stored in the PTE in
guest memory i.e. the guest does not have a mapping"
What I am trying to convey is that amdvi_get_pte_entry() should return
three different states:
-1: Error attempting to read the guest memory that contains the PTE i.e.
dma_memory_read() returned an error. This has likely nothing to do with
the guest and signals a problem with the emulation (e.g. problem with
QEMU memory backend)
0: The guest memory containing the PTE was successfully read, but there
is currently no mapping in that PTE i.e. *pte==0
>0: The guest memory containing the PTE was successfully read, and
there is currently a mapping i.e. *pte== <hopefully a valid IO page
table entry>
Before the change, amdvi_get_pte_entry() returned 0 for both an error
and for empty PTEs, but for the page walker implementation later in the
patchset we need to differentiate between those two conditions.
Thank you,
Alejandro
> -Vasant
>
>
>> when a PTE points to memory that is zero i.e. the guest unmapped the
>> page.
>>
>> Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
>> ---
>> hw/i386/amd_iommu.c | 4 ++--
>> 1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
>> index 5322a614f5d6..698967cc1a88 100644
>> --- a/hw/i386/amd_iommu.c
>> +++ b/hw/i386/amd_iommu.c
>> @@ -496,7 +496,7 @@ static inline uint64_t amdvi_get_pte_entry(AMDVIState *s, uint64_t pte_addr,
>> &pte, sizeof(pte), MEMTXATTRS_UNSPECIFIED)) {
>> trace_amdvi_get_pte_hwerror(pte_addr);
>> amdvi_log_pagetab_error(s, devid, pte_addr, 0);
>> - pte = 0;
>> + pte = (uint64_t)-1;
>> return pte;
>> }
>>
>> @@ -1024,7 +1024,7 @@ static void amdvi_page_walk(AMDVIAddressSpace *as, uint64_t *dte,
>> /* add offset and load pte */
>> pte_addr += ((addr >> (3 + 9 * level)) & 0x1FF) << 3;
>> pte = amdvi_get_pte_entry(as->iommu_state, pte_addr, as->devfn);
>> - if (!pte) {
>> + if (!pte || (pte == (uint64_t)-1)) {
>> return;
>> }
>> oldlevel = level;
>
* [PATCH v2 07/20] amd_iommu: Add helpers to walk AMD v1 Page Table format
2025-05-02 2:15 [PATCH v2 00/20] AMD vIOMMU: DMA remapping support for VFIO devices Alejandro Jimenez
` (5 preceding siblings ...)
2025-05-02 2:15 ` [PATCH v2 06/20] amd_iommu: Return an error when unable to read PTE from guest memory Alejandro Jimenez
@ 2025-05-02 2:15 ` Alejandro Jimenez
2025-05-02 2:15 ` [PATCH v2 08/20] amd_iommu: Add a page walker to sync shadow page tables on invalidation Alejandro Jimenez
` (15 subsequent siblings)
22 siblings, 0 replies; 54+ messages in thread
From: Alejandro Jimenez @ 2025-05-02 2:15 UTC (permalink / raw)
To: qemu-devel
Cc: pbonzini, richard.henderson, eduardo, peterx, david, philmd, mst,
marcel.apfelbaum, alex.williamson, vasant.hegde,
suravee.suthikulpanit, santosh.shukla, sarunkod, Wei.Huang2,
clement.mathieu--drif, ethan.milon, joao.m.martins,
boris.ostrovsky, alejandro.j.jimenez
The current amdvi_page_walk() is designed to be called by the replay()
method. Rather than drastically altering it, introduce helpers to fetch
guest PTEs that will be used by a page walker implementation.
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
---
hw/i386/amd_iommu.c | 123 ++++++++++++++++++++++++++++++++++++++++++++
hw/i386/amd_iommu.h | 42 +++++++++++++++
2 files changed, 165 insertions(+)
diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
index 698967cc1a88..44bf8ebe303e 100644
--- a/hw/i386/amd_iommu.c
+++ b/hw/i386/amd_iommu.c
@@ -87,6 +87,8 @@ typedef enum AMDVIFaultReason {
AMDVI_FR_DTE_RTR_ERR = 1, /* Failure to retrieve DTE */
AMDVI_FR_DTE_V, /* DTE[V] = 0 */
AMDVI_FR_DTE_TV, /* DTE[TV] = 0 */
+ AMDVI_FR_PT_ROOT_INV, /* Page Table Root ptr invalid */
+ AMDVI_FR_PT_ENTRY_INV, /* Failure to read PTE from guest memory */
} AMDVIFaultReason;
uint64_t amdvi_extended_feature_register(AMDVIState *s)
@@ -526,6 +528,127 @@ static int amdvi_as_to_dte(AMDVIAddressSpace *as, uint64_t *dte)
return 0;
}
+/*
+ * For a PTE encoding a large page, return the page size it encodes as described
+ * by the AMD IOMMU Specification Table 14: Example Page Size Encodings.
+ * No need to adjust the value of the PTE to point to the first PTE in the large
+ * page since the encoding guarantees all "base" PTEs in the large page are the
+ * same.
+ */
+static uint64_t large_pte_page_size(uint64_t pte)
+{
+ assert(PTE_NEXT_LEVEL(pte) == 7);
+
+ /* Determine size of the large/contiguous page encoded in the PTE */
+ return PTE_LARGE_PAGE_SIZE(pte);
+}
+
+/*
+ * Helper function to fetch a PTE using AMD v1 pgtable format.
+ * On successful page walk, returns 0 and pte parameter points to a valid PTE.
+ * On failure, returns:
+ * -AMDVI_FR_PT_ROOT_INV: A page walk is not possible due to conditions like DTE
+ * with invalid permissions, Page Table Root can not be read from DTE, or a
+ * larger IOVA than supported by page table level encoded in DTE[Mode].
+ * -AMDVI_FR_PT_ENTRY_INV: A PTE could not be read from guest memory during a
+ * page table walk. This means that the DTE has valid data, but one of the
+ * lower level entries in the Page Table could not be read.
+ */
+static int __attribute__((unused))
+fetch_pte(AMDVIAddressSpace *as, hwaddr address, uint64_t dte, uint64_t *pte,
+ hwaddr *page_size)
+{
+ IOMMUAccessFlags perms = amdvi_get_perms(dte);
+
+ uint8_t level, mode;
+ uint64_t pte_addr;
+
+ *pte = dte;
+ *page_size = 0;
+
+ if (perms == IOMMU_NONE) {
+ return -AMDVI_FR_PT_ROOT_INV;
+ }
+
+ /*
+ * The Linux kernel driver initializes the default mode to 3, corresponding
+ * to a 39-bit GPA space, where each entry in the pagetable translates to a
+ * 1GB (2^30) page size.
+ */
+ level = mode = get_pte_translation_mode(dte);
+ assert(mode > 0 && mode < 7);
+
+ /*
+ * If IOVA is larger than the max supported by the current pgtable level,
+ * there is nothing to do.
+ */
+ if (address > PT_LEVEL_MAX_ADDR(mode - 1)) {
+ /* IOVA too large for the current DTE */
+ return -AMDVI_FR_PT_ROOT_INV;
+ }
+
+ do {
+ level -= 1;
+
+ /* Update the page_size */
+ *page_size = PTE_LEVEL_PAGE_SIZE(level);
+
+ /* Permission bits are ANDed at every level, including the DTE */
+ perms &= amdvi_get_perms(*pte);
+ if (perms == IOMMU_NONE) {
+ return 0;
+ }
+
+ /* Not Present */
+ if (!IOMMU_PTE_PRESENT(*pte)) {
+ return 0;
+ }
+
+ /* Large or Leaf PTE found */
+ if (PTE_NEXT_LEVEL(*pte) == 7 || PTE_NEXT_LEVEL(*pte) == 0) {
+ /* Leaf PTE found */
+ break;
+ }
+
+ /*
+ * Index the pgtable using the IOVA bits corresponding to current level
+ * and walk down to the lower level.
+ */
+ pte_addr = NEXT_PTE_ADDR(*pte, level, address);
+ *pte = amdvi_get_pte_entry(as->iommu_state, pte_addr, as->devfn);
+
+ if (*pte == (uint64_t)-1) {
+ /*
+ * A returned PTE of -1 indicates a failure to read the page table
+ * entry from guest memory.
+ */
+ if (level == mode - 1) {
+ /* Failure to retrieve the Page Table from Root Pointer */
+ *page_size = 0;
+ return -AMDVI_FR_PT_ROOT_INV;
+ } else {
+ /* Failure to read PTE. Page walk skips a page_size chunk */
+ return -AMDVI_FR_PT_ENTRY_INV;
+ }
+ }
+ } while (level > 0);
+
+ assert(PTE_NEXT_LEVEL(*pte) == 0 || PTE_NEXT_LEVEL(*pte) == 7 ||
+ level == 0);
+ /*
+ * Page walk ends when Next Level field on PTE shows that either a leaf PTE
+ * or a series of large PTEs have been reached. In the latter case, even if
+ * the range starts in the middle of a contiguous page, the returned PTE
+ * must be the first PTE of the series.
+ */
+ if (PTE_NEXT_LEVEL(*pte) == 7) {
+ /* Update page_size with the large PTE page size */
+ *page_size = large_pte_page_size(*pte);
+ }
+
+ return 0;
+}
+
/* log error without aborting since linux seems to be using reserved bits */
static void amdvi_inval_devtab_entry(AMDVIState *s, uint64_t *cmd)
{
diff --git a/hw/i386/amd_iommu.h b/hw/i386/amd_iommu.h
index 6f35b0595054..5e35d4b15167 100644
--- a/hw/i386/amd_iommu.h
+++ b/hw/i386/amd_iommu.h
@@ -25,6 +25,8 @@
#include "hw/i386/x86-iommu.h"
#include "qom/object.h"
+#define GENMASK64(h, l) (((~0ULL) >> (63 - (h) + (l))) << (l))
+
/* Capability registers */
#define AMDVI_CAPAB_BAR_LOW 0x04
#define AMDVI_CAPAB_BAR_HIGH 0x08
@@ -174,6 +176,46 @@
#define AMDVI_GATS_MODE (2ULL << 12)
#define AMDVI_HATS_MODE (2ULL << 10)
+/* Page Table format */
+
+#define AMDVI_PTE_PR (1ULL << 0)
+#define AMDVI_PTE_NEXT_LEVEL_MASK GENMASK64(11, 9)
+
+#define IOMMU_PTE_PRESENT(pte) ((pte) & AMDVI_PTE_PR)
+
+/* Using level=0 for leaf PTE at 4K page size */
+#define PT_LEVEL_SHIFT(level) (12 + ((level) * 9))
+
+/* Return IOVA bit group used to index the Page Table at specific level */
+#define PT_LEVEL_INDEX(level, iova) (((iova) >> PT_LEVEL_SHIFT(level)) & \
+ GENMASK64(8, 0))
+
+/* Return the max address for a specified level i.e. max_oaddr */
+#define PT_LEVEL_MAX_ADDR(x) (((x) < 5) ? \
+ ((1ULL << PT_LEVEL_SHIFT((x + 1))) - 1) : \
+ (~(0ULL)))
+
+/* Extract the NextLevel field from PTE/PDE */
+#define PTE_NEXT_LEVEL(pte) (((pte) & AMDVI_PTE_NEXT_LEVEL_MASK) >> 9)
+
+/* Take page table level and return default pagetable size for level */
+#define PTE_LEVEL_PAGE_SIZE(level) (1ULL << (PT_LEVEL_SHIFT(level)))
+
+/*
+ * Return address of lower level page table encoded in PTE and specified by
+ * current level and corresponding IOVA bit group at such level.
+ */
+#define NEXT_PTE_ADDR(pte, level, iova) (((pte) & AMDVI_DEV_PT_ROOT_MASK) + \
+ (PT_LEVEL_INDEX(level, iova) * 8))
+
+/*
+ * Take a PTE value with mode=0x07 and return the page size it encodes.
+ */
+#define PTE_LARGE_PAGE_SIZE(pte) (1ULL << (1 + cto64(((pte) | 0xfffULL))))
+
+/* Return number of PTEs to use for a given page size (expected power of 2) */
+#define PAGE_SIZE_PTE_COUNT(pgsz) (1ULL << ((ctz64(pgsz) - 12) % 9))
+
/* IOTLB */
#define AMDVI_IOTLB_MAX_SIZE 1024
#define AMDVI_DEVID_SHIFT 36
--
2.43.5
* [PATCH v2 08/20] amd_iommu: Add a page walker to sync shadow page tables on invalidation
2025-05-02 2:15 [PATCH v2 00/20] AMD vIOMMU: DMA remapping support for VFIO devices Alejandro Jimenez
` (6 preceding siblings ...)
2025-05-02 2:15 ` [PATCH v2 07/20] amd_iommu: Add helpers to walk AMD v1 Page Table format Alejandro Jimenez
@ 2025-05-02 2:15 ` Alejandro Jimenez
2025-05-02 2:15 ` [PATCH v2 09/20] amd_iommu: Add basic structure to support IOMMU notifier updates Alejandro Jimenez
` (14 subsequent siblings)
22 siblings, 0 replies; 54+ messages in thread
From: Alejandro Jimenez @ 2025-05-02 2:15 UTC (permalink / raw)
To: qemu-devel
Cc: pbonzini, richard.henderson, eduardo, peterx, david, philmd, mst,
marcel.apfelbaum, alex.williamson, vasant.hegde,
suravee.suthikulpanit, santosh.shukla, sarunkod, Wei.Huang2,
clement.mathieu--drif, ethan.milon, joao.m.martins,
boris.ostrovsky, alejandro.j.jimenez
For the specified address range, walk the page table identifying regions
as mapped or unmapped and invoke registered notifiers with the
corresponding event type.
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
---
hw/i386/amd_iommu.c | 75 +++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 75 insertions(+)
diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
index 44bf8ebe303e..6a2ba878dfa7 100644
--- a/hw/i386/amd_iommu.c
+++ b/hw/i386/amd_iommu.c
@@ -649,6 +649,81 @@ fetch_pte(AMDVIAddressSpace *as, hwaddr address, uint64_t dte, uint64_t *pte,
return 0;
}
+/*
+ * Walk the guest page table for the given IOVA range and signal the
+ * registered notifiers to sync the shadow page tables in the host.
+ * Must be called with a valid DTE for DMA remapping, i.e. V=1, TV=1.
+ */
+static void __attribute__((unused))
+amdvi_sync_shadow_page_table_range(AMDVIAddressSpace *as, uint64_t *dte,
+ hwaddr addr, uint64_t size, bool send_unmap)
+{
+ IOMMUTLBEvent event;
+
+ hwaddr iova_next, page_mask, pagesize;
+ hwaddr iova = addr;
+ hwaddr end = iova + size - 1;
+
+ uint64_t pte;
+ int ret;
+
+ while (iova < end) {
+
+ ret = fetch_pte(as, iova, dte[0], &pte, &pagesize);
+
+ if (ret == -AMDVI_FR_PT_ROOT_INV) {
+ /*
+ * Invalid conditions such as the IOVA being larger than supported
+ * by current page table mode as configured in the DTE, or a failure
+ * to fetch the Page Table from the Page Table Root Pointer in DTE.
+ */
+ assert(pagesize == 0);
+ return;
+ }
+ /* PTE has been validated for major errors and pagesize is set */
+ assert(pagesize);
+ page_mask = ~(pagesize - 1);
+ iova_next = (iova & page_mask) + pagesize;
+
+ if (ret == -AMDVI_FR_PT_ENTRY_INV) {
+ /*
+ * Failed to read the PTE from memory; pagesize matches the current
+ * level. Unable to determine the region type, so a safe strategy is
+ * to skip the range and continue the page walk.
+ */
+ goto next;
+ }
+
+ event.entry.target_as = &address_space_memory;
+ event.entry.iova = iova & page_mask;
+ /* translated_addr is irrelevant for the unmap case */
+ event.entry.translated_addr = (pte & AMDVI_DEV_PT_ROOT_MASK) &
+ page_mask;
+ event.entry.addr_mask = ~page_mask;
+ event.entry.perm = amdvi_get_perms(pte);
+
+ /*
+ * In cases where the leaf PTE is not found, or it has invalid
+ * permissions, an UNMAP type notification is sent, but only if the
+ * caller requested it.
+ */
+ if (!IOMMU_PTE_PRESENT(pte) || (event.entry.perm == IOMMU_NONE)) {
+ if (!send_unmap) {
+ goto next;
+ }
+ event.type = IOMMU_NOTIFIER_UNMAP;
+ } else {
+ event.type = IOMMU_NOTIFIER_MAP;
+ }
+
+ /* Invoke the notifiers registered for this address space */
+ memory_region_notify_iommu(&as->iommu, 0, event);
+
+next:
+ iova = iova_next;
+ }
+}
+
/* log error without aborting since linux seems to be using reserved bits */
static void amdvi_inval_devtab_entry(AMDVIState *s, uint64_t *cmd)
{
--
2.43.5
* [PATCH v2 09/20] amd_iommu: Add basic structure to support IOMMU notifier updates
2025-05-02 2:15 [PATCH v2 00/20] AMD vIOMMU: DMA remapping support for VFIO devices Alejandro Jimenez
` (7 preceding siblings ...)
2025-05-02 2:15 ` [PATCH v2 08/20] amd_iommu: Add a page walker to sync shadow page tables on invalidation Alejandro Jimenez
@ 2025-05-02 2:15 ` Alejandro Jimenez
2025-05-12 6:52 ` Sairaj Kodilkar
2025-06-23 10:53 ` Sairaj Kodilkar
2025-05-02 2:15 ` [PATCH v2 10/20] amd_iommu: Sync shadow page tables on page invalidation Alejandro Jimenez
` (13 subsequent siblings)
22 siblings, 2 replies; 54+ messages in thread
From: Alejandro Jimenez @ 2025-05-02 2:15 UTC (permalink / raw)
To: qemu-devel
Cc: pbonzini, richard.henderson, eduardo, peterx, david, philmd, mst,
marcel.apfelbaum, alex.williamson, vasant.hegde,
suravee.suthikulpanit, santosh.shukla, sarunkod, Wei.Huang2,
clement.mathieu--drif, ethan.milon, joao.m.martins,
boris.ostrovsky, alejandro.j.jimenez
Add the minimal data structures required to maintain a list of address
spaces (i.e. devices) with registered notifiers, and to update the type of
events that require notifications.
Note that the ability to register for MAP notifications is not available.
It will be unblocked by following changes that enable the synchronization of
guest I/O page tables with host IOMMU state, at which point an amd-iommu
device property will be introduced to control this capability.
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
---
hw/i386/amd_iommu.c | 26 +++++++++++++++++++++++---
hw/i386/amd_iommu.h | 3 +++
2 files changed, 26 insertions(+), 3 deletions(-)
diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
index 6a2ba878dfa7..2f69459ab68d 100644
--- a/hw/i386/amd_iommu.c
+++ b/hw/i386/amd_iommu.c
@@ -66,6 +66,11 @@ struct AMDVIAddressSpace {
MemoryRegion iommu_nodma; /* Alias of shared nodma memory region */
MemoryRegion iommu_ir; /* Device's interrupt remapping region */
AddressSpace as; /* device's corresponding address space */
+
+ /* DMA address translation support */
+ IOMMUNotifierFlag notifier_flags;
+ /* entry in list of Address spaces with registered notifiers */
+ QLIST_ENTRY(AMDVIAddressSpace) next;
};
/* AMDVI cache entry */
@@ -1711,6 +1716,7 @@ static AddressSpace *amdvi_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
iommu_as[devfn]->bus_num = (uint8_t)bus_num;
iommu_as[devfn]->devfn = (uint8_t)devfn;
iommu_as[devfn]->iommu_state = s;
+ iommu_as[devfn]->notifier_flags = IOMMU_NONE;
amdvi_dev_as = iommu_as[devfn];
@@ -1791,14 +1797,28 @@ static int amdvi_iommu_notify_flag_changed(IOMMUMemoryRegion *iommu,
Error **errp)
{
AMDVIAddressSpace *as = container_of(iommu, AMDVIAddressSpace, iommu);
+ AMDVIState *s = as->iommu_state;
if (new & IOMMU_NOTIFIER_MAP) {
error_setg(errp,
- "device %02x.%02x.%x requires iommu notifier which is not "
- "currently supported", as->bus_num, PCI_SLOT(as->devfn),
- PCI_FUNC(as->devfn));
+ "device %02x.%02x.%x requires iommu notifier which is not "
+ "currently supported", as->bus_num, PCI_SLOT(as->devfn),
+ PCI_FUNC(as->devfn));
return -EINVAL;
}
+
+ /*
+ * Update notifier flags for address space and the list of address spaces
+ * with registered notifiers.
+ */
+ as->notifier_flags = new;
+
+ if (old == IOMMU_NOTIFIER_NONE) {
+ QLIST_INSERT_HEAD(&s->amdvi_as_with_notifiers, as, next);
+ } else if (new == IOMMU_NOTIFIER_NONE) {
+ QLIST_REMOVE(as, next);
+ }
+
return 0;
}
diff --git a/hw/i386/amd_iommu.h b/hw/i386/amd_iommu.h
index 5e35d4b15167..a7462b2adb79 100644
--- a/hw/i386/amd_iommu.h
+++ b/hw/i386/amd_iommu.h
@@ -411,6 +411,9 @@ struct AMDVIState {
/* for each served device */
AMDVIAddressSpace **address_spaces[PCI_BUS_MAX];
+ /* list of address spaces with registered notifiers */
+ QLIST_HEAD(, AMDVIAddressSpace) amdvi_as_with_notifiers;
+
/* IOTLB */
GHashTable *iotlb;
--
2.43.5
* Re: [PATCH v2 09/20] amd_iommu: Add basic structure to support IOMMU notifier updates
2025-05-02 2:15 ` [PATCH v2 09/20] amd_iommu: Add basic structure to support IOMMU notifier updates Alejandro Jimenez
@ 2025-05-12 6:52 ` Sairaj Kodilkar
2025-06-23 10:53 ` Sairaj Kodilkar
1 sibling, 0 replies; 54+ messages in thread
From: Sairaj Kodilkar @ 2025-05-12 6:52 UTC (permalink / raw)
To: Alejandro Jimenez, qemu-devel
Cc: pbonzini, richard.henderson, eduardo, peterx, david, philmd, mst,
marcel.apfelbaum, alex.williamson, vasant.hegde,
suravee.suthikulpanit, santosh.shukla, Wei.Huang2,
clement.mathieu--drif, ethan.milon, joao.m.martins,
boris.ostrovsky
On 5/2/2025 7:45 AM, Alejandro Jimenez wrote:
> Add the minimal data structures required to maintain a list of address
> spaces (i.e. devices) with registered notifiers, and to update the type of
> events that require notifications.
> Note that the ability to register for MAP notifications is not available.
> It will be unblocked by following changes that enable the synchronization of
> guest I/O page tables with host IOMMU state, at which point an amd-iommu
> device property will be introduced to control this capability.
>
> Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
> ---
> hw/i386/amd_iommu.c | 26 +++++++++++++++++++++++---
> hw/i386/amd_iommu.h | 3 +++
> 2 files changed, 26 insertions(+), 3 deletions(-)
>
> diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
> index 6a2ba878dfa7..2f69459ab68d 100644
> --- a/hw/i386/amd_iommu.c
> +++ b/hw/i386/amd_iommu.c
> @@ -66,6 +66,11 @@ struct AMDVIAddressSpace {
> MemoryRegion iommu_nodma; /* Alias of shared nodma memory region */
> MemoryRegion iommu_ir; /* Device's interrupt remapping region */
> AddressSpace as; /* device's corresponding address space */
> +
> + /* DMA address translation support */
> + IOMMUNotifierFlag notifier_flags;
> + /* entry in list of Address spaces with registered notifiers */
> + QLIST_ENTRY(AMDVIAddressSpace) next;
> };
>
> /* AMDVI cache entry */
> @@ -1711,6 +1716,7 @@ static AddressSpace *amdvi_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
> iommu_as[devfn]->bus_num = (uint8_t)bus_num;
> iommu_as[devfn]->devfn = (uint8_t)devfn;
> iommu_as[devfn]->iommu_state = s;
> + iommu_as[devfn]->notifier_flags = IOMMU_NONE;
>
> amdvi_dev_as = iommu_as[devfn];
>
> @@ -1791,14 +1797,28 @@ static int amdvi_iommu_notify_flag_changed(IOMMUMemoryRegion *iommu,
> Error **errp)
> {
> AMDVIAddressSpace *as = container_of(iommu, AMDVIAddressSpace, iommu);
> + AMDVIState *s = as->iommu_state;
>
> if (new & IOMMU_NOTIFIER_MAP) {
> error_setg(errp,
> - "device %02x.%02x.%x requires iommu notifier which is not "
> - "currently supported", as->bus_num, PCI_SLOT(as->devfn),
> - PCI_FUNC(as->devfn));
> + "device %02x.%02x.%x requires iommu notifier which is not "
> + "currently supported", as->bus_num, PCI_SLOT(as->devfn),
> + PCI_FUNC(as->devfn));
Redundant whitespace changes, please revert.
Regards
Sairaj Kodilkar
* Re: [PATCH v2 09/20] amd_iommu: Add basic structure to support IOMMU notifier updates
2025-05-02 2:15 ` [PATCH v2 09/20] amd_iommu: Add basic structure to support IOMMU notifier updates Alejandro Jimenez
2025-05-12 6:52 ` Sairaj Kodilkar
@ 2025-06-23 10:53 ` Sairaj Kodilkar
1 sibling, 0 replies; 54+ messages in thread
From: Sairaj Kodilkar @ 2025-06-23 10:53 UTC (permalink / raw)
To: Alejandro Jimenez, qemu-devel
Cc: pbonzini, richard.henderson, eduardo, peterx, david, philmd, mst,
marcel.apfelbaum, alex.williamson, vasant.hegde,
suravee.suthikulpanit, santosh.shukla, Wei.Huang2,
clement.mathieu--drif, ethan.milon, joao.m.martins,
boris.ostrovsky
On 5/2/2025 7:45 AM, Alejandro Jimenez wrote:
> Add the minimal data structures required to maintain a list of address
> spaces (i.e. devices) with registered notifiers, and to update the type of
> events that require notifications.
> Note that the ability to register for MAP notifications is not available.
> It will be unblocked by following changes that enable the synchronization of
> guest I/O page tables with host IOMMU state, at which point an amd-iommu
> device property will be introduced to control this capability.
>
> Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
> ---
> hw/i386/amd_iommu.c | 26 +++++++++++++++++++++++---
> hw/i386/amd_iommu.h | 3 +++
> 2 files changed, 26 insertions(+), 3 deletions(-)
>
> diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
> index 6a2ba878dfa7..2f69459ab68d 100644
> --- a/hw/i386/amd_iommu.c
> +++ b/hw/i386/amd_iommu.c
> @@ -66,6 +66,11 @@ struct AMDVIAddressSpace {
> MemoryRegion iommu_nodma; /* Alias of shared nodma memory region */
> MemoryRegion iommu_ir; /* Device's interrupt remapping region */
> AddressSpace as; /* device's corresponding address space */
> +
> + /* DMA address translation support */
> + IOMMUNotifierFlag notifier_flags;
> + /* entry in list of Address spaces with registered notifiers */
> + QLIST_ENTRY(AMDVIAddressSpace) next;
> };
>
> /* AMDVI cache entry */
> @@ -1711,6 +1716,7 @@ static AddressSpace *amdvi_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
> iommu_as[devfn]->bus_num = (uint8_t)bus_num;
> iommu_as[devfn]->devfn = (uint8_t)devfn;
> iommu_as[devfn]->iommu_state = s;
> + iommu_as[devfn]->notifier_flags = IOMMU_NONE;
Hey
Use IOMMU_NOTIFIER_NONE instead of IOMMU_NONE. Though both are the same,
the clang compiler complains about it and fails to compile (GCC works fine).
Thanks
Sairaj Kodilkar
* [PATCH v2 10/20] amd_iommu: Sync shadow page tables on page invalidation
2025-05-02 2:15 [PATCH v2 00/20] AMD vIOMMU: DMA remapping support for VFIO devices Alejandro Jimenez
` (8 preceding siblings ...)
2025-05-02 2:15 ` [PATCH v2 09/20] amd_iommu: Add basic structure to support IOMMU notifier updates Alejandro Jimenez
@ 2025-05-02 2:15 ` Alejandro Jimenez
2025-05-02 2:15 ` [PATCH v2 11/20] amd_iommu: Use iova_tree records to determine large page size on UNMAP Alejandro Jimenez
` (12 subsequent siblings)
22 siblings, 0 replies; 54+ messages in thread
From: Alejandro Jimenez @ 2025-05-02 2:15 UTC (permalink / raw)
To: qemu-devel
Cc: pbonzini, richard.henderson, eduardo, peterx, david, philmd, mst,
marcel.apfelbaum, alex.williamson, vasant.hegde,
suravee.suthikulpanit, santosh.shukla, sarunkod, Wei.Huang2,
clement.mathieu--drif, ethan.milon, joao.m.martins,
boris.ostrovsky, alejandro.j.jimenez
When the guest issues an INVALIDATE_IOMMU_PAGES command, decode the address
and size of the invalidation and sync the guest page table state with the
host. This requires walking the guest page table and calling notifiers
registered for address spaces matching the domain ID encoded in the command.
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
---
hw/i386/amd_iommu.c | 82 ++++++++++++++++++++++++++++++++++++++++-----
1 file changed, 74 insertions(+), 8 deletions(-)
diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
index 2f69459ab68d..bddfe2f93136 100644
--- a/hw/i386/amd_iommu.c
+++ b/hw/i386/amd_iommu.c
@@ -559,9 +559,8 @@ static uint64_t large_pte_page_size(uint64_t pte)
* page table walk. This means that the DTE has valid data, but one of the
* lower level entries in the Page Table could not be read.
*/
-static int __attribute__((unused))
-fetch_pte(AMDVIAddressSpace *as, hwaddr address, uint64_t dte, uint64_t *pte,
- hwaddr *page_size)
+static uint64_t fetch_pte(AMDVIAddressSpace *as, hwaddr address, uint64_t dte,
+ uint64_t *pte, hwaddr *page_size)
{
IOMMUAccessFlags perms = amdvi_get_perms(dte);
@@ -659,9 +658,9 @@ fetch_pte(AMDVIAddressSpace *as, hwaddr address, uint64_t dte, uint64_t *pte,
* notifiers to sync the shadow page tables in the host.
* Must be called with a valid DTE for DMA remapping i.e. V=1,TV=1
*/
-static void __attribute__((unused))
-amdvi_sync_shadow_page_table_range(AMDVIAddressSpace *as, uint64_t *dte,
- hwaddr addr, uint64_t size, bool send_unmap)
+static void amdvi_sync_shadow_page_table_range(AMDVIAddressSpace *as,
+ uint64_t *dte, hwaddr addr,
+ uint64_t size, bool send_unmap)
{
IOMMUTLBEvent event;
@@ -798,8 +797,7 @@ static gboolean amdvi_iotlb_remove_by_domid(gpointer key, gpointer value,
* first zero at bit 51 or larger is a request to invalidate the entire address
* space.
*/
-static uint64_t __attribute__((unused))
-amdvi_decode_invalidation_size(hwaddr addr, uint16_t flags)
+static uint64_t amdvi_decode_invalidation_size(hwaddr addr, uint16_t flags)
{
uint64_t size = AMDVI_PAGE_SIZE;
uint8_t fzbit = 0;
@@ -816,10 +814,76 @@ amdvi_decode_invalidation_size(hwaddr addr, uint16_t flags)
return size;
}
+/*
+ * Synchronize the guest page tables with the shadow page tables kept in the
+ * host for the specified range.
+ * The invalidation command issued by the guest and intercepted by the VMM
+ * does not specify a device, but a domain, since all devices in the same domain
+ * share the same page tables. However, vIOMMU emulation creates separate
+ * address spaces per device, so it is necessary to traverse the list of all
+ * address spaces (i.e. devices) that have notifiers registered in order to
+ * propagate the changes to the host page tables.
+ * We cannot return early from this function once a matching domain has been
+ * identified and its page tables synced (based on the fact that all devices in
+ * the same domain share the page tables). The reason is that different devices
+ * (i.e. address spaces) could have different notifiers registered, and by
+ * skipping address spaces that appear later on the amdvi_as_with_notifiers list
+ * their notifiers (which could differ from the ones registered for the first
+ * device/address space) would not be invoked.
+ */
+static void amdvi_sync_domain(AMDVIState *s, uint16_t domid, uint64_t addr,
+ uint16_t flags)
+{
+ AMDVIAddressSpace *as;
+
+ uint64_t size = amdvi_decode_invalidation_size(addr, flags);
+
+ if (size == AMDVI_INV_ALL_PAGES) {
+ addr = 0; /* Set start address to 0 and invalidate entire AS */
+ } else {
+ addr &= ~(size - 1);
+ }
+
+ /*
+ * Call notifiers that have registered for each address space matching the
+ * domain ID, in order to sync the guest pagetable state with the host.
+ */
+ QLIST_FOREACH(as, &s->amdvi_as_with_notifiers, next) {
+
+ uint64_t dte[4] = { 0 };
+
+ /*
+ * Retrieve the Device Table entry for the devid corresponding to the
+ * current address space, and verify the DomainID matches i.e. the page
+ * tables to be synced belong to devices in the domain.
+ */
+ if (amdvi_as_to_dte(as, dte)) {
+ continue;
+ }
+
+ /* Only need to sync the Page Tables for a matching domain */
+ if (domid != (dte[1] & AMDVI_DEV_DOMID_ID_MASK)) {
+ continue;
+ }
+
+ /*
+ * We have determined that there is a valid Device Table Entry for a
+ * device matching the DomainID in the INV_IOMMU_PAGES command issued by
+ * the guest. Walk the guest page table to sync shadow page table.
+ */
+ if (as->notifier_flags & IOMMU_NOTIFIER_MAP) {
+ /* Sync guest IOMMU mappings with host */
+ amdvi_sync_shadow_page_table_range(as, &dte[0], addr, size, true);
+ }
+ }
+}
+
/* we don't have devid - we can't remove pages by address */
static void amdvi_inval_pages(AMDVIState *s, uint64_t *cmd)
{
uint16_t domid = cpu_to_le16((uint16_t)extract64(cmd[0], 32, 16));
+ uint64_t addr = cpu_to_le64(extract64(cmd[1], 12, 52)) << 12;
+ uint16_t flags = cpu_to_le16((uint16_t)extract64(cmd[1], 0, 3));
if (extract64(cmd[0], 20, 12) || extract64(cmd[0], 48, 12) ||
extract64(cmd[1], 3, 9)) {
@@ -829,6 +893,8 @@ static void amdvi_inval_pages(AMDVIState *s, uint64_t *cmd)
g_hash_table_foreach_remove(s->iotlb, amdvi_iotlb_remove_by_domid,
&domid);
+
+ amdvi_sync_domain(s, domid, addr, flags);
trace_amdvi_pages_inval(domid);
}
--
2.43.5
* [PATCH v2 11/20] amd_iommu: Use iova_tree records to determine large page size on UNMAP
2025-05-02 2:15 [PATCH v2 00/20] AMD vIOMMU: DMA remapping support for VFIO devices Alejandro Jimenez
` (9 preceding siblings ...)
2025-05-02 2:15 ` [PATCH v2 10/20] amd_iommu: Sync shadow page tables on page invalidation Alejandro Jimenez
@ 2025-05-02 2:15 ` Alejandro Jimenez
2025-06-11 8:29 ` Sairaj Kodilkar
2025-05-02 2:15 ` [PATCH v2 12/20] amd_iommu: Unmap all address spaces under the AMD IOMMU on reset Alejandro Jimenez
` (11 subsequent siblings)
22 siblings, 1 reply; 54+ messages in thread
From: Alejandro Jimenez @ 2025-05-02 2:15 UTC (permalink / raw)
To: qemu-devel
Cc: pbonzini, richard.henderson, eduardo, peterx, david, philmd, mst,
marcel.apfelbaum, alex.williamson, vasant.hegde,
suravee.suthikulpanit, santosh.shukla, sarunkod, Wei.Huang2,
clement.mathieu--drif, ethan.milon, joao.m.martins,
boris.ostrovsky, alejandro.j.jimenez
Keep a record of mapped IOVA ranges per address space, using the iova_tree
implementation. Besides enabling optimizations like avoiding unnecessary
notifications, a record of existing <IOVA, size> mappings makes it possible
to determine if a specific IOVA is mapped by the guest using a large page,
and adjust the size when notifying UNMAP events.
When unmapping a large page, the information in the guest PTE encoding the
page size is lost, since the guest clears the PTE before issuing the
invalidation command to the IOMMU. In such case, the size of the original
mapping can be retrieved from the iova_tree and used to issue the UNMAP
notification. Using the correct size is essential since the VFIO IOMMU
Type1v2 driver in the host kernel will reject unmap requests that do not
fully cover previous mappings.
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
---
hw/i386/amd_iommu.c | 91 ++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 86 insertions(+), 5 deletions(-)
diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
index bddfe2f93136..4f44ef159ff9 100644
--- a/hw/i386/amd_iommu.c
+++ b/hw/i386/amd_iommu.c
@@ -33,6 +33,7 @@
#include "hw/i386/apic-msidef.h"
#include "hw/qdev-properties.h"
#include "kvm/kvm_i386.h"
+#include "qemu/iova-tree.h"
/* used AMD-Vi MMIO registers */
const char *amdvi_mmio_low[] = {
@@ -71,6 +72,8 @@ struct AMDVIAddressSpace {
IOMMUNotifierFlag notifier_flags;
/* entry in list of Address spaces with registered notifiers */
QLIST_ENTRY(AMDVIAddressSpace) next;
+ /* Record DMA translation ranges */
+ IOVATree *iova_tree;
};
/* AMDVI cache entry */
@@ -653,6 +656,75 @@ static uint64_t fetch_pte(AMDVIAddressSpace *as, hwaddr address, uint64_t dte,
return 0;
}
+/*
+ * Invoke notifiers registered for the address space. Update record of mapped
+ * ranges in IOVA Tree.
+ */
+static void amdvi_notify_iommu(AMDVIAddressSpace *as, IOMMUTLBEvent *event)
+{
+ IOMMUTLBEntry *entry = &event->entry;
+
+ DMAMap target = {
+ .iova = entry->iova,
+ .size = entry->addr_mask,
+ .translated_addr = entry->translated_addr,
+ .perm = entry->perm,
+ };
+
+ /*
+ * Search the IOVA Tree for an existing translation for the target, and skip
+ * the notification if the mapping is already recorded.
+ * When the guest uses large pages, comparing against the record makes it
+ * possible to determine the size of the original MAP and adjust the UNMAP
+ * request to match it. This avoids failed checks against the mappings kept
+ * by the VFIO kernel driver.
+ */
+ const DMAMap *mapped = iova_tree_find(as->iova_tree, &target);
+
+ if (event->type == IOMMU_NOTIFIER_UNMAP) {
+ if (!mapped) {
+ /* No record exists of this mapping, nothing to do */
+ return;
+ }
+ /*
+ * Adjust the size based on the original record. This is essential to
+ * determine when large/contiguous pages are used, since the guest has
+ * already cleared the PTE (erasing the pagesize encoded on it) before
+ * issuing the invalidation command.
+ */
+ if (mapped->size != target.size) {
+ assert(mapped->size > target.size);
+ target.size = mapped->size;
+ /* Adjust event to invoke notifier with correct range */
+ entry->addr_mask = mapped->size;
+ }
+ iova_tree_remove(as->iova_tree, target);
+ } else { /* IOMMU_NOTIFIER_MAP */
+ if (mapped) {
+ /*
+ * If a mapping is present and matches the request, skip the
+ * notification.
+ */
+ if (!memcmp(mapped, &target, sizeof(DMAMap))) {
+ return;
+ } else {
+ /*
+ * This should never happen unless a buggy guest OS omits or
+ * sends incorrect invalidation(s). Report an error in the event
+ * it does happen.
+ */
+ error_report("Found conflicting translation. This could be due "
+ "to an incorrect or missing invalidation command");
+ }
+ }
+ /* Record the new mapping */
+ iova_tree_insert(as->iova_tree, &target);
+ }
+
+ /* Invoke the notifiers registered for this address space */
+ memory_region_notify_iommu(&as->iommu, 0, *event);
+}
+
/*
* Walk the guest page table for an IOVA and range and signal the registered
* notifiers to sync the shadow page tables in the host.
@@ -664,7 +736,7 @@ static void amdvi_sync_shadow_page_table_range(AMDVIAddressSpace *as,
{
IOMMUTLBEvent event;
- hwaddr iova_next, page_mask, pagesize;
+ hwaddr page_mask, pagesize;
hwaddr iova = addr;
hwaddr end = iova + size - 1;
@@ -687,7 +759,6 @@ static void amdvi_sync_shadow_page_table_range(AMDVIAddressSpace *as,
/* PTE has been validated for major errors and pagesize is set */
assert(pagesize);
page_mask = ~(pagesize - 1);
- iova_next = (iova & page_mask) + pagesize;
if (ret == -AMDVI_FR_PT_ENTRY_INV) {
/*
@@ -720,11 +791,20 @@ static void amdvi_sync_shadow_page_table_range(AMDVIAddressSpace *as,
event.type = IOMMU_NOTIFIER_MAP;
}
- /* Invoke the notifiers registered for this address space */
- memory_region_notify_iommu(&as->iommu, 0, event);
+ /*
+ * The following call might need to adjust event.entry.size in cases
+ * where the guest unmapped a series of large pages.
+ */
+ amdvi_notify_iommu(as, &event);
+ /*
+ * In the special scenario where the guest is unmapping a large page,
+ * addr_mask has been adjusted before sending the notification. Update
+ * pagesize accordingly in order to correctly compute the next IOVA.
+ */
+ pagesize = event.entry.addr_mask + 1;
next:
- iova = iova_next;
+ iova = (iova & ~(pagesize - 1)) + pagesize;
}
}
@@ -1783,6 +1863,7 @@ static AddressSpace *amdvi_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
iommu_as[devfn]->devfn = (uint8_t)devfn;
iommu_as[devfn]->iommu_state = s;
iommu_as[devfn]->notifier_flags = IOMMU_NONE;
+ iommu_as[devfn]->iova_tree = iova_tree_new();
amdvi_dev_as = iommu_as[devfn];
--
2.43.5
* Re: [PATCH v2 11/20] amd_iommu: Use iova_tree records to determine large page size on UNMAP
2025-05-02 2:15 ` [PATCH v2 11/20] amd_iommu: Use iova_tree records to determine large page size on UNMAP Alejandro Jimenez
@ 2025-06-11 8:29 ` Sairaj Kodilkar
2025-06-13 21:50 ` Alejandro Jimenez
0 siblings, 1 reply; 54+ messages in thread
From: Sairaj Kodilkar @ 2025-06-11 8:29 UTC (permalink / raw)
To: Alejandro Jimenez, qemu-devel
Cc: pbonzini, richard.henderson, eduardo, peterx, david, philmd, mst,
marcel.apfelbaum, alex.williamson, vasant.hegde,
suravee.suthikulpanit, santosh.shukla, Wei.Huang2,
clement.mathieu--drif, ethan.milon, joao.m.martins,
boris.ostrovsky
On 5/2/2025 7:45 AM, Alejandro Jimenez wrote:
> Keep a record of mapped IOVA ranges per address space, using the iova_tree
> implementation. Besides enabling optimizations like avoiding unnecessary
> notifications, a record of existing <IOVA, size> mappings makes it possible
> to determine if a specific IOVA is mapped by the guest using a large page,
> and adjust the size when notifying UNMAP events.
>
> When unmapping a large page, the information in the guest PTE encoding the
> page size is lost, since the guest clears the PTE before issuing the
> invalidation command to the IOMMU. In such case, the size of the original
> mapping can be retrieved from the iova_tree and used to issue the UNMAP
> notification. Using the correct size is essential since the VFIO IOMMU
> Type1v2 driver in the host kernel will reject unmap requests that do not
> fully cover previous mappings.
>
> Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
> ---
> hw/i386/amd_iommu.c | 91 ++++++++++++++++++++++++++++++++++++++++++---
> 1 file changed, 86 insertions(+), 5 deletions(-)
>
> diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
> index bddfe2f93136..4f44ef159ff9 100644
> --- a/hw/i386/amd_iommu.c
> +++ b/hw/i386/amd_iommu.c
> @@ -33,6 +33,7 @@
> #include "hw/i386/apic-msidef.h"
> #include "hw/qdev-properties.h"
> #include "kvm/kvm_i386.h"
> +#include "qemu/iova-tree.h"
>
> /* used AMD-Vi MMIO registers */
> const char *amdvi_mmio_low[] = {
> @@ -71,6 +72,8 @@ struct AMDVIAddressSpace {
> IOMMUNotifierFlag notifier_flags;
> /* entry in list of Address spaces with registered notifiers */
> QLIST_ENTRY(AMDVIAddressSpace) next;
> + /* Record DMA translation ranges */
> + IOVATree *iova_tree;
> };
>
> /* AMDVI cache entry */
> @@ -653,6 +656,75 @@ static uint64_t fetch_pte(AMDVIAddressSpace *as, hwaddr address, uint64_t dte,
> return 0;
> }
>
> +/*
> + * Invoke notifiers registered for the address space. Update record of mapped
> + * ranges in IOVA Tree.
> + */
> +static void amdvi_notify_iommu(AMDVIAddressSpace *as, IOMMUTLBEvent *event)
> +{
> + IOMMUTLBEntry *entry = &event->entry;
> +
> + DMAMap target = {
> + .iova = entry->iova,
> + .size = entry->addr_mask,
> + .translated_addr = entry->translated_addr,
> + .perm = entry->perm,
> + };
> +
> + /*
> + * Search the IOVA Tree for an existing translation for the target, and skip
> + * the notification if the mapping is already recorded.
> + * When the guest uses large pages, comparing against the record makes it
> + * possible to determine the size of the original MAP and adjust the UNMAP
> + * request to match it. This avoids failed checks against the mappings kept
> + * by the VFIO kernel driver.
> + */
> + const DMAMap *mapped = iova_tree_find(as->iova_tree, &target);
> +
> + if (event->type == IOMMU_NOTIFIER_UNMAP) {
> + if (!mapped) {
> + /* No record exists of this mapping, nothing to do */
> + return;
> + }
> + /*
> + * Adjust the size based on the original record. This is essential to
> + * determine when large/contiguous pages are used, since the guest has
> + * already cleared the PTE (erasing the pagesize encoded on it) before
> + * issuing the invalidation command.
> + */
> + if (mapped->size != target.size) {
> + assert(mapped->size > target.size);
> + target.size = mapped->size;
> + /* Adjust event to invoke notifier with correct range */
> + entry->addr_mask = mapped->size;
> + }
> + iova_tree_remove(as->iova_tree, target);
> + } else { /* IOMMU_NOTIFIER_MAP */
> + if (mapped) {
> + /*
> + * If a mapping is present and matches the request, skip the
> + * notification.
> + */
> + if (!memcmp(mapped, &target, sizeof(DMAMap))) {
> + return;
> + } else {
> + /*
> + * This should never happen unless a buggy guest OS omits or
> + * sends incorrect invalidation(s). Report an error in the event
> + * it does happen.
> + */
> + error_report("Found conflicting translation. This could be due "
> + "to an incorrect or missing invalidation command");
> + }
> + }
> + /* Record the new mapping */
> + iova_tree_insert(as->iova_tree, &target);
> + }
> +
> + /* Invoke the notifiers registered for this address space */
> + memory_region_notify_iommu(&as->iommu, 0, *event);
> +}
> +
> /*
> * Walk the guest page table for an IOVA and range and signal the registered
> * notifiers to sync the shadow page tables in the host.
> @@ -664,7 +736,7 @@ static void amdvi_sync_shadow_page_table_range(AMDVIAddressSpace *as,
> {
> IOMMUTLBEvent event;
>
> - hwaddr iova_next, page_mask, pagesize;
> + hwaddr page_mask, pagesize;
> hwaddr iova = addr;
> hwaddr end = iova + size - 1;
>
> @@ -687,7 +759,6 @@ static void amdvi_sync_shadow_page_table_range(AMDVIAddressSpace *as,
> /* PTE has been validated for major errors and pagesize is set */
> assert(pagesize);
> page_mask = ~(pagesize - 1);
> - iova_next = (iova & page_mask) + pagesize;
>
> if (ret == -AMDVI_FR_PT_ENTRY_INV) {
> /*
> @@ -720,11 +791,20 @@ static void amdvi_sync_shadow_page_table_range(AMDVIAddressSpace *as,
> event.type = IOMMU_NOTIFIER_MAP;
> }
>
> - /* Invoke the notifiers registered for this address space */
> - memory_region_notify_iommu(&as->iommu, 0, event);
> + /*
> + * The following call might need to adjust event.entry.size in cases
> + * where the guest unmapped a series of large pages.
> + */
> + amdvi_notify_iommu(as, &event);
> + /*
> + * In the special scenario where the guest is unmapping a large page,
> + * addr_mask has been adjusted before sending the notification. Update
> + * pagesize accordingly in order to correctly compute the next IOVA.
> + */
> + pagesize = event.entry.addr_mask + 1;
>
> next:
> - iova = iova_next;
> + iova = (iova & ~(pagesize - 1)) + pagesize;
Hi Alejandro,
While experimenting with iommu.forcedac=1, I found that the above line
causes an unsigned integer overflow for 64-bit IOVAs. This results in an
infinite loop.
Please add an overflow check here.
Thanks
Sairaj Kodilkar
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH v2 11/20] amd_iommu: Use iova_tree records to determine large page size on UNMAP
2025-06-11 8:29 ` Sairaj Kodilkar
@ 2025-06-13 21:50 ` Alejandro Jimenez
0 siblings, 0 replies; 54+ messages in thread
From: Alejandro Jimenez @ 2025-06-13 21:50 UTC (permalink / raw)
To: Sairaj Kodilkar, qemu-devel
Cc: pbonzini, richard.henderson, eduardo, peterx, david, philmd, mst,
marcel.apfelbaum, alex.williamson, vasant.hegde,
suravee.suthikulpanit, santosh.shukla, Wei.Huang2,
clement.mathieu--drif, ethan.milon, joao.m.martins,
boris.ostrovsky
On 6/11/25 4:29 AM, Sairaj Kodilkar wrote:
>
>
> On 5/2/2025 7:45 AM, Alejandro Jimenez wrote:
>> next:
>> - iova = iova_next;
>> + iova = (iova & ~(pagesize - 1)) + pagesize;
>
> Hi Alejandro,
> While experimenting with iommu.forcedac=1, I found that the above line
> causes an unsigned integer overflow for 64-bit IOVAs. This results in an
> infinite loop.
>
> Please add an overflow check here.
>
Good catch. Reproduced it and tested the fix; will include it in next
revision.
Thank you,
Alejandro
> Thanks
> Sairaj Kodilkar
>
^ permalink raw reply [flat|nested] 54+ messages in thread
* [PATCH v2 12/20] amd_iommu: Unmap all address spaces under the AMD IOMMU on reset
2025-05-02 2:15 [PATCH v2 00/20] AMD vIOMMU: DMA remapping support for VFIO devices Alejandro Jimenez
` (10 preceding siblings ...)
2025-05-02 2:15 ` [PATCH v2 11/20] amd_iommu: Use iova_tree records to determine large page size on UNMAP Alejandro Jimenez
@ 2025-05-02 2:15 ` Alejandro Jimenez
2025-05-02 2:15 ` [PATCH v2 13/20] amd_iommu: Add replay callback Alejandro Jimenez
` (10 subsequent siblings)
22 siblings, 0 replies; 54+ messages in thread
From: Alejandro Jimenez @ 2025-05-02 2:15 UTC (permalink / raw)
To: qemu-devel
Cc: pbonzini, richard.henderson, eduardo, peterx, david, philmd, mst,
marcel.apfelbaum, alex.williamson, vasant.hegde,
suravee.suthikulpanit, santosh.shukla, sarunkod, Wei.Huang2,
clement.mathieu--drif, ethan.milon, joao.m.martins,
boris.ostrovsky, alejandro.j.jimenez
Support dropping all existing mappings on reset. When the guest kernel
reboots it will create new ones, but other components that run before
the kernel (e.g. OVMF) should not be able to use existing mappings from
the previous boot.
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
---
hw/i386/amd_iommu.c | 74 +++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 74 insertions(+)
diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
index 4f44ef159ff9..7bcba47a01ba 100644
--- a/hw/i386/amd_iommu.c
+++ b/hw/i386/amd_iommu.c
@@ -808,6 +808,77 @@ next:
}
}
+/*
> + * Unmap the entire range that the notifier registered for, i.e. the full AS.
> + *
> + * This is seemingly equivalent to directly calling
> + * memory_region_unmap_iommu_notifier_range(), but it allows checking the
> + * notifier boundaries and issuing notifications with ranges within those
> + * bounds.
+ */
+static void amdvi_address_space_unmap(AMDVIAddressSpace *as, IOMMUNotifier *n)
+{
+
+ hwaddr start = n->start;
+ hwaddr end = n->end;
+ hwaddr remain;
+ DMAMap map;
+
+ assert(start <= end);
+ remain = end - start + 1;
+
+ /*
+ * Divide the notifier range into chunks that are aligned and do not exceed
+ * the notifier boundaries.
+ */
+ while (remain >= AMDVI_PAGE_SIZE) {
+
+ IOMMUTLBEvent event;
+
+ uint64_t mask = dma_aligned_pow2_mask(start, end, 64);
+
+ event.type = IOMMU_NOTIFIER_UNMAP;
+
+ IOMMUTLBEntry entry = {
+ .target_as = &address_space_memory,
+ .iova = start,
+ .translated_addr = 0, /* irrelevant for unmap case */
+ .addr_mask = mask,
+ .perm = IOMMU_NONE,
+ };
+ event.entry = entry;
+
+ /* Call notifier registered for updates on this address space */
+ memory_region_notify_iommu_one(n, &event);
+
+ start += mask + 1;
+ remain -= mask + 1;
+ }
+
+ assert(!remain);
+
+ map.iova = n->start;
+ map.size = n->end - n->start;
+
+ iova_tree_remove(as->iova_tree, map);
+}
+
+/*
+ * For all the address spaces with notifiers registered, unmap the entire range
> + * the notifier registered for, i.e. clear all the address spaces managed by the
+ * IOMMU.
+ */
+static void amdvi_address_space_unmap_all(AMDVIState *s)
+{
+ AMDVIAddressSpace *as;
+ IOMMUNotifier *n;
+
+ QLIST_FOREACH(as, &s->amdvi_as_with_notifiers, next) {
+ IOMMU_NOTIFIER_FOREACH(n, &as->iommu) {
+ amdvi_address_space_unmap(as, n);
+ }
+ }
+}
+
/* log error without aborting since linux seems to be using reserved bits */
static void amdvi_inval_devtab_entry(AMDVIState *s, uint64_t *cmd)
{
@@ -2043,6 +2114,9 @@ static void amdvi_sysbus_reset(DeviceState *dev)
msi_reset(&s->pci.dev);
amdvi_init(s);
+
+ /* Discard all mappings on device reset */
+ amdvi_address_space_unmap_all(s);
}
static void amdvi_sysbus_realize(DeviceState *dev, Error **errp)
--
2.43.5
^ permalink raw reply related [flat|nested] 54+ messages in thread
* [PATCH v2 13/20] amd_iommu: Add replay callback
2025-05-02 2:15 [PATCH v2 00/20] AMD vIOMMU: DMA remapping support for VFIO devices Alejandro Jimenez
` (11 preceding siblings ...)
2025-05-02 2:15 ` [PATCH v2 12/20] amd_iommu: Unmap all address spaces under the AMD IOMMU on reset Alejandro Jimenez
@ 2025-05-02 2:15 ` Alejandro Jimenez
2025-05-02 2:15 ` [PATCH v2 14/20] amd_iommu: Invalidate address translations on INVALIDATE_IOMMU_ALL Alejandro Jimenez
` (9 subsequent siblings)
22 siblings, 0 replies; 54+ messages in thread
From: Alejandro Jimenez @ 2025-05-02 2:15 UTC (permalink / raw)
To: qemu-devel
Cc: pbonzini, richard.henderson, eduardo, peterx, david, philmd, mst,
marcel.apfelbaum, alex.williamson, vasant.hegde,
suravee.suthikulpanit, santosh.shukla, sarunkod, Wei.Huang2,
clement.mathieu--drif, ethan.milon, joao.m.martins,
boris.ostrovsky, alejandro.j.jimenez
A replay() method is necessary to efficiently synchronize the host page
tables after VFIO registers a notifier for IOMMU events. It is called to
ensure that existing mappings from an IOMMU memory region are "replayed" to
a specified notifier, initializing or updating the shadow page tables on the
host.
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
---
hw/i386/amd_iommu.c | 24 ++++++++++++++++++++++++
1 file changed, 24 insertions(+)
diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
index 7bcba47a01ba..5ce74f2c052d 100644
--- a/hw/i386/amd_iommu.c
+++ b/hw/i386/amd_iommu.c
@@ -879,6 +879,29 @@ static void amdvi_address_space_unmap_all(AMDVIState *s)
}
}
+/*
+ * For every translation present in the IOMMU, construct IOMMUTLBEntry data
> + * and pass it as a parameter to the notifier callback.
+ */
+static void amdvi_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
+{
+ AMDVIAddressSpace *as = container_of(iommu_mr, AMDVIAddressSpace, iommu);
+ uint64_t dte[4] = { 0 };
+
+ if (!(n->notifier_flags & IOMMU_NOTIFIER_MAP)) {
+ return;
+ }
+
+ if (amdvi_as_to_dte(as, dte)) {
+ return;
+ }
+
+ /* Dropping all mappings for the address space. Also clears the IOVA tree */
+ amdvi_address_space_unmap(as, n);
+
+ amdvi_sync_shadow_page_table_range(as, &dte[0], 0, UINT64_MAX, false);
+}
+
/* log error without aborting since linux seems to be using reserved bits */
static void amdvi_inval_devtab_entry(AMDVIState *s, uint64_t *cmd)
{
@@ -2240,6 +2263,7 @@ static void amdvi_iommu_memory_region_class_init(ObjectClass *klass,
imrc->translate = amdvi_translate;
imrc->notify_flag_changed = amdvi_iommu_notify_flag_changed;
+ imrc->replay = amdvi_iommu_replay;
}
static const TypeInfo amdvi_iommu_memory_region_info = {
--
2.43.5
^ permalink raw reply related [flat|nested] 54+ messages in thread
* [PATCH v2 14/20] amd_iommu: Invalidate address translations on INVALIDATE_IOMMU_ALL
2025-05-02 2:15 [PATCH v2 00/20] AMD vIOMMU: DMA remapping support for VFIO devices Alejandro Jimenez
` (12 preceding siblings ...)
2025-05-02 2:15 ` [PATCH v2 13/20] amd_iommu: Add replay callback Alejandro Jimenez
@ 2025-05-02 2:15 ` Alejandro Jimenez
2025-05-02 2:16 ` [PATCH v2 15/20] amd_iommu: Toggle memory regions based on address translation mode Alejandro Jimenez
` (8 subsequent siblings)
22 siblings, 0 replies; 54+ messages in thread
From: Alejandro Jimenez @ 2025-05-02 2:15 UTC (permalink / raw)
To: qemu-devel
Cc: pbonzini, richard.henderson, eduardo, peterx, david, philmd, mst,
marcel.apfelbaum, alex.williamson, vasant.hegde,
suravee.suthikulpanit, santosh.shukla, sarunkod, Wei.Huang2,
clement.mathieu--drif, ethan.milon, joao.m.martins,
boris.ostrovsky, alejandro.j.jimenez
When the kernel IOMMU driver issues an INVALIDATE_IOMMU_ALL, the address
translation and interrupt remapping information must be cleared for all
Device IDs and all domains. Introduce a helper to sync the shadow page table
for all the address spaces with registered notifiers, which replays both MAP
and UNMAP events.
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
---
hw/i386/amd_iommu.c | 48 +++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 48 insertions(+)
diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
index 5ce74f2c052d..fa5dbc3cc700 100644
--- a/hw/i386/amd_iommu.c
+++ b/hw/i386/amd_iommu.c
@@ -902,6 +902,47 @@ static void amdvi_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
amdvi_sync_shadow_page_table_range(as, &dte[0], 0, UINT64_MAX, false);
}
+static void amdvi_address_space_sync(AMDVIAddressSpace *as)
+{
+ IOMMUNotifier *n;
+ uint64_t dte[4] = { 0 };
+
+ /* If only UNMAP notifiers are registered, drop all existing mappings */
+ if (!(as->notifier_flags & IOMMU_NOTIFIER_MAP)) {
+ IOMMU_NOTIFIER_FOREACH(n, &as->iommu) {
+ /*
+ * Directly calling memory_region_unmap_iommu_notifier_range() does
+ * not guarantee that the addr_mask eventually passed as parameter
+ * to the notifier is valid. Use amdvi_address_space_unmap() which
+ * ensures the notifier range is divided into properly aligned
+ * regions, and issues notifications for each one.
+ */
+ amdvi_address_space_unmap(as, n);
+ }
+ return;
+ }
+
+ if (amdvi_as_to_dte(as, dte)) {
+ return;
+ }
+
+ amdvi_sync_shadow_page_table_range(as, &dte[0], 0, UINT64_MAX, true);
+}
+
+/*
+ * This differs from the replay() method in that it issues both MAP and UNMAP
+ * notifications since it is called after global invalidation events in order to
+ * re-sync all address spaces.
+ */
+static void amdvi_iommu_address_space_sync_all(AMDVIState *s)
+{
+ AMDVIAddressSpace *as;
+
+ QLIST_FOREACH(as, &s->amdvi_as_with_notifiers, next) {
+ amdvi_address_space_sync(as);
+ }
+}
+
/* log error without aborting since linux seems to be using reserved bits */
static void amdvi_inval_devtab_entry(AMDVIState *s, uint64_t *cmd)
{
@@ -944,6 +985,13 @@ static void amdvi_inval_all(AMDVIState *s, uint64_t *cmd)
amdvi_intremap_inval_notify_all(s, true, 0, 0);
amdvi_iotlb_reset(s);
+
+ /*
> + * Fully replay the address space, i.e. send both UNMAP and MAP events in
> + * order to synchronize guest and host IO page tables.
+ */
+ amdvi_iommu_address_space_sync_all(s);
+
trace_amdvi_all_inval();
}
--
2.43.5
^ permalink raw reply related [flat|nested] 54+ messages in thread
* [PATCH v2 15/20] amd_iommu: Toggle memory regions based on address translation mode
2025-05-02 2:15 [PATCH v2 00/20] AMD vIOMMU: DMA remapping support for VFIO devices Alejandro Jimenez
` (13 preceding siblings ...)
2025-05-02 2:15 ` [PATCH v2 14/20] amd_iommu: Invalidate address translations on INVALIDATE_IOMMU_ALL Alejandro Jimenez
@ 2025-05-02 2:16 ` Alejandro Jimenez
2025-05-12 6:52 ` Sairaj Kodilkar
2025-05-02 2:16 ` [PATCH v2 16/20] amd_iommu: Set all address spaces to default translation mode on reset Alejandro Jimenez
` (7 subsequent siblings)
22 siblings, 1 reply; 54+ messages in thread
From: Alejandro Jimenez @ 2025-05-02 2:16 UTC (permalink / raw)
To: qemu-devel
Cc: pbonzini, richard.henderson, eduardo, peterx, david, philmd, mst,
marcel.apfelbaum, alex.williamson, vasant.hegde,
suravee.suthikulpanit, santosh.shukla, sarunkod, Wei.Huang2,
clement.mathieu--drif, ethan.milon, joao.m.martins,
boris.ostrovsky, alejandro.j.jimenez
Enable the appropriate memory region for an address space depending on the
address translation mode selected for it. This is currently based on a
generic x86 IOMMMU property, and only done during the address space
initialization. Extract the code into a helper and toggle the regions based
on whether the specific address space is using address translation (via the
newly introduced addr_translation field). Later, region activation will also
be controlled by availability of DMA remapping capability (via dma-remap
property to be introduced in follow up changes).
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
---
hw/i386/amd_iommu.c | 31 +++++++++++++++++++++----------
1 file changed, 21 insertions(+), 10 deletions(-)
diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
index fa5dbc3cc700..71018d70dd10 100644
--- a/hw/i386/amd_iommu.c
+++ b/hw/i386/amd_iommu.c
@@ -74,6 +74,8 @@ struct AMDVIAddressSpace {
QLIST_ENTRY(AMDVIAddressSpace) next;
/* Record DMA translation ranges */
IOVATree *iova_tree;
+ /* DMA address translation active */
+ bool addr_translation;
};
/* AMDVI cache entry */
@@ -943,6 +945,23 @@ static void amdvi_iommu_address_space_sync_all(AMDVIState *s)
}
}
+/*
+ * Toggle between address translation and passthrough modes by enabling the
+ * corresponding memory regions.
+ */
+static void amdvi_switch_address_space(AMDVIAddressSpace *amdvi_as)
+{
+ if (amdvi_as->addr_translation) {
+ /* Enabling DMA region */
+ memory_region_set_enabled(&amdvi_as->iommu_nodma, false);
+ memory_region_set_enabled(MEMORY_REGION(&amdvi_as->iommu), true);
+ } else {
+ /* Disabling DMA region, using passthrough */
+ memory_region_set_enabled(MEMORY_REGION(&amdvi_as->iommu), false);
+ memory_region_set_enabled(&amdvi_as->iommu_nodma, true);
+ }
+}
+
/* log error without aborting since linux seems to be using reserved bits */
static void amdvi_inval_devtab_entry(AMDVIState *s, uint64_t *cmd)
{
@@ -1986,7 +2005,6 @@ static AddressSpace *amdvi_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
AMDVIState *s = opaque;
AMDVIAddressSpace **iommu_as, *amdvi_dev_as;
int bus_num = pci_bus_num(bus);
- X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(s);
iommu_as = s->address_spaces[bus_num];
@@ -2006,6 +2024,7 @@ static AddressSpace *amdvi_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
iommu_as[devfn]->iommu_state = s;
iommu_as[devfn]->notifier_flags = IOMMU_NONE;
iommu_as[devfn]->iova_tree = iova_tree_new();
+ iommu_as[devfn]->addr_translation = false;
amdvi_dev_as = iommu_as[devfn];
@@ -2048,15 +2067,7 @@ static AddressSpace *amdvi_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
AMDVI_INT_ADDR_FIRST,
&amdvi_dev_as->iommu_ir, 1);
- if (!x86_iommu->pt_supported) {
- memory_region_set_enabled(&amdvi_dev_as->iommu_nodma, false);
- memory_region_set_enabled(MEMORY_REGION(&amdvi_dev_as->iommu),
- true);
- } else {
- memory_region_set_enabled(MEMORY_REGION(&amdvi_dev_as->iommu),
- false);
- memory_region_set_enabled(&amdvi_dev_as->iommu_nodma, true);
- }
+ amdvi_switch_address_space(amdvi_dev_as);
}
return &iommu_as[devfn]->as;
}
--
2.43.5
^ permalink raw reply related [flat|nested] 54+ messages in thread
* Re: [PATCH v2 15/20] amd_iommu: Toggle memory regions based on address translation mode
2025-05-02 2:16 ` [PATCH v2 15/20] amd_iommu: Toggle memory regions based on address translation mode Alejandro Jimenez
@ 2025-05-12 6:52 ` Sairaj Kodilkar
0 siblings, 0 replies; 54+ messages in thread
From: Sairaj Kodilkar @ 2025-05-12 6:52 UTC (permalink / raw)
To: Alejandro Jimenez, qemu-devel
Cc: pbonzini, richard.henderson, eduardo, peterx, david, philmd, mst,
marcel.apfelbaum, alex.williamson, vasant.hegde,
suravee.suthikulpanit, santosh.shukla, Wei.Huang2,
clement.mathieu--drif, ethan.milon, joao.m.martins,
boris.ostrovsky
On 5/2/2025 7:46 AM, Alejandro Jimenez wrote:
> Enable the appropriate memory region for an address space depending on the
> address translation mode selected for it. This is currently based on a
> generic x86 IOMMMU property, and only done during the address space
s/IOMMMU/IOMMU
> initialization. Extract the code into a helper and toggle the regions based
> on whether the specific address space is using address translation (via the
> newly introduced addr_translation field). Later, region activation will also
> be controlled by availability of DMA remapping capability (via dma-remap
> property to be introduced in follow up changes).
>
> Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Regards
Sairaj Kodilkar
^ permalink raw reply [flat|nested] 54+ messages in thread
* [PATCH v2 16/20] amd_iommu: Set all address spaces to default translation mode on reset
2025-05-02 2:15 [PATCH v2 00/20] AMD vIOMMU: DMA remapping support for VFIO devices Alejandro Jimenez
` (14 preceding siblings ...)
2025-05-02 2:16 ` [PATCH v2 15/20] amd_iommu: Toggle memory regions based on address translation mode Alejandro Jimenez
@ 2025-05-02 2:16 ` Alejandro Jimenez
2025-05-29 6:16 ` Sairaj Kodilkar
2025-05-02 2:16 ` [PATCH v2 17/20] amd_iommu: Add dma-remap property to AMD vIOMMU device Alejandro Jimenez
` (6 subsequent siblings)
22 siblings, 1 reply; 54+ messages in thread
From: Alejandro Jimenez @ 2025-05-02 2:16 UTC (permalink / raw)
To: qemu-devel
Cc: pbonzini, richard.henderson, eduardo, peterx, david, philmd, mst,
marcel.apfelbaum, alex.williamson, vasant.hegde,
suravee.suthikulpanit, santosh.shukla, sarunkod, Wei.Huang2,
clement.mathieu--drif, ethan.milon, joao.m.martins,
boris.ostrovsky, alejandro.j.jimenez
On reset, restore the default address translation mode for all the
address spaces managed by the vIOMMU.
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
---
hw/i386/amd_iommu.c | 28 ++++++++++++++++++++++++++++
1 file changed, 28 insertions(+)
diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
index 71018d70dd10..90491367594b 100644
--- a/hw/i386/amd_iommu.c
+++ b/hw/i386/amd_iommu.c
@@ -962,6 +962,33 @@ static void amdvi_switch_address_space(AMDVIAddressSpace *amdvi_as)
}
}
+/*
+ * For all existing address spaces managed by the IOMMU, enable/disable the
+ * corresponding memory regions depending on the address translation mode
+ * as determined by the global and individual address space settings.
+ */
+static void amdvi_switch_address_space_all(AMDVIState *s)
+{
+ AMDVIAddressSpace **iommu_as;
+
+ for (int bus_num = 0; bus_num < PCI_BUS_MAX; bus_num++) {
+
+ /* Nothing to do if there are no devices on the current bus */
+ if (!s->address_spaces[bus_num]) {
+ continue;
+ }
+ iommu_as = s->address_spaces[bus_num];
+
+ for (int devfn = 0; devfn < PCI_DEVFN_MAX; devfn++) {
+
+ if (!iommu_as[devfn]) {
+ continue;
+ }
+ amdvi_switch_address_space(iommu_as[devfn]);
+ }
+ }
+}
+
/* log error without aborting since linux seems to be using reserved bits */
static void amdvi_inval_devtab_entry(AMDVIState *s, uint64_t *cmd)
{
@@ -2199,6 +2226,7 @@ static void amdvi_sysbus_reset(DeviceState *dev)
/* Discard all mappings on device reset */
amdvi_address_space_unmap_all(s);
+ amdvi_switch_address_space_all(s);
}
static void amdvi_sysbus_realize(DeviceState *dev, Error **errp)
--
2.43.5
^ permalink raw reply related [flat|nested] 54+ messages in thread
* Re: [PATCH v2 16/20] amd_iommu: Set all address spaces to default translation mode on reset
2025-05-02 2:16 ` [PATCH v2 16/20] amd_iommu: Set all address spaces to default translation mode on reset Alejandro Jimenez
@ 2025-05-29 6:16 ` Sairaj Kodilkar
2025-05-30 21:30 ` Alejandro Jimenez
0 siblings, 1 reply; 54+ messages in thread
From: Sairaj Kodilkar @ 2025-05-29 6:16 UTC (permalink / raw)
To: Alejandro Jimenez, qemu-devel
Cc: pbonzini, richard.henderson, eduardo, peterx, david, philmd, mst,
marcel.apfelbaum, alex.williamson, vasant.hegde,
suravee.suthikulpanit, santosh.shukla, Wei.Huang2,
clement.mathieu--drif, ethan.milon, joao.m.martins,
boris.ostrovsky
On 5/2/2025 7:46 AM, Alejandro Jimenez wrote:
> On reset, restore the default address translation mode for all the
> address spaces managed by the vIOMMU.
>
> Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
> ---
> hw/i386/amd_iommu.c | 28 ++++++++++++++++++++++++++++
> 1 file changed, 28 insertions(+)
>
> diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
> index 71018d70dd10..90491367594b 100644
> --- a/hw/i386/amd_iommu.c
> +++ b/hw/i386/amd_iommu.c
> @@ -962,6 +962,33 @@ static void amdvi_switch_address_space(AMDVIAddressSpace *amdvi_as)
> }
> }
>
> +/*
> + * For all existing address spaces managed by the IOMMU, enable/disable the
> + * corresponding memory regions depending on the address translation mode
> + * as determined by the global and individual address space settings.
> + */
> +static void amdvi_switch_address_space_all(AMDVIState *s)
> +{
> + AMDVIAddressSpace **iommu_as;
> +
> + for (int bus_num = 0; bus_num < PCI_BUS_MAX; bus_num++) {
> +
> + /* Nothing to do if there are no devices on the current bus */
> + if (!s->address_spaces[bus_num]) {
> + continue;
> + }
> + iommu_as = s->address_spaces[bus_num];
> +
> + for (int devfn = 0; devfn < PCI_DEVFN_MAX; devfn++) {
> +
> + if (!iommu_as[devfn]) {
> + continue;
> + }
> + amdvi_switch_address_space(iommu_as[devfn]);
> + }
> + }
> +}
> +
> /* log error without aborting since linux seems to be using reserved bits */
> static void amdvi_inval_devtab_entry(AMDVIState *s, uint64_t *cmd)
> {
> @@ -2199,6 +2226,7 @@ static void amdvi_sysbus_reset(DeviceState *dev)
>
> /* Discard all mappings on device reset */
> amdvi_address_space_unmap_all(s);
> + amdvi_switch_address_space_all(s);
Hi Alejandro
I think amdvi_sysbus_reset should set iommu_as->addr_translation flag to
"false" before switching all the address spaces. Without this, the
devices will keep using IOMMU address space.
Regards
Sairaj Kodilkar
> }
>
> static void amdvi_sysbus_realize(DeviceState *dev, Error **errp)
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH v2 16/20] amd_iommu: Set all address spaces to default translation mode on reset
2025-05-29 6:16 ` Sairaj Kodilkar
@ 2025-05-30 21:30 ` Alejandro Jimenez
2025-06-13 8:46 ` Sairaj Kodilkar
0 siblings, 1 reply; 54+ messages in thread
From: Alejandro Jimenez @ 2025-05-30 21:30 UTC (permalink / raw)
To: Sairaj Kodilkar, qemu-devel
Cc: pbonzini, richard.henderson, eduardo, peterx, david, philmd, mst,
marcel.apfelbaum, alex.williamson, vasant.hegde,
suravee.suthikulpanit, santosh.shukla, Wei.Huang2,
clement.mathieu--drif, ethan.milon, joao.m.martins,
boris.ostrovsky
Hey Sairaj,
On 5/29/25 2:16 AM, Sairaj Kodilkar wrote:
>
>
> On 5/2/2025 7:46 AM, Alejandro Jimenez wrote:
>> On reset, restore the default address translation mode for all the
>> address spaces managed by the vIOMMU.
>>
>> Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
>> ---
>> hw/i386/amd_iommu.c | 28 ++++++++++++++++++++++++++++
>> 1 file changed, 28 insertions(+)
>>
>> diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
>> index 71018d70dd10..90491367594b 100644
>> --- a/hw/i386/amd_iommu.c
>> +++ b/hw/i386/amd_iommu.c
>> @@ -962,6 +962,33 @@ static void
>> amdvi_switch_address_space(AMDVIAddressSpace *amdvi_as)
>> }
>> }
>> +/*
>> + * For all existing address spaces managed by the IOMMU, enable/
>> disable the
>> + * corresponding memory regions depending on the address translation
>> mode
>> + * as determined by the global and individual address space settings.
>> + */
>> +static void amdvi_switch_address_space_all(AMDVIState *s)
>> +{
>> + AMDVIAddressSpace **iommu_as;
>> +
>> + for (int bus_num = 0; bus_num < PCI_BUS_MAX; bus_num++) {
>> +
>> + /* Nothing to do if there are no devices on the current bus */
>> + if (!s->address_spaces[bus_num]) {
>> + continue;
>> + }
>> + iommu_as = s->address_spaces[bus_num];
>> +
>> + for (int devfn = 0; devfn < PCI_DEVFN_MAX; devfn++) {
>> +
>> + if (!iommu_as[devfn]) {
>> + continue;
>> + }
>> + amdvi_switch_address_space(iommu_as[devfn]);
>> + }
>> + }
>> +}
>> +
>> /* log error without aborting since linux seems to be using reserved
>> bits */
>> static void amdvi_inval_devtab_entry(AMDVIState *s, uint64_t *cmd)
>> {
>> @@ -2199,6 +2226,7 @@ static void amdvi_sysbus_reset(DeviceState *dev)
>> /* Discard all mappings on device reset */
>> amdvi_address_space_unmap_all(s);
>> + amdvi_switch_address_space_all(s);
>
> Hi Alejandro
>
> I think amdvi_sysbus_reset should set iommu_as->addr_translation flag to
> "false" before switching all the address spaces. Without this, the
> devices will keep using IOMMU address space.
>
My first impulse is to agree with you, from the standpoint of
considering the no_dma mode as the "default mode", and a reset should
bring us back to default. But I wonder whether that is necessarily the
architectural behavior...
After a reset, in order for any device transactions to be processed, a
guest driver must reinitialize the IOMMU data structures, including the
Device Table and specifically the table entry for the device. That must
trigger a INVAL_DEVTAB_ENTRY that will be intercepted and
as->addr_translation will be set correctly. If the guest driver doesn't
do these operations, then a device won't be able to use the IOMMU
because it doesn't have a valid DTE, right? The earlier mappings were
already dropped, so it doesn't affect the host.
Again, I see your point, and making this change is likely the right
thing to do, which is why I'll make the change for v3. Just wondering if
implementing such behavior is actually architecturally accurate or just
the "common sense" approach...
Thank you for your attention to detail and all the valuable feedback. I
will be out next week, and will send v3 once I am back online.
Alejandro
> Regards
> Sairaj Kodilkar
>> }
>> static void amdvi_sysbus_realize(DeviceState *dev, Error **errp)
>
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH v2 16/20] amd_iommu: Set all address spaces to default translation mode on reset
2025-05-30 21:30 ` Alejandro Jimenez
@ 2025-06-13 8:46 ` Sairaj Kodilkar
2025-06-23 22:08 ` Alejandro Jimenez
0 siblings, 1 reply; 54+ messages in thread
From: Sairaj Kodilkar @ 2025-06-13 8:46 UTC (permalink / raw)
To: Alejandro Jimenez, qemu-devel
Cc: pbonzini, richard.henderson, eduardo, peterx, david, philmd, mst,
marcel.apfelbaum, alex.williamson, vasant.hegde,
suravee.suthikulpanit, santosh.shukla, Wei.Huang2,
clement.mathieu--drif, ethan.milon, joao.m.martins,
boris.ostrovsky
On 5/31/2025 3:00 AM, Alejandro Jimenez wrote:
> Hey Sairaj,
>
> On 5/29/25 2:16 AM, Sairaj Kodilkar wrote:
>>
>>
>> On 5/2/2025 7:46 AM, Alejandro Jimenez wrote:
>>> On reset, restore the default address translation mode for all the
>>> address spaces managed by the vIOMMU.
>>>
>>> Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
>>> ---
>>> hw/i386/amd_iommu.c | 28 ++++++++++++++++++++++++++++
>>> 1 file changed, 28 insertions(+)
>>>
>>> diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
>>> index 71018d70dd10..90491367594b 100644
>>> --- a/hw/i386/amd_iommu.c
>>> +++ b/hw/i386/amd_iommu.c
>>> @@ -962,6 +962,33 @@ static void
>>> amdvi_switch_address_space(AMDVIAddressSpace *amdvi_as)
>>> }
>>> }
>>> +/*
>>> + * For all existing address spaces managed by the IOMMU, enable/
>>> disable the
>>> + * corresponding memory regions depending on the address translation
>>> mode
>>> + * as determined by the global and individual address space settings.
>>> + */
>>> +static void amdvi_switch_address_space_all(AMDVIState *s)
>>> +{
>>> + AMDVIAddressSpace **iommu_as;
>>> +
>>> + for (int bus_num = 0; bus_num < PCI_BUS_MAX; bus_num++) {
>>> +
>>> + /* Nothing to do if there are no devices on the current bus */
>>> + if (!s->address_spaces[bus_num]) {
>>> + continue;
>>> + }
>>> + iommu_as = s->address_spaces[bus_num];
>>> +
>>> + for (int devfn = 0; devfn < PCI_DEVFN_MAX; devfn++) {
>>> +
>>> + if (!iommu_as[devfn]) {
>>> + continue;
>>> + }
>>> + amdvi_switch_address_space(iommu_as[devfn]);
>>> + }
>>> + }
>>> +}
>>> +
>>> /* log error without aborting since linux seems to be using
>>> reserved bits */
>>> static void amdvi_inval_devtab_entry(AMDVIState *s, uint64_t *cmd)
>>> {
>>> @@ -2199,6 +2226,7 @@ static void amdvi_sysbus_reset(DeviceState *dev)
>>> /* Discard all mappings on device reset */
>>> amdvi_address_space_unmap_all(s);
>>> + amdvi_switch_address_space_all(s);
>>
>> Hi Alejandro
>>
>> I think amdvi_sysbus_reset should set iommu_as->addr_translation flag to
>> "false" before switching all the address spaces. Without this, the
>> devices will keep using IOMMU address space.
>>
> My first impulse is to agree with you, from the standpoint of
> considering the no_dma mode as the "default mode", and a reset should
> bring us back to default. But I wonder that is necessarily the
> architectural behavior...
>
> After a reset, in order for any device transactions to be processed, a
> guest driver must reinitialize the IOMMU data structures, including the
> Device Table and specifically the table entry for the device. That must
> trigger a INVAL_DEVTAB_ENTRY that will be intercepted and as-
> >addr_translation will be set correctly. If the guest driver doesn't do
> these operations, then a device won't be able to use the IOMMU because
> it doesn't have a valid DTE, right? The earlier mappings were already
> dropped, so it doesn't affect the host.
>
> Again, I see your point, and making this change is likely the right
> thing to do, which is why I'll make the change for v3. Just wondering if
> implementing such behavior is actually architecturally accurate or just
> the "common sense" approach...
>
> Thank you for your attention to detail and all the valuable feedback. I
> will be out next week, and will send v3 once I am back online.
>
> Alejandro
>
Hey
I tested the current patches with OVMF, and reboot crashes with
IO_PAGE_FAULT logs on the _host_. Setting "addr_translation=false" fixes
this crash. I think the reason is that the kernel does not reset the DTE
while shutting down the system. This keeps the stale mappings
(IOVA->SPA) in the IOMMU, and OVMF initiates DMA operations on the
device before the guest reboots and sets up the new mappings.
Thanks
Sairaj
>> Regards
>> Sairaj Kodilkar
>>> }
>>> static void amdvi_sysbus_realize(DeviceState *dev, Error **errp)
>>
>
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH v2 16/20] amd_iommu: Set all address spaces to default translation mode on reset
2025-06-13 8:46 ` Sairaj Kodilkar
@ 2025-06-23 22:08 ` Alejandro Jimenez
0 siblings, 0 replies; 54+ messages in thread
From: Alejandro Jimenez @ 2025-06-23 22:08 UTC (permalink / raw)
To: Sairaj Kodilkar, qemu-devel
Cc: pbonzini, richard.henderson, eduardo, peterx, david, philmd, mst,
marcel.apfelbaum, alex.williamson, vasant.hegde,
suravee.suthikulpanit, santosh.shukla, Wei.Huang2,
clement.mathieu--drif, ethan.milon, joao.m.martins,
boris.ostrovsky
On 6/13/25 4:46 AM, Sairaj Kodilkar wrote:
>
>
> On 5/31/2025 3:00 AM, Alejandro Jimenez wrote:
>> Hey Sairaj,
>>
>> On 5/29/25 2:16 AM, Sairaj Kodilkar wrote:
>>>
>>>
>>> On 5/2/2025 7:46 AM, Alejandro Jimenez wrote:
>>>> On reset, restore the default address translation mode for all the
>>>> address spaces managed by the vIOMMU.
>>>>
>>>> Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
>>>> ---
>>>> hw/i386/amd_iommu.c | 28 ++++++++++++++++++++++++++++
>>>> 1 file changed, 28 insertions(+)
>>>>
>>>> diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
>>>> index 71018d70dd10..90491367594b 100644
>>>> --- a/hw/i386/amd_iommu.c
>>>> +++ b/hw/i386/amd_iommu.c
>>>> @@ -962,6 +962,33 @@ static void
>>>> amdvi_switch_address_space(AMDVIAddressSpace *amdvi_as)
>>>> }
>>>> }
>>>> +/*
>>>> + * For all existing address spaces managed by the IOMMU, enable/
>>>> disable the
>>>> + * corresponding memory regions depending on the address
>>>> translation mode
>>>> + * as determined by the global and individual address space settings.
>>>> + */
>>>> +static void amdvi_switch_address_space_all(AMDVIState *s)
>>>> +{
>>>> + AMDVIAddressSpace **iommu_as;
>>>> +
>>>> + for (int bus_num = 0; bus_num < PCI_BUS_MAX; bus_num++) {
>>>> +
>>>> + /* Nothing to do if there are no devices on the current bus */
>>>> + if (!s->address_spaces[bus_num]) {
>>>> + continue;
>>>> + }
>>>> + iommu_as = s->address_spaces[bus_num];
>>>> +
>>>> + for (int devfn = 0; devfn < PCI_DEVFN_MAX; devfn++) {
>>>> +
>>>> + if (!iommu_as[devfn]) {
>>>> + continue;
>>>> + }
>>>> + amdvi_switch_address_space(iommu_as[devfn]);
>>>> + }
>>>> + }
>>>> +}
>>>> +
>>>> /* log error without aborting since linux seems to be using
>>>> reserved bits */
>>>> static void amdvi_inval_devtab_entry(AMDVIState *s, uint64_t *cmd)
>>>> {
>>>> @@ -2199,6 +2226,7 @@ static void amdvi_sysbus_reset(DeviceState *dev)
>>>> /* Discard all mappings on device reset */
>>>> amdvi_address_space_unmap_all(s);
>>>> + amdvi_switch_address_space_all(s);
>>>
>>> Hi Alejandro
>>>
>>> I think amdvi_sysbus_reset should set the iommu_as->addr_translation
>>> flag to "false" before switching all the address spaces. Without this,
>>> the devices will keep using the IOMMU address space.
>>>
>> My first impulse is to agree with you, from the standpoint of
>> considering the no_dma mode as the "default mode", and a reset should
>> bring us back to default. But I wonder whether that is necessarily the
>> architectural behavior...
>>
>> After a reset, in order for any device transactions to be processed, a
>> guest driver must reinitialize the IOMMU data structures, including
>> the Device Table and specifically the table entry for the device. That
>> must trigger an INVAL_DEVTAB_ENTRY that will be intercepted, and
>> as->addr_translation will be set correctly. If the guest driver doesn't
>> do these operations, then a device won't be able to use the IOMMU
>> because it doesn't have a valid DTE, right? The earlier mappings were
>> already dropped, so it doesn't affect the host.
>>
>> Again, I see your point, and making this change is likely the right
>> thing to do, which is why I'll make the change for v3. Just wondering
>> if implementing such behavior is actually architecturally accurate or
>> just the "common sense" approach...
>>
>> Thank you for your attention to detail and all the valuable feedback.
>> I will be out next week, and will send v3 once I am back online.
>>
>> Alejandro
>>
>
> Hey
>
> I tested the current patches on OVMF, and reboot crashes with IO_PAGE_FAULT
> logs on the _host_. But setting "addr_translation=false" fixes this crash.
>
That is odd; I have been using OVMF since the beginning with no issues,
so I'd have to ask you more questions about your setup to figure out
what the difference is, since I cannot reproduce this specific issue.
> I think the reason is that the kernel does not reset the DTE while shutting
> down the system. This leaves the stale mappings (IOVA->SPA) in the
> IOMMU, and OVMF initiates some DMA operations on the device before the guest
> reboots and sets up the new mappings.
>
On reset, the call chain
  amdvi_sysbus_reset()
    amdvi_address_space_unmap_all()
      amdvi_address_space_unmap()
is intended to avoid this problem.
I have found that on reboot AND when using forcedac=1, there are
mappings for the upper end of the IOVA space that are not dropped (this
happens after fixing the integer overflow issue you pointed out). I am
looking into it now, and once it is fixed I'd appreciate it if you could
try to reproduce this scenario again, as the underlying cause might be
the same.
Alejandro
> Thanks
> Sairaj
>
>>> Regards
>>> Sairaj Kodilkar
>>>> }
>>>> static void amdvi_sysbus_realize(DeviceState *dev, Error **errp)
>>>
>>
>
^ permalink raw reply [flat|nested] 54+ messages in thread
* [PATCH v2 17/20] amd_iommu: Add dma-remap property to AMD vIOMMU device
2025-05-02 2:15 [PATCH v2 00/20] AMD vIOMMU: DMA remapping support for VFIO devices Alejandro Jimenez
` (15 preceding siblings ...)
2025-05-02 2:16 ` [PATCH v2 16/20] amd_iommu: Set all address spaces to default translation mode on reset Alejandro Jimenez
@ 2025-05-02 2:16 ` Alejandro Jimenez
2025-05-02 2:16 ` [PATCH v2 18/20] amd_iommu: Toggle address translation mode on devtab entry invalidation Alejandro Jimenez
` (5 subsequent siblings)
22 siblings, 0 replies; 54+ messages in thread
From: Alejandro Jimenez @ 2025-05-02 2:16 UTC (permalink / raw)
To: qemu-devel
Cc: pbonzini, richard.henderson, eduardo, peterx, david, philmd, mst,
marcel.apfelbaum, alex.williamson, vasant.hegde,
suravee.suthikulpanit, santosh.shukla, sarunkod, Wei.Huang2,
clement.mathieu--drif, ethan.milon, joao.m.martins,
boris.ostrovsky, alejandro.j.jimenez
In order to enable device assignment with IOMMU protection and guest DMA
address translation, IOMMU MAP notifier support is necessary to allow users
like VFIO to synchronize the shadow page tables, i.e. to receive
notifications when the guest updates its I/O page tables and replay the
mappings onto the host I/O page tables.
Provide a new dma-remap property to govern the ability to register for MAP
notifications, effectively providing global control over the DMA address
translation functionality that was implemented in previous changes.
Note that DMA remapping support also requires that the vIOMMU be configured
with the NpCache capability, so a guest driver issues IOMMU invalidations for
both map() and unmap() operations. This capability is already set by default
and written to the configuration in amdvi_pci_realize() as part of
AMDVI_CAPAB_FEATURES.
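The gating this patch introduces can be sketched in isolation. This is a minimal, hypothetical sketch: the NOTIFIER_* values are illustrative stand-ins, not QEMU's real IOMMUNotifierFlag definitions, and check_notifier_flags() is not a QEMU function, only a mirror of the condition added to amdvi_iommu_notify_flag_changed() in the diff below.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Illustrative stand-ins for QEMU's IOMMUNotifierFlag bits. */
enum { NOTIFIER_UNMAP = 1 << 0, NOTIFIER_MAP = 1 << 1 };

/*
 * MAP notifier registration (needed by VFIO to keep the shadow I/O page
 * tables in sync) is refused unless the dma-remap property is enabled;
 * UNMAP-only notifiers remain allowed either way.
 */
static int check_notifier_flags(bool dma_remap, int new_flags)
{
    if (!dma_remap && (new_flags & NOTIFIER_MAP)) {
        return -ENOTSUP;
    }
    return 0;
}
```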
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
---
hw/i386/amd_iommu.c | 24 +++++++++++++++++-------
hw/i386/amd_iommu.h | 3 +++
2 files changed, 20 insertions(+), 7 deletions(-)
diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
index 90491367594b..a2df73062bf7 100644
--- a/hw/i386/amd_iommu.c
+++ b/hw/i386/amd_iommu.c
@@ -951,7 +951,9 @@ static void amdvi_iommu_address_space_sync_all(AMDVIState *s)
*/
static void amdvi_switch_address_space(AMDVIAddressSpace *amdvi_as)
{
- if (amdvi_as->addr_translation) {
+ AMDVIState *s = amdvi_as->iommu_state;
+
+ if (s->dma_remap && amdvi_as->addr_translation) {
/* Enabling DMA region */
memory_region_set_enabled(&amdvi_as->iommu_nodma, false);
memory_region_set_enabled(MEMORY_REGION(&amdvi_as->iommu), true);
@@ -2126,12 +2128,19 @@ static int amdvi_iommu_notify_flag_changed(IOMMUMemoryRegion *iommu,
AMDVIAddressSpace *as = container_of(iommu, AMDVIAddressSpace, iommu);
AMDVIState *s = as->iommu_state;
- if (new & IOMMU_NOTIFIER_MAP) {
- error_setg(errp,
- "device %02x.%02x.%x requires iommu notifier which is not "
- "currently supported", as->bus_num, PCI_SLOT(as->devfn),
- PCI_FUNC(as->devfn));
- return -EINVAL;
+ /*
+ * Accurate synchronization of the vIOMMU page tables required to support
+ * MAP notifiers is provided by the dma-remap feature. In addition, this
+ * also requires that the vIOMMU presents the NpCache capability, so a guest
+ * driver issues invalidations for both map() and unmap() operations. The
+ * capability is already set by default as part of AMDVI_CAPAB_FEATURES and
+ * written to the configuration in amdvi_pci_realize().
+ */
+ if (!s->dma_remap && (new & IOMMU_NOTIFIER_MAP)) {
+ error_setg_errno(errp, ENOTSUP,
+ "device %02x.%02x.%x requires dma-remap=1",
+ as->bus_num, PCI_SLOT(as->devfn), PCI_FUNC(as->devfn));
+ return -ENOTSUP;
}
/*
@@ -2281,6 +2290,7 @@ static void amdvi_sysbus_realize(DeviceState *dev, Error **errp)
static const Property amdvi_properties[] = {
DEFINE_PROP_BOOL("xtsup", AMDVIState, xtsup, false),
+ DEFINE_PROP_BOOL("dma-remap", AMDVIState, dma_remap, false),
};
static const VMStateDescription vmstate_amdvi_sysbus = {
diff --git a/hw/i386/amd_iommu.h b/hw/i386/amd_iommu.h
index a7462b2adb79..fc4d2f7a4575 100644
--- a/hw/i386/amd_iommu.h
+++ b/hw/i386/amd_iommu.h
@@ -420,6 +420,9 @@ struct AMDVIState {
/* Interrupt remapping */
bool ga_enabled;
bool xtsup;
+
+ /* DMA address translation */
+ bool dma_remap;
};
uint64_t amdvi_extended_feature_register(AMDVIState *s);
--
2.43.5
^ permalink raw reply related [flat|nested] 54+ messages in thread
* [PATCH v2 18/20] amd_iommu: Toggle address translation mode on devtab entry invalidation
2025-05-02 2:15 [PATCH v2 00/20] AMD vIOMMU: DMA remapping support for VFIO devices Alejandro Jimenez
` (16 preceding siblings ...)
2025-05-02 2:16 ` [PATCH v2 17/20] amd_iommu: Add dma-remap property to AMD vIOMMU device Alejandro Jimenez
@ 2025-05-02 2:16 ` Alejandro Jimenez
2025-06-12 8:27 ` Ethan MILON
2025-05-02 2:16 ` [PATCH v2 19/20] amd_iommu: Do not assume passthrough translation when DTE[TV]=0 Alejandro Jimenez
` (4 subsequent siblings)
22 siblings, 1 reply; 54+ messages in thread
From: Alejandro Jimenez @ 2025-05-02 2:16 UTC (permalink / raw)
To: qemu-devel
Cc: pbonzini, richard.henderson, eduardo, peterx, david, philmd, mst,
marcel.apfelbaum, alex.williamson, vasant.hegde,
suravee.suthikulpanit, santosh.shukla, sarunkod, Wei.Huang2,
clement.mathieu--drif, ethan.milon, joao.m.martins,
boris.ostrovsky, alejandro.j.jimenez
A guest must issue an INVALIDATE_DEVTAB_ENTRY command after changing a
Device Table entry (DTE), e.g. after attaching a device and setting up its
DTE. When intercepting this event, determine if the DTE has been configured
for paging or not, and toggle the appropriate memory regions to allow DMA
address translation for the address space if needed. Requires dma-remap=on.
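The devid decoding done by amdvi_update_addr_translation_mode() in the diff below follows the standard PCI requester-ID layout. A small self-contained sketch (the helper names here are hypothetical; they mirror what QEMU's PCI_BUS_NUM()/PCI_SLOT()/PCI_FUNC() macros compute):

```c
#include <assert.h>
#include <stdint.h>

/*
 * A PCI requester ID (devid) packs bus[15:8] and devfn[7:0], where devfn
 * itself is slot[7:3] and function[2:0].
 */
static inline uint8_t bdf_bus(uint16_t devid)   { return devid >> 8; }
static inline uint8_t bdf_devfn(uint16_t devid) { return devid & 0xff; }
static inline uint8_t bdf_slot(uint8_t devfn)   { return (devfn >> 3) & 0x1f; }
static inline uint8_t bdf_func(uint8_t devfn)   { return devfn & 0x7; }
```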
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
---
hw/i386/amd_iommu.c | 78 +++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 76 insertions(+), 2 deletions(-)
diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
index a2df73062bf7..75a92067f35f 100644
--- a/hw/i386/amd_iommu.c
+++ b/hw/i386/amd_iommu.c
@@ -991,18 +991,92 @@ static void amdvi_switch_address_space_all(AMDVIState *s)
}
}
+/*
+ * A guest driver must issue the INVALIDATE_DEVTAB_ENTRY command to the IOMMU
+ * after changing a Device Table entry. We can use this fact to detect when a
+ * Device Table entry is created for a device attached to a paging domain and
+ * enable the corresponding IOMMU memory region to allow for DMA translation if
+ * appropriate.
+ */
+static void amdvi_update_addr_translation_mode(AMDVIState *s, uint16_t devid)
+{
+ uint8_t bus_num, devfn, dte_mode;
+ AMDVIAddressSpace *as;
+ uint64_t dte[4] = { 0 };
+ IOMMUNotifier *n;
+ int ret;
+
+ /*
+ * Convert the devid encoded in the command to a bus and devfn in
+ * order to retrieve the corresponding address space.
+ */
+ bus_num = PCI_BUS_NUM(devid);
+ devfn = devid & 0xff;
+
+ /*
+ * The main buffer of size (AMDVIAddressSpace *) * (PCI_BUS_MAX) has already
+ * been allocated within AMDVIState, but must be careful to not access
+ * unallocated devfn.
+ */
+ if (!s->address_spaces[bus_num] || !s->address_spaces[bus_num][devfn]) {
+ return;
+ }
+ as = s->address_spaces[bus_num][devfn];
+
+ ret = amdvi_as_to_dte(as, dte);
+
+ if (!ret) {
+ dte_mode = (dte[0] >> AMDVI_DEV_MODE_RSHIFT) & AMDVI_DEV_MODE_MASK;
+ }
+
+ if ((ret < 0) || (!ret && !dte_mode)) {
+ /*
+ * The DTE could not be retrieved, it is not valid, or it is not setup
+ * for paging. In either case, ensure that if paging was previously in
+ * use then invalidate all existing mappings and then switch to use the
+ * no_dma memory region.
+ */
+ if (as->addr_translation) {
+ as->addr_translation = false;
+
+ IOMMU_NOTIFIER_FOREACH(n, &as->iommu) {
+ amdvi_address_space_unmap(as, n);
+ }
+ amdvi_switch_address_space(as);
+ }
+ } else if (!as->addr_translation) {
+ /*
+ * Installing a DTE that enables translation where it wasn't previously
+ * active. Activate the DMA memory region.
+ */
+ as->addr_translation = true;
+ amdvi_switch_address_space(as);
+ amdvi_address_space_sync(as);
+ }
+}
+
/* log error without aborting since linux seems to be using reserved bits */
static void amdvi_inval_devtab_entry(AMDVIState *s, uint64_t *cmd)
{
uint16_t devid = cpu_to_le16((uint16_t)extract64(cmd[0], 0, 16));
+ trace_amdvi_devtab_inval(PCI_BUS_NUM(devid), PCI_SLOT(devid),
+ PCI_FUNC(devid));
+
/* This command should invalidate internal caches of which there isn't */
if (extract64(cmd[0], 16, 44) || cmd[1]) {
amdvi_log_illegalcom_error(s, extract64(cmd[0], 60, 4),
s->cmdbuf + s->cmdbuf_head);
+ return;
+ }
+
+ /*
+ * When DMA remapping capability is enabled, check if updated DTE is setup
+ * for paging or not, and configure the corresponding memory regions.
+ */
+ if (s->dma_remap) {
+ amdvi_update_addr_translation_mode(s, devid);
}
- trace_amdvi_devtab_inval(PCI_BUS_NUM(devid), PCI_SLOT(devid),
- PCI_FUNC(devid));
}
static void amdvi_complete_ppr(AMDVIState *s, uint64_t *cmd)
--
2.43.5
^ permalink raw reply related [flat|nested] 54+ messages in thread
* Re: [PATCH v2 18/20] amd_iommu: Toggle address translation mode on devtab entry invalidation
2025-05-02 2:16 ` [PATCH v2 18/20] amd_iommu: Toggle address translation mode on devtab entry invalidation Alejandro Jimenez
@ 2025-06-12 8:27 ` Ethan MILON
2025-06-12 11:23 ` Sairaj Kodilkar
0 siblings, 1 reply; 54+ messages in thread
From: Ethan MILON @ 2025-06-12 8:27 UTC (permalink / raw)
To: Alejandro Jimenez, qemu-devel@nongnu.org
Cc: pbonzini@redhat.com, richard.henderson@linaro.org,
eduardo@habkost.net, peterx@redhat.com, david@redhat.com,
philmd@linaro.org, mst@redhat.com, marcel.apfelbaum@gmail.com,
alex.williamson@redhat.com, vasant.hegde@amd.com,
suravee.suthikulpanit@amd.com, santosh.shukla@amd.com,
sarunkod@amd.com, Wei.Huang2@amd.com, CLEMENT MATHIEU--DRIF,
joao.m.martins@oracle.com, boris.ostrovsky@oracle.com
Hi,
On 5/2/25 4:16 AM, Alejandro Jimenez wrote:
>
> A guest must issue an INVALIDATE_DEVTAB_ENTRY command after changing a
> Device Table entry (DTE) e.g. after attaching a device and setting up its
> DTE. When intercepting this event, determine if the DTE has been configured
> for paging or not, and toggle the appropriate memory regions to allow DMA
> address translation for the address space if needed. Requires dma-remap=on.
>
> Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
> ---
> hw/i386/amd_iommu.c | 78 +++++++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 76 insertions(+), 2 deletions(-)
>
> diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
> index a2df73062bf7..75a92067f35f 100644
> --- a/hw/i386/amd_iommu.c
> +++ b/hw/i386/amd_iommu.c
> @@ -991,18 +991,92 @@ static void amdvi_switch_address_space_all(AMDVIState *s)
> }
> }
>
> +/*
> + * A guest driver must issue the INVALIDATE_DEVTAB_ENTRY command to the IOMMU
> + * after changing a Device Table entry. We can use this fact to detect when a
> + * Device Table entry is created for a device attached to a paging domain and
> + * enable the corresponding IOMMU memory region to allow for DMA translation if
> + * appropriate.
> + */
> +static void amdvi_update_addr_translation_mode(AMDVIState *s, uint16_t devid)
> +{
> + uint8_t bus_num, devfn, dte_mode;
> + AMDVIAddressSpace *as;
> + uint64_t dte[4] = { 0 };
> + IOMMUNotifier *n;
> + int ret;
> +
> + /*
> + * Convert the devid encoded in the command to a bus and devfn in
> + * order to retrieve the corresponding address space.
> + */
> + bus_num = PCI_BUS_NUM(devid);
> + devfn = devid & 0xff;
> +
> + /*
> + * The main buffer of size (AMDVIAddressSpace *) * (PCI_BUS_MAX) has already
> + * been allocated within AMDVIState, but must be careful to not access
> + * unallocated devfn.
> + */
> + if (!s->address_spaces[bus_num] || !s->address_spaces[bus_num][devfn]) {
> + return;
> + }
> + as = s->address_spaces[bus_num][devfn];
> +
> + ret = amdvi_as_to_dte(as, dte);
> +
> + if (!ret) {
> + dte_mode = (dte[0] >> AMDVI_DEV_MODE_RSHIFT) & AMDVI_DEV_MODE_MASK;
> + }
> +
> + if ((ret < 0) || (!ret && !dte_mode)) {
> + /*
> + * The DTE could not be retrieved, it is not valid, or it is not setup
> + * for paging. In either case, ensure that if paging was previously in
> + * use then invalidate all existing mappings and then switch to use the
> + * no_dma memory region.
> + */
If the DTE is malformed or could not be retrieved, wouldn't it be safer
to default to the DMA region rather than falling back to direct access?
Or am I missing something?
Thanks,
Ethan
> + if (as->addr_translation) {
> + as->addr_translation = false;
> +
> + IOMMU_NOTIFIER_FOREACH(n, &as->iommu) {
> + amdvi_address_space_unmap(as, n);
> + }
> + amdvi_switch_address_space(as);
> + }
> + } else if (!as->addr_translation) {
> + /*
> + * Installing a DTE that enables translation where it wasn't previously
> + * active. Activate the DMA memory region.
> + */
> + as->addr_translation = true;
> + amdvi_switch_address_space(as);
> + amdvi_address_space_sync(as);
> + }
> +}
> +
> /* log error without aborting since linux seems to be using reserved bits */
> static void amdvi_inval_devtab_entry(AMDVIState *s, uint64_t *cmd)
> {
> uint16_t devid = cpu_to_le16((uint16_t)extract64(cmd[0], 0, 16));
>
> + trace_amdvi_devtab_inval(PCI_BUS_NUM(devid), PCI_SLOT(devid),
> + PCI_FUNC(devid));
> +
> /* This command should invalidate internal caches of which there isn't */
> if (extract64(cmd[0], 16, 44) || cmd[1]) {
> amdvi_log_illegalcom_error(s, extract64(cmd[0], 60, 4),
> s->cmdbuf + s->cmdbuf_head);
> + return;
> + }
> +
> + /*
> + * When DMA remapping capability is enabled, check if updated DTE is setup
> + * for paging or not, and configure the corresponding memory regions.
> + */
> + if (s->dma_remap) {
> + amdvi_update_addr_translation_mode(s, devid);
> }
> - trace_amdvi_devtab_inval(PCI_BUS_NUM(devid), PCI_SLOT(devid),
> - PCI_FUNC(devid));
> }
>
> static void amdvi_complete_ppr(AMDVIState *s, uint64_t *cmd)
> --
> 2.43.5
>
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH v2 18/20] amd_iommu: Toggle address translation mode on devtab entry invalidation
2025-06-12 8:27 ` Ethan MILON
@ 2025-06-12 11:23 ` Sairaj Kodilkar
0 siblings, 0 replies; 54+ messages in thread
From: Sairaj Kodilkar @ 2025-06-12 11:23 UTC (permalink / raw)
To: Ethan MILON, Alejandro Jimenez, qemu-devel@nongnu.org
Cc: pbonzini@redhat.com, richard.henderson@linaro.org,
eduardo@habkost.net, peterx@redhat.com, david@redhat.com,
philmd@linaro.org, mst@redhat.com, marcel.apfelbaum@gmail.com,
alex.williamson@redhat.com, vasant.hegde@amd.com,
suravee.suthikulpanit@amd.com, santosh.shukla@amd.com,
Wei.Huang2@amd.com, CLEMENT MATHIEU--DRIF,
joao.m.martins@oracle.com, boris.ostrovsky@oracle.com
On 6/12/2025 1:57 PM, Ethan MILON wrote:
> Hi,
>
> On 5/2/25 4:16 AM, Alejandro Jimenez wrote:
>>
>> A guest must issue an INVALIDATE_DEVTAB_ENTRY command after changing a
>> Device Table entry (DTE) e.g. after attaching a device and setting up its
>> DTE. When intercepting this event, determine if the DTE has been configured
>> for paging or not, and toggle the appropriate memory regions to allow DMA
>> address translation for the address space if needed. Requires dma-remap=on.
>>
>> Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
>> ---
>> hw/i386/amd_iommu.c | 78 +++++++++++++++++++++++++++++++++++++++++++--
>> 1 file changed, 76 insertions(+), 2 deletions(-)
>>
>> diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
>> index a2df73062bf7..75a92067f35f 100644
>> --- a/hw/i386/amd_iommu.c
>> +++ b/hw/i386/amd_iommu.c
>> @@ -991,18 +991,92 @@ static void amdvi_switch_address_space_all(AMDVIState *s)
>> }
>> }
>>
>> +/*
>> + * A guest driver must issue the INVALIDATE_DEVTAB_ENTRY command to the IOMMU
>> + * after changing a Device Table entry. We can use this fact to detect when a
>> + * Device Table entry is created for a device attached to a paging domain and
>> + * enable the corresponding IOMMU memory region to allow for DMA translation if
>> + * appropriate.
>> + */
>> +static void amdvi_update_addr_translation_mode(AMDVIState *s, uint16_t devid)
>> +{
>> + uint8_t bus_num, devfn, dte_mode;
>> + AMDVIAddressSpace *as;
>> + uint64_t dte[4] = { 0 };
>> + IOMMUNotifier *n;
>> + int ret;
>> +
>> + /*
>> + * Convert the devid encoded in the command to a bus and devfn in
>> + * order to retrieve the corresponding address space.
>> + */
>> + bus_num = PCI_BUS_NUM(devid);
>> + devfn = devid & 0xff;
>> +
>> + /*
>> + * The main buffer of size (AMDVIAddressSpace *) * (PCI_BUS_MAX) has already
>> + * been allocated within AMDVIState, but must be careful to not access
>> + * unallocated devfn.
>> + */
>> + if (!s->address_spaces[bus_num] || !s->address_spaces[bus_num][devfn]) {
>> + return;
>> + }
>> + as = s->address_spaces[bus_num][devfn];
>> +
>> + ret = amdvi_as_to_dte(as, dte);
>> +
>> + if (!ret) {
>> + dte_mode = (dte[0] >> AMDVI_DEV_MODE_RSHIFT) & AMDVI_DEV_MODE_MASK;
>> + }
>> +
>> + if ((ret < 0) || (!ret && !dte_mode)) {
>> + /*
>> + * The DTE could not be retrieved, it is not valid, or it is not setup
>> + * for paging. In either case, ensure that if paging was previously in
>> + * use then invalidate all existing mappings and then switch to use the
>> + * no_dma memory region.
>> + */
>
> If the DTE is malformed or could not be retrieved, wouldn't it be safer
> to default to the DMA region rather than falling back to direct access?
> Or am I missing something?
>
Hi Ethan
I also agree with you.
I think the right way would be to keep the IOMMU DMA region if the error code
is AMDVI_FR_DTE_RTR_ERR or AMDVI_FR_DTE_TV, and to switch to direct access
only when the error code is AMDVI_FR_DTE_V.
@vasant what do you think ?
Thanks
Sairaj
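The split proposed above can be condensed into a small decision helper. Purely illustrative: the enum values are stand-ins named after QEMU's AMDVI_FR_* fault codes, not their real definitions, and keep_dma_region() exists only for this sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-ins for the AMDVI_FR_* DTE fault codes. */
enum dte_err { DTE_ERR_V = 1, DTE_ERR_TV, DTE_ERR_RTR };

/*
 * Proposed policy: keep the translating IOMMU DMA region when the DTE is
 * present but structurally broken (TV or root-pointer errors), and fall
 * back to direct access only when the DTE is simply not valid (V=0).
 */
static bool keep_dma_region(enum dte_err err)
{
    return err == DTE_ERR_TV || err == DTE_ERR_RTR;
}
```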
^ permalink raw reply [flat|nested] 54+ messages in thread
* [PATCH v2 19/20] amd_iommu: Do not assume passthrough translation when DTE[TV]=0
2025-05-02 2:15 [PATCH v2 00/20] AMD vIOMMU: DMA remapping support for VFIO devices Alejandro Jimenez
` (17 preceding siblings ...)
2025-05-02 2:16 ` [PATCH v2 18/20] amd_iommu: Toggle address translation mode on devtab entry invalidation Alejandro Jimenez
@ 2025-05-02 2:16 ` Alejandro Jimenez
2025-05-12 7:00 ` Sairaj Kodilkar
2025-05-02 2:16 ` [PATCH v2 20/20] amd_iommu: Refactor amdvi_page_walk() to use common code for page walk Alejandro Jimenez
` (3 subsequent siblings)
22 siblings, 1 reply; 54+ messages in thread
From: Alejandro Jimenez @ 2025-05-02 2:16 UTC (permalink / raw)
To: qemu-devel
Cc: pbonzini, richard.henderson, eduardo, peterx, david, philmd, mst,
marcel.apfelbaum, alex.williamson, vasant.hegde,
suravee.suthikulpanit, santosh.shukla, sarunkod, Wei.Huang2,
clement.mathieu--drif, ethan.milon, joao.m.martins,
boris.ostrovsky, alejandro.j.jimenez
The AMD I/O Virtualization Technology (IOMMU) Specification (see Table
8: V, TV, and GV Fields in Device Table Entry) specifies that a DTE
with V=1, TV=0 does not contain valid address translation information.
If a request requires a table walk, the walk is terminated when this
condition is encountered.
Do not assume that addresses for a device with DTE[TV]=0 are passed
through (i.e. not remapped) and instead terminate the page table walk
early.
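The V/TV semantics from Table 8 reduce to a tiny predicate. This is a sketch under assumed bit positions (per the spec, V is bit 0 and TV is bit 1 of DTE[0]; QEMU's actual mask definitions live in amd_iommu.h and are not reproduced here):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed DTE[0] bit positions: V is bit 0, TV is bit 1. */
#define DTE_V   (1ULL << 0)
#define DTE_TV  (1ULL << 1)

/*
 * A table walk may only proceed when V=1 and TV=1. V=1, TV=0 carries no
 * valid Page Table Root Pointer and terminates the walk -- the case this
 * patch stops treating as passthrough.
 */
static bool dte_allows_walk(uint64_t dte0)
{
    return (dte0 & DTE_V) && (dte0 & DTE_TV);
}
```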
Fixes: d29a09ca6842 ("hw/i386: Introduce AMD IOMMU")
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
---
hw/i386/amd_iommu.c | 87 +++++++++++++++++++++++++--------------------
1 file changed, 48 insertions(+), 39 deletions(-)
diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
index 75a92067f35f..6d1e7cc65f83 100644
--- a/hw/i386/amd_iommu.c
+++ b/hw/i386/amd_iommu.c
@@ -1612,51 +1612,60 @@ static void amdvi_page_walk(AMDVIAddressSpace *as, uint64_t *dte,
uint64_t pte = dte[0], pte_addr, page_mask;
/* make sure the DTE has TV = 1 */
- if (pte & AMDVI_DEV_TRANSLATION_VALID) {
- level = get_pte_translation_mode(pte);
- if (level >= 7) {
- trace_amdvi_mode_invalid(level, addr);
+ if (!(pte & AMDVI_DEV_TRANSLATION_VALID)) {
+ /*
+ * A DTE with V=1, TV=0 does not have a valid Page Table Root Pointer.
+ * An IOMMU processing a request that requires a table walk terminates
+ * the walk when it encounters this condition. Do the same and return
+ * instead of assuming that the address is forwarded without translation
+ * i.e. the passthrough case, as it is done for the case where DTE[V]=0.
+ */
+ return;
+ }
+
+ level = get_pte_translation_mode(pte);
+ if (level >= 7) {
+ trace_amdvi_mode_invalid(level, addr);
+ return;
+ }
+ if (level == 0) {
+ goto no_remap;
+ }
+
+ /* we are at the leaf page table or page table encodes a huge page */
+ do {
+ pte_perms = amdvi_get_perms(pte);
+ present = pte & 1;
+ if (!present || perms != (perms & pte_perms)) {
+ amdvi_page_fault(as->iommu_state, as->devfn, addr, perms);
+ trace_amdvi_page_fault(addr);
return;
}
- if (level == 0) {
- goto no_remap;
+ /* go to the next lower level */
+ pte_addr = pte & AMDVI_DEV_PT_ROOT_MASK;
+ /* add offset and load pte */
+ pte_addr += ((addr >> (3 + 9 * level)) & 0x1FF) << 3;
+ pte = amdvi_get_pte_entry(as->iommu_state, pte_addr, as->devfn);
+ if (!pte) {
+ return;
}
+ oldlevel = level;
+ level = get_pte_translation_mode(pte);
+ } while (level > 0 && level < 7);
- /* we are at the leaf page table or page table encodes a huge page */
- do {
- pte_perms = amdvi_get_perms(pte);
- present = pte & 1;
- if (!present || perms != (perms & pte_perms)) {
- amdvi_page_fault(as->iommu_state, as->devfn, addr, perms);
- trace_amdvi_page_fault(addr);
- return;
- }
-
- /* go to the next lower level */
- pte_addr = pte & AMDVI_DEV_PT_ROOT_MASK;
- /* add offset and load pte */
- pte_addr += ((addr >> (3 + 9 * level)) & 0x1FF) << 3;
- pte = amdvi_get_pte_entry(as->iommu_state, pte_addr, as->devfn);
- if (!pte || (pte == (uint64_t)-1)) {
- return;
- }
- oldlevel = level;
- level = get_pte_translation_mode(pte);
- } while (level > 0 && level < 7);
+ if (level == 0x7) {
+ page_mask = pte_override_page_mask(pte);
+ } else {
+ page_mask = pte_get_page_mask(oldlevel);
+ }
- if (level == 0x7) {
- page_mask = pte_override_page_mask(pte);
- } else {
- page_mask = pte_get_page_mask(oldlevel);
- }
+ /* get access permissions from pte */
+ ret->iova = addr & page_mask;
+ ret->translated_addr = (pte & AMDVI_DEV_PT_ROOT_MASK) & page_mask;
+ ret->addr_mask = ~page_mask;
+ ret->perm = amdvi_get_perms(pte);
+ return;
- /* get access permissions from pte */
- ret->iova = addr & page_mask;
- ret->translated_addr = (pte & AMDVI_DEV_PT_ROOT_MASK) & page_mask;
- ret->addr_mask = ~page_mask;
- ret->perm = amdvi_get_perms(pte);
- return;
- }
no_remap:
ret->iova = addr & AMDVI_PAGE_MASK_4K;
ret->translated_addr = addr & AMDVI_PAGE_MASK_4K;
--
2.43.5
^ permalink raw reply related [flat|nested] 54+ messages in thread
* Re: [PATCH v2 19/20] amd_iommu: Do not assume passthrough translation when DTE[TV]=0
2025-05-02 2:16 ` [PATCH v2 19/20] amd_iommu: Do not assume passthrough translation when DTE[TV]=0 Alejandro Jimenez
@ 2025-05-12 7:00 ` Sairaj Kodilkar
2025-05-14 21:49 ` Alejandro Jimenez
0 siblings, 1 reply; 54+ messages in thread
From: Sairaj Kodilkar @ 2025-05-12 7:00 UTC (permalink / raw)
To: Alejandro Jimenez, qemu-devel
Cc: pbonzini, richard.henderson, eduardo, peterx, david, philmd, mst,
marcel.apfelbaum, alex.williamson, vasant.hegde,
suravee.suthikulpanit, santosh.shukla, Wei.Huang2,
clement.mathieu--drif, ethan.milon, joao.m.martins,
boris.ostrovsky
On 5/2/2025 7:46 AM, Alejandro Jimenez wrote:
> The AMD I/O Virtualization Technology (IOMMU) Specification (see Table
> 8: V, TV, and GV Fields in Device Table Entry) specifies that a DTE
> with V=1, TV=0 does not contain valid address translation information.
> If a request requires a table walk, the walk is terminated when this
> condition is encountered.
>
> Do not assume that addresses for a device with DTE[TV]=0 are passed
> through (i.e. not remapped) and instead terminate the page table walk
> early.
>
> Fixes: d29a09ca6842 ("hw/i386: Introduce AMD IOMMU")
> Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
> ---
> hw/i386/amd_iommu.c | 87 +++++++++++++++++++++++++--------------------
> 1 file changed, 48 insertions(+), 39 deletions(-)
>
> diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
> index 75a92067f35f..6d1e7cc65f83 100644
> --- a/hw/i386/amd_iommu.c
> +++ b/hw/i386/amd_iommu.c
> @@ -1612,51 +1612,60 @@ static void amdvi_page_walk(AMDVIAddressSpace *as, uint64_t *dte,
> uint64_t pte = dte[0], pte_addr, page_mask;
>
> /* make sure the DTE has TV = 1 */
> - if (pte & AMDVI_DEV_TRANSLATION_VALID) {
> - level = get_pte_translation_mode(pte);
> - if (level >= 7) {
> - trace_amdvi_mode_invalid(level, addr);
> + if (!(pte & AMDVI_DEV_TRANSLATION_VALID)) {
> + /*
> + * A DTE with V=1, TV=0 does not have a valid Page Table Root Pointer.
> + * An IOMMU processing a request that requires a table walk terminates
> + * the walk when it encounters this condition. Do the same and return
> + * instead of assuming that the address is forwarded without translation
> + * i.e. the passthrough case, as it is done for the case where DTE[V]=0.
> + */
> + return;
> + }
The above change seems redundant, since the caller of amdvi_page_walk(),
amdvi_do_translate(), checks the return value of amdvi_as_to_dte().
amdvi_do_translate() returns when it encounters -AMDVI_FR_DTE_TV and
does not call amdvi_page_walk().
Regards
Sairaj Kodilkar
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH v2 19/20] amd_iommu: Do not assume passthrough translation when DTE[TV]=0
2025-05-12 7:00 ` Sairaj Kodilkar
@ 2025-05-14 21:49 ` Alejandro Jimenez
2025-05-16 8:14 ` Sairaj Kodilkar
0 siblings, 1 reply; 54+ messages in thread
From: Alejandro Jimenez @ 2025-05-14 21:49 UTC (permalink / raw)
To: Sairaj Kodilkar, qemu-devel
Cc: pbonzini, richard.henderson, eduardo, peterx, david, philmd, mst,
marcel.apfelbaum, alex.williamson, vasant.hegde,
suravee.suthikulpanit, santosh.shukla, Wei.Huang2,
clement.mathieu--drif, ethan.milon, joao.m.martins,
boris.ostrovsky
On 5/12/25 3:00 AM, Sairaj Kodilkar wrote:
>
>
> On 5/2/2025 7:46 AM, Alejandro Jimenez wrote:
>> - if (pte & AMDVI_DEV_TRANSLATION_VALID) {
>> - level = get_pte_translation_mode(pte);
>> - if (level >= 7) {
>> - trace_amdvi_mode_invalid(level, addr);
>> + if (!(pte & AMDVI_DEV_TRANSLATION_VALID)) {
>> + /*
>> + * A DTE with V=1, TV=0 does not have a valid Page Table Root
>> Pointer.
>> + * An IOMMU processing a request that requires a table walk
>> terminates
>> + * the walk when it encounters this condition. Do the same
>> and return
>> + * instead of assuming that the address is forwarded without
>> translation
>> + * i.e. the passthrough case, as it is done for the case
>> where DTE[V]=0.
>> + */
>> + return;
>> + }
>
> Above change seems redundant since caller of the amdvi_page_walk(),
> amdvi_do_translate() checks the return value of amdvi_as_to_dte().
> amdvi_do_translate() returns when it encounters -AMDVI_FR_DTE_TV and
> does not call amdvi_page_walk().
>
It is perhaps redundant at this point for the reason you mention, as we
are already checking the DTE for all sorts of conditions earlier.
But I would like to leave it in since it serves as a good validation
that any future callers of amdvi_page_walk(), which might not check or
care about the return of amdvi_as_to_dte(), are passing a DTE that
specifically supports a page walk, which is exactly what this function
needs to proceed. Plus it provides a natural place to put the comment
detailing the IOMMU behavior in such cases.
If you are concerned about efficiency, this is already not a very fast
path, and the 5 assembly instructions this check takes (I think ~2 with a
non-debug build) will certainly not be what brings the performance
down :).
Alejandro
> Regards
> Sairaj Kodilkar
>
>
>
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH v2 19/20] amd_iommu: Do not assume passthrough translation when DTE[TV]=0
2025-05-14 21:49 ` Alejandro Jimenez
@ 2025-05-16 8:14 ` Sairaj Kodilkar
0 siblings, 0 replies; 54+ messages in thread
From: Sairaj Kodilkar @ 2025-05-16 8:14 UTC (permalink / raw)
To: Alejandro Jimenez, qemu-devel
Cc: pbonzini, richard.henderson, eduardo, peterx, david, philmd, mst,
marcel.apfelbaum, alex.williamson, vasant.hegde,
suravee.suthikulpanit, santosh.shukla, Wei.Huang2,
clement.mathieu--drif, ethan.milon, joao.m.martins,
boris.ostrovsky
On 5/15/2025 3:19 AM, Alejandro Jimenez wrote:
>
>
> On 5/12/25 3:00 AM, Sairaj Kodilkar wrote:
>>
>>
>> On 5/2/2025 7:46 AM, Alejandro Jimenez wrote:
>
>>> - if (pte & AMDVI_DEV_TRANSLATION_VALID) {
>>> - level = get_pte_translation_mode(pte);
>>> - if (level >= 7) {
>>> - trace_amdvi_mode_invalid(level, addr);
>>> + if (!(pte & AMDVI_DEV_TRANSLATION_VALID)) {
>>> + /*
>>> + * A DTE with V=1, TV=0 does not have a valid Page Table Root Pointer.
>>> + * An IOMMU processing a request that requires a table walk terminates
>>> + * the walk when it encounters this condition. Do the same and return
>>> + * instead of assuming that the address is forwarded without translation
>>> + * i.e. the passthrough case, as it is done for the case where DTE[V]=0.
>>> + */
>>> + return;
>>> + }
>>
>> The above change seems redundant, since the caller of amdvi_page_walk(),
>> amdvi_do_translate(), checks the return value of amdvi_as_to_dte().
>> amdvi_do_translate() returns when it encounters -AMDVI_FR_DTE_TV and
>> does not call amdvi_page_walk().
>>
>
> It is perhaps redundant at this point for the reason you mention, as we
> are already checking the DTE for all sorts of conditions earlier.
>
> But I would like to leave it in since it serves as a good validation
> that any future callers of amdvi_page_walk(), which might not check or
> care about the return of amdvi_as_to_dte(), are passing a DTE that
> specifically supports a page walk, which is exactly what this function
> needs to proceed. Plus it provides a natural place for the comment
> detailing the IOMMU behavior in such cases.
Yes, agree with you !
>
> If you are concerned about efficiency, this is already not a very fast
> path, and the 5 assembly instructions this check takes (I think ~2 with a
> non-debug build) will certainly not be what brings the performance
> down :).
>
No, I was not concerned with efficiency. I thought you had missed that
there is another check :).
> Alejandro
>
>> Regards
>> Sairaj Kodilkar
>>
>>
>>
>
Regards
Sairaj Kodilkar
^ permalink raw reply [flat|nested] 54+ messages in thread
* [PATCH v2 20/20] amd_iommu: Refactor amdvi_page_walk() to use common code for page walk
2025-05-02 2:15 [PATCH v2 00/20] AMD vIOMMU: DMA remapping support for VFIO devices Alejandro Jimenez
` (18 preceding siblings ...)
2025-05-02 2:16 ` [PATCH v2 19/20] amd_iommu: Do not assume passthrough translation when DTE[TV]=0 Alejandro Jimenez
@ 2025-05-02 2:16 ` Alejandro Jimenez
2025-05-11 18:34 ` [PATCH v2 00/20] AMD vIOMMU: DMA remapping support for VFIO devices Michael S. Tsirkin
` (2 subsequent siblings)
22 siblings, 0 replies; 54+ messages in thread
From: Alejandro Jimenez @ 2025-05-02 2:16 UTC (permalink / raw)
To: qemu-devel
Cc: pbonzini, richard.henderson, eduardo, peterx, david, philmd, mst,
marcel.apfelbaum, alex.williamson, vasant.hegde,
suravee.suthikulpanit, santosh.shukla, sarunkod, Wei.Huang2,
clement.mathieu--drif, ethan.milon, joao.m.martins,
boris.ostrovsky, alejandro.j.jimenez
Simplify amdvi_page_walk() by making it call the fetch_pte() helper that is
already in use by the shadow page synchronization code. This ensures that
all code paths use the same page table walking algorithm.
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
---
hw/i386/amd_iommu.c | 59 +++++++++++++++++++++------------------------
1 file changed, 27 insertions(+), 32 deletions(-)
diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
index 6d1e7cc65f83..ab236a8e016d 100644
--- a/hw/i386/amd_iommu.c
+++ b/hw/i386/amd_iommu.c
@@ -1608,11 +1608,13 @@ static void amdvi_page_walk(AMDVIAddressSpace *as, uint64_t *dte,
IOMMUTLBEntry *ret, unsigned perms,
hwaddr addr)
{
- unsigned level, present, pte_perms, oldlevel;
- uint64_t pte = dte[0], pte_addr, page_mask;
+ hwaddr page_mask, pagesize = 0;
+ uint8_t mode;
+ uint64_t pte;
+ int fetch_ret;
/* make sure the DTE has TV = 1 */
- if (!(pte & AMDVI_DEV_TRANSLATION_VALID)) {
+ if (!(dte[0] & AMDVI_DEV_TRANSLATION_VALID)) {
/*
* A DTE with V=1, TV=0 does not have a valid Page Table Root Pointer.
* An IOMMU processing a request that requires a table walk terminates
@@ -1623,42 +1625,35 @@ static void amdvi_page_walk(AMDVIAddressSpace *as, uint64_t *dte,
return;
}
- level = get_pte_translation_mode(pte);
- if (level >= 7) {
- trace_amdvi_mode_invalid(level, addr);
+ mode = get_pte_translation_mode(dte[0]);
+ if (mode >= 7) {
+ trace_amdvi_mode_invalid(mode, addr);
return;
}
- if (level == 0) {
+ if (mode == 0) {
goto no_remap;
}
- /* we are at the leaf page table or page table encodes a huge page */
- do {
- pte_perms = amdvi_get_perms(pte);
- present = pte & 1;
- if (!present || perms != (perms & pte_perms)) {
- amdvi_page_fault(as->iommu_state, as->devfn, addr, perms);
- trace_amdvi_page_fault(addr);
- return;
- }
- /* go to the next lower level */
- pte_addr = pte & AMDVI_DEV_PT_ROOT_MASK;
- /* add offset and load pte */
- pte_addr += ((addr >> (3 + 9 * level)) & 0x1FF) << 3;
- pte = amdvi_get_pte_entry(as->iommu_state, pte_addr, as->devfn);
- if (!pte) {
- return;
- }
- oldlevel = level;
- level = get_pte_translation_mode(pte);
- } while (level > 0 && level < 7);
+ /* Attempt to fetch the PTE to determine if a valid mapping exists */
+ fetch_ret = fetch_pte(as, addr, dte[0], &pte, &pagesize);
- if (level == 0x7) {
- page_mask = pte_override_page_mask(pte);
- } else {
- page_mask = pte_get_page_mask(oldlevel);
+ /*
+ * If walking the page table results in an error of any type, yields an
+ * empty PTE (i.e. no mapping), or the permissions do not match, return,
+ * since there is no translation available.
+ */
+ if (fetch_ret < 0 || !IOMMU_PTE_PRESENT(pte) ||
+ perms != (perms & amdvi_get_perms(pte))) {
+
+ amdvi_page_fault(as->iommu_state, as->devfn, addr, perms);
+ trace_amdvi_page_fault(addr);
+ return;
}
+ /* A valid PTE and page size has been retrieved */
+ assert(pagesize);
+ page_mask = ~(pagesize - 1);
+
/* get access permissions from pte */
ret->iova = addr & page_mask;
ret->translated_addr = (pte & AMDVI_DEV_PT_ROOT_MASK) & page_mask;
@@ -1670,7 +1665,7 @@ no_remap:
ret->iova = addr & AMDVI_PAGE_MASK_4K;
ret->translated_addr = addr & AMDVI_PAGE_MASK_4K;
ret->addr_mask = ~AMDVI_PAGE_MASK_4K;
- ret->perm = amdvi_get_perms(pte);
+ ret->perm = amdvi_get_perms(dte[0]);
}
static void amdvi_do_translate(AMDVIAddressSpace *as, hwaddr addr,
--
2.43.5
^ permalink raw reply related [flat|nested] 54+ messages in thread
* Re: [PATCH v2 00/20] AMD vIOMMU: DMA remapping support for VFIO devices
2025-05-02 2:15 [PATCH v2 00/20] AMD vIOMMU: DMA remapping support for VFIO devices Alejandro Jimenez
` (19 preceding siblings ...)
2025-05-02 2:16 ` [PATCH v2 20/20] amd_iommu: Refactor amdvi_page_walk() to use common code for page walk Alejandro Jimenez
@ 2025-05-11 18:34 ` Michael S. Tsirkin
2025-05-16 8:07 ` Sairaj Kodilkar
2025-05-30 11:41 ` Michael S. Tsirkin
22 siblings, 0 replies; 54+ messages in thread
From: Michael S. Tsirkin @ 2025-05-11 18:34 UTC (permalink / raw)
To: Alejandro Jimenez
Cc: qemu-devel, pbonzini, richard.henderson, eduardo, peterx, david,
philmd, marcel.apfelbaum, alex.williamson, vasant.hegde,
suravee.suthikulpanit, santosh.shukla, sarunkod, Wei.Huang2,
clement.mathieu--drif, ethan.milon, joao.m.martins,
boris.ostrovsky
On Fri, May 02, 2025 at 02:15:45AM +0000, Alejandro Jimenez wrote:
> This series adds support for guests using the AMD vIOMMU to enable DMA
> remapping for VFIO devices. In addition to the currently supported
> passthrough (PT) mode, guest kernels are now able to provide DMA
> address translation and access permission checking to VFs attached to
> paging domains, using the AMD v1 I/O page table format.
>
> Please see v1[0] cover letter for additional details such as example
> QEMU command line parameters used in testing.
Suravee, you are active there; maybe you can take a look at the AMD bits, too?
> Changes since v1[0]:
> - Added documentation entry for '-device amd-iommu'
> - Code movement with no functional changes to avoid use of forward
> declarations in later patches [Sairaj, mst]
> - Moved addr_translation and dma-remap property to separate commits.
> The dma-remap feature is only available for users to enable after
> all required functionality is implemented [Sairaj]
> - Explicit initialization of significant fields like addr_translation
> and notifier_flags [Sairaj]
> - Fixed bug in decoding of invalidation size [Sairaj]
> - Changed fetch_pte() to use an out parameter for pte, and be able to
> check for error conditions via negative return value [Clement]
> - Removed UNMAP-only notifier optimization, leaving vhost support for
> later series [Sairaj]
> - Fixed ordering between address space unmap and memory region activation
> on devtab invalidation [Sairaj]
> - Fixed commit message with "V=1, TV=0" [Sairaj]
> - Dropped patch removing the page_fault event. That area is better
> addressed in separate series.
> - Independent testing by Sairaj (thank you!)
>
> Thank you,
> Alejandro
>
> [0] https://lore.kernel.org/all/20250414020253.443831-1-alejandro.j.jimenez@oracle.com/
>
> Alejandro Jimenez (20):
> memory: Adjust event ranges to fit within notifier boundaries
> amd_iommu: Document '-device amd-iommu' common options
> amd_iommu: Reorder device and page table helpers
> amd_iommu: Helper to decode size of page invalidation command
> amd_iommu: Add helper function to extract the DTE
> amd_iommu: Return an error when unable to read PTE from guest memory
> amd_iommu: Add helpers to walk AMD v1 Page Table format
> amd_iommu: Add a page walker to sync shadow page tables on
> invalidation
> amd_iommu: Add basic structure to support IOMMU notifier updates
> amd_iommu: Sync shadow page tables on page invalidation
> amd_iommu: Use iova_tree records to determine large page size on UNMAP
> amd_iommu: Unmap all address spaces under the AMD IOMMU on reset
> amd_iommu: Add replay callback
> amd_iommu: Invalidate address translations on INVALIDATE_IOMMU_ALL
> amd_iommu: Toggle memory regions based on address translation mode
> amd_iommu: Set all address spaces to default translation mode on reset
> amd_iommu: Add dma-remap property to AMD vIOMMU device
> amd_iommu: Toggle address translation mode on devtab entry
> invalidation
> amd_iommu: Do not assume passthrough translation when DTE[TV]=0
> amd_iommu: Refactor amdvi_page_walk() to use common code for page walk
>
> hw/i386/amd_iommu.c | 1005 ++++++++++++++++++++++++++++++++++++-------
> hw/i386/amd_iommu.h | 52 +++
> qemu-options.hx | 23 +
> system/memory.c | 10 +-
> 4 files changed, 934 insertions(+), 156 deletions(-)
>
>
> base-commit: 5134cf9b5d3aee4475fe7e1c1c11b093731073cf
> --
> 2.43.5
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH v2 00/20] AMD vIOMMU: DMA remapping support for VFIO devices
2025-05-02 2:15 [PATCH v2 00/20] AMD vIOMMU: DMA remapping support for VFIO devices Alejandro Jimenez
` (20 preceding siblings ...)
2025-05-11 18:34 ` [PATCH v2 00/20] AMD vIOMMU: DMA remapping support for VFIO devices Michael S. Tsirkin
@ 2025-05-16 8:07 ` Sairaj Kodilkar
2025-05-21 2:35 ` Alejandro Jimenez
2025-05-30 11:41 ` Michael S. Tsirkin
22 siblings, 1 reply; 54+ messages in thread
From: Sairaj Kodilkar @ 2025-05-16 8:07 UTC (permalink / raw)
To: Alejandro Jimenez, qemu-devel
Cc: pbonzini, richard.henderson, eduardo, peterx, david, philmd, mst,
marcel.apfelbaum, alex.williamson, vasant.hegde,
suravee.suthikulpanit, santosh.shukla, Wei.Huang2,
clement.mathieu--drif, ethan.milon, joao.m.martins,
boris.ostrovsky
On 5/2/2025 7:45 AM, Alejandro Jimenez wrote:
> This series adds support for guests using the AMD vIOMMU to enable DMA
> remapping for VFIO devices. In addition to the currently supported
> passthrough (PT) mode, guest kernels are now able to provide DMA
> address translation and access permission checking to VFs attached to
> paging domains, using the AMD v1 I/O page table format.
>
> Please see v1[0] cover letter for additional details such as example
> QEMU command line parameters used in testing.
>
> Changes since v1[0]:
> - Added documentation entry for '-device amd-iommu'
> - Code movement with no functional changes to avoid use of forward
> declarations in later patches [Sairaj, mst]
> - Moved addr_translation and dma-remap property to separate commits.
> The dma-remap feature is only available for users to enable after
> all required functionality is implemented [Sairaj]
> - Explicit initialization of significant fields like addr_translation
> and notifier_flags [Sairaj]
> - Fixed bug in decoding of invalidation size [Sairaj]
> - Changed fetch_pte() to use an out parameter for pte, and be able to
> check for error conditions via negative return value [Clement]
> - Removed UNMAP-only notifier optimization, leaving vhost support for
> later series [Sairaj]
> - Fixed ordering between address space unmap and memory region activation
> on devtab invalidation [Sairaj]
> - Fixed commit message with "V=1, TV=0" [Sairaj]
> - Dropped patch removing the page_fault event. That area is better
> addressed in separate series.
> - Independent testing by Sairaj (thank you!)
>
> Thank you,
> Alejandro
>
> [0] https://lore.kernel.org/all/20250414020253.443831-1-alejandro.j.jimenez@oracle.com/
>
> Alejandro Jimenez (20):
> memory: Adjust event ranges to fit within notifier boundaries
> amd_iommu: Document '-device amd-iommu' common options
> amd_iommu: Reorder device and page table helpers
> amd_iommu: Helper to decode size of page invalidation command
> amd_iommu: Add helper function to extract the DTE
> amd_iommu: Return an error when unable to read PTE from guest memory
> amd_iommu: Add helpers to walk AMD v1 Page Table format
> amd_iommu: Add a page walker to sync shadow page tables on
> invalidation
> amd_iommu: Add basic structure to support IOMMU notifier updates
> amd_iommu: Sync shadow page tables on page invalidation
> amd_iommu: Use iova_tree records to determine large page size on UNMAP
> amd_iommu: Unmap all address spaces under the AMD IOMMU on reset
> amd_iommu: Add replay callback
> amd_iommu: Invalidate address translations on INVALIDATE_IOMMU_ALL
> amd_iommu: Toggle memory regions based on address translation mode
> amd_iommu: Set all address spaces to default translation mode on reset
> amd_iommu: Add dma-remap property to AMD vIOMMU device
> amd_iommu: Toggle address translation mode on devtab entry
> invalidation
> amd_iommu: Do not assume passthrough translation when DTE[TV]=0
> amd_iommu: Refactor amdvi_page_walk() to use common code for page walk
>
> hw/i386/amd_iommu.c | 1005 ++++++++++++++++++++++++++++++++++++-------
> hw/i386/amd_iommu.h | 52 +++
> qemu-options.hx | 23 +
> system/memory.c | 10 +-
> 4 files changed, 934 insertions(+), 156 deletions(-)
>
>
> base-commit: 5134cf9b5d3aee4475fe7e1c1c11b093731073cf
Hi Alejandro,
Tested the v2; everything looks good when I boot the guest with an upstream
kernel. But I observed that the NVME driver fails to load with guest kernel
version 4.15.0-213-generic. This is the default kernel that comes with
the Ubuntu image.
This is what I see in the dmesg
[ 26.702381] nvme nvme0: pci function 0000:00:04.0
[ 26.817847] nvme nvme0: missing or invalid SUBNQN field.
I am using following command qemu command line
-enable-kvm -m 10G -smp cpus=$NUM_VCPUS \
-device amd-iommu,dma-remap=on \
-netdev user,id=USER0,hostfwd=tcp::3333-:22 \
-device virtio-net-pci,id=vnet0,iommu_platform=on,disable-legacy=on,romfile=,netdev=USER0 \
-cpu EPYC-Genoa,x2apic=on,kvm-msi-ext-dest-id=on,+kvm-pv-unhalt,kvm-pv-tlb-flush,kvm-pv-ipi,kvm-pv-sched-yield \
-name guest=my-vm,debug-threads=on \
-machine q35,kernel_irqchip=split \
-global kvm-pit.lost_tick_policy=discard \
-nographic -vga none -chardev stdio,id=STDIO0,signal=off,mux=on \
-device isa-serial,id=isa-serial0,chardev=STDIO0 \
-smbios type=0,version=2.8 \
-blockdev node-name=drive0,driver=qcow2,file.driver=file,file.filename=$IMG \
-device virtio-blk-pci,num-queues=8,drive=drive0 \
-chardev socket,id=SOCKET1,server=on,wait=off,path=qemu.mon.user3333 \
-mon chardev=SOCKET1,mode=control \
-device vfio-pci,host=0000:44:00.0
Do you have any idea what might trigger this?
I see the error only when I am using the emulated AMD IOMMU with a
passthrough device. Regular passthrough works fine.
Regards
Sairaj Kodilkar
P.S. I know that the guest kernel is quite old but still wanted to make
you aware.
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH v2 00/20] AMD vIOMMU: DMA remapping support for VFIO devices
2025-05-16 8:07 ` Sairaj Kodilkar
@ 2025-05-21 2:35 ` Alejandro Jimenez
2025-05-21 6:21 ` Sairaj Kodilkar
0 siblings, 1 reply; 54+ messages in thread
From: Alejandro Jimenez @ 2025-05-21 2:35 UTC (permalink / raw)
To: Sairaj Kodilkar, qemu-devel
Cc: pbonzini, richard.henderson, eduardo, peterx, david, philmd, mst,
marcel.apfelbaum, alex.williamson, vasant.hegde,
suravee.suthikulpanit, santosh.shukla, Wei.Huang2,
clement.mathieu--drif, ethan.milon, joao.m.martins,
boris.ostrovsky
Hi Sairaj
On 5/16/25 4:07 AM, Sairaj Kodilkar wrote:
>
>
> On 5/2/2025 7:45 AM, Alejandro Jimenez wrote:
> Hi Alejandro,
>
> Tested the v2; everything looks good when I boot the guest with an upstream
> kernel. But I observed that the NVME driver fails to load with guest kernel
> version 4.15.0-213-generic. This is the default kernel that comes with
> the Ubuntu image.
Thank you for the additional testing and for the report. I wanted to
investigate and if possible solve the issue before replying, but since
it is taking me some time I wanted to ACK your message. Minor comments
below...
>
> This is what I see in the dmesg
>
> [ 26.702381] nvme nvme0: pci function 0000:00:04.0
> [ 26.817847] nvme nvme0: missing or invalid SUBNQN field.
There are multiple reports of that warning, which would indicate that it is
not caused by an issue with the IOMMU emulation, but it is interesting
that you don't see it with "regular passthrough" (I assume that means
with guest kernel in pt mode).
>
> I am using following command qemu command line
>
> -enable-kvm -m 10G -smp cpus=$NUM_VCPUS \
> -device amd-iommu,dma-remap=on \
> -netdev user,id=USER0,hostfwd=tcp::3333-:22 \
> -device virtio-net-pci,id=vnet0,iommu_platform=on,disable-legacy=on,romfile=,netdev=USER0 \
> -cpu EPYC-Genoa,x2apic=on,kvm-msi-ext-dest-id=on,+kvm-pv-unhalt,kvm-pv-tlb-flush,kvm-pv-ipi,kvm-pv-sched-yield \
> -name guest=my-vm,debug-threads=on \
> -machine q35,kernel_irqchip=split \
> -global kvm-pit.lost_tick_policy=discard \
> -nographic -vga none -chardev stdio,id=STDIO0,signal=off,mux=on \
> -device isa-serial,id=isa-serial0,chardev=STDIO0 \
> -smbios type=0,version=2.8 \
> -blockdev node-name=drive0,driver=qcow2,file.driver=file,file.filename=$IMG \
> -device virtio-blk-pci,num-queues=8,drive=drive0 \
> -chardev socket,id=SOCKET1,server=on,wait=off,path=qemu.mon.user3333 \
> -mon chardev=SOCKET1,mode=control \
> -device vfio-pci,host=0000:44:00.0
>
> Do you have any idea what might trigger this?
There are some parameters above that are unnecessary and perhaps
conflicting e.g. we don't need kvm-msi-ext-dest-id=on since the vIOMMU
provides interrupt remapping (plus you are likely not using more than
255 vCPUs). We also don't need kvm-pit.lost_tick_policy when using split
irqchip, since the PIT is not emulated by KVM. But to be fair I don't
believe those are likely to be causing the problem...
My main suspicion is the guest IOMMU driver being too old and missing
lots of fixes, so it could be missing some essential operations that the
emulation requires to work. e.g. if the guest driver does not comply
with the spec and fails to issue a DEVTAB_INVALIDATE after changing the
DTE, the vIOMMU code never gets the chance to enable the IOMMU memory
region, and it all goes wrong from that point on.
But I need to reproduce the problem and figure out where/when the
emulation is failing. I've tested as far back as 5.15 based kernels.
I would argue that while it is something that I am definitely going to
address if possible, this issue should not be a blocker. I'll update as
soon as I have more data on the cause.
Thank you,
Alejandro
>
> I see the error only when I am using the emulated AMD IOMMU with a
> passthrough device. Regular passthrough works fine.
>
> Regards
> Sairaj Kodilkar
>
> P.S. I know that the guest kernel is quite old but still wanted to make
> you aware.
>
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH v2 00/20] AMD vIOMMU: DMA remapping support for VFIO devices
2025-05-21 2:35 ` Alejandro Jimenez
@ 2025-05-21 6:21 ` Sairaj Kodilkar
0 siblings, 0 replies; 54+ messages in thread
From: Sairaj Kodilkar @ 2025-05-21 6:21 UTC (permalink / raw)
To: Alejandro Jimenez, qemu-devel
Cc: pbonzini, richard.henderson, eduardo, peterx, david, philmd, mst,
marcel.apfelbaum, alex.williamson, vasant.hegde,
suravee.suthikulpanit, santosh.shukla, Wei.Huang2,
clement.mathieu--drif, ethan.milon, joao.m.martins,
boris.ostrovsky
On 5/21/2025 8:05 AM, Alejandro Jimenez wrote:
> Hi Sairaj
>
> On 5/16/25 4:07 AM, Sairaj Kodilkar wrote:
>>
>>
>> On 5/2/2025 7:45 AM, Alejandro Jimenez wrote:
>
>> Hi Alejandro,
>>
>> Tested the v2; everything looks good when I boot the guest with an upstream
>> kernel. But I observed that the NVME driver fails to load with guest kernel
>> version 4.15.0-213-generic. This is the default kernel that comes with
>> the Ubuntu image.
>
> Thank you for the additional testing and for the report. I wanted to
> investigate and if possible solve the issue before replying, but since
> it is taking me some time I wanted to ACK your message. Minor comments
> below...
>>
>> This is what I see in the dmesg
>>
>> [ 26.702381] nvme nvme0: pci function 0000:00:04.0
>> [ 26.817847] nvme nvme0: missing or invalid SUBNQN field.
>
> There are multiple reports of that warning, which would indicate that it is
> not caused by an issue with the IOMMU emulation, but it is interesting
> that you don't see it with "regular passthrough" (I assume that means
> with guest kernel in pt mode).
>
Yep. The "regular passthrough" is a guest without amd-iommu, or with pt=on.
>>
>> I am using following command qemu command line
>>
>> -enable-kvm -m 10G -smp cpus=$NUM_VCPUS \
>> -device amd-iommu,dma-remap=on \
>> -netdev user,id=USER0,hostfwd=tcp::3333-:22 \
>> -device virtio-net-pci,id=vnet0,iommu_platform=on,disable-legacy=on,romfile=,netdev=USER0 \
>> -cpu EPYC-Genoa,x2apic=on,kvm-msi-ext-dest-id=on,+kvm-pv-unhalt,kvm-pv-tlb-flush,kvm-pv-ipi,kvm-pv-sched-yield \
>> -name guest=my-vm,debug-threads=on \
>> -machine q35,kernel_irqchip=split \
>> -global kvm-pit.lost_tick_policy=discard \
>> -nographic -vga none -chardev stdio,id=STDIO0,signal=off,mux=on \
>> -device isa-serial,id=isa-serial0,chardev=STDIO0 \
>> -smbios type=0,version=2.8 \
>> -blockdev node-name=drive0,driver=qcow2,file.driver=file,file.filename=$IMG \
>> -device virtio-blk-pci,num-queues=8,drive=drive0 \
>> -chardev socket,id=SOCKET1,server=on,wait=off,path=qemu.mon.user3333 \
>> -mon chardev=SOCKET1,mode=control \
>> -device vfio-pci,host=0000:44:00.0
>>
>> Do you have any idea what might trigger this?
>
> There are some parameters above that are unnecessary and perhaps
> conflicting e.g. we don't need kvm-msi-ext-dest-id=on since the vIOMMU
> provides interrupt remapping (plus you are likely not using more than
> 255 vCPUs). We also don't need kvm-pit.lost_tick_policy when using split
> irqchip, since the PIT is not emulated by KVM. But to be fair I don't
> believe those are likely to be causing the problem...
Thanks for letting me know, I'll update the script.
>
> My main suspicion is the guest IOMMU driver being too old and missing
> lots of fixes, so it could be missing some essential operations that the
> emulation requires to work. e.g. if the guest driver does not comply
> with the spec and fails to issue a DEVTAB_INVALIDATE after changing the
> DTE, the vIOMMU code never gets the chance to enable the IOMMU memory
> region, and it all goes wrong from that point on.
> But I need to reproduce the problem and figure out where/when the
> emulation is failing. I've tested as far back as 5.15 based kernels.
>
> I would argue that while it is something that I am definitely going to
> address if possible, this issue should not be a blocker. I'll update as
> soon as I have more data on the cause.
>
> Thank you,
> Alejandro
>
I also think the same. This may be some old driver issue and we should
not block on it.
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Regards
Sairaj Kodilkar
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH v2 00/20] AMD vIOMMU: DMA remapping support for VFIO devices
2025-05-02 2:15 [PATCH v2 00/20] AMD vIOMMU: DMA remapping support for VFIO devices Alejandro Jimenez
` (21 preceding siblings ...)
2025-05-16 8:07 ` Sairaj Kodilkar
@ 2025-05-30 11:41 ` Michael S. Tsirkin
2025-05-30 14:39 ` Alejandro Jimenez
22 siblings, 1 reply; 54+ messages in thread
From: Michael S. Tsirkin @ 2025-05-30 11:41 UTC (permalink / raw)
To: Alejandro Jimenez
Cc: qemu-devel, pbonzini, richard.henderson, eduardo, peterx, david,
philmd, marcel.apfelbaum, alex.williamson, vasant.hegde,
suravee.suthikulpanit, santosh.shukla, sarunkod, Wei.Huang2,
clement.mathieu--drif, ethan.milon, joao.m.martins,
boris.ostrovsky
On Fri, May 02, 2025 at 02:15:45AM +0000, Alejandro Jimenez wrote:
> This series adds support for guests using the AMD vIOMMU to enable DMA
> remapping for VFIO devices. In addition to the currently supported
> passthrough (PT) mode, guest kernels are now able to provide DMA
> address translation and access permission checking to VFs attached to
> paging domains, using the AMD v1 I/O page table format.
>
> Please see v1[0] cover letter for additional details such as example
> QEMU command line parameters used in testing.
are you working on v3? there was a bug you wanted to fix.
> Changes since v1[0]:
> - Added documentation entry for '-device amd-iommu'
> - Code movement with no functional changes to avoid use of forward
> declarations in later patches [Sairaj, mst]
> - Moved addr_translation and dma-remap property to separate commits.
> The dma-remap feature is only available for users to enable after
> all required functionality is implemented [Sairaj]
> - Explicit initialization of significant fields like addr_translation
> and notifier_flags [Sairaj]
> - Fixed bug in decoding of invalidation size [Sairaj]
> - Changed fetch_pte() to use an out parameter for pte, and be able to
> check for error conditions via negative return value [Clement]
> - Removed UNMAP-only notifier optimization, leaving vhost support for
> later series [Sairaj]
> - Fixed ordering between address space unmap and memory region activation
> on devtab invalidation [Sairaj]
> - Fixed commit message with "V=1, TV=0" [Sairaj]
> - Dropped patch removing the page_fault event. That area is better
> addressed in separate series.
> - Independent testing by Sairaj (thank you!)
>
> Thank you,
> Alejandro
>
> [0] https://lore.kernel.org/all/20250414020253.443831-1-alejandro.j.jimenez@oracle.com/
>
> Alejandro Jimenez (20):
> memory: Adjust event ranges to fit within notifier boundaries
> amd_iommu: Document '-device amd-iommu' common options
> amd_iommu: Reorder device and page table helpers
> amd_iommu: Helper to decode size of page invalidation command
> amd_iommu: Add helper function to extract the DTE
> amd_iommu: Return an error when unable to read PTE from guest memory
> amd_iommu: Add helpers to walk AMD v1 Page Table format
> amd_iommu: Add a page walker to sync shadow page tables on
> invalidation
> amd_iommu: Add basic structure to support IOMMU notifier updates
> amd_iommu: Sync shadow page tables on page invalidation
> amd_iommu: Use iova_tree records to determine large page size on UNMAP
> amd_iommu: Unmap all address spaces under the AMD IOMMU on reset
> amd_iommu: Add replay callback
> amd_iommu: Invalidate address translations on INVALIDATE_IOMMU_ALL
> amd_iommu: Toggle memory regions based on address translation mode
> amd_iommu: Set all address spaces to default translation mode on reset
> amd_iommu: Add dma-remap property to AMD vIOMMU device
> amd_iommu: Toggle address translation mode on devtab entry
> invalidation
> amd_iommu: Do not assume passthrough translation when DTE[TV]=0
> amd_iommu: Refactor amdvi_page_walk() to use common code for page walk
>
> hw/i386/amd_iommu.c | 1005 ++++++++++++++++++++++++++++++++++++-------
> hw/i386/amd_iommu.h | 52 +++
> qemu-options.hx | 23 +
> system/memory.c | 10 +-
> 4 files changed, 934 insertions(+), 156 deletions(-)
>
>
> base-commit: 5134cf9b5d3aee4475fe7e1c1c11b093731073cf
> --
> 2.43.5
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH v2 00/20] AMD vIOMMU: DMA remapping support for VFIO devices
2025-05-30 11:41 ` Michael S. Tsirkin
@ 2025-05-30 14:39 ` Alejandro Jimenez
2025-06-02 4:49 ` Sairaj Kodilkar
0 siblings, 1 reply; 54+ messages in thread
From: Alejandro Jimenez @ 2025-05-30 14:39 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: qemu-devel, pbonzini, richard.henderson, eduardo, peterx, david,
philmd, marcel.apfelbaum, alex.williamson, vasant.hegde,
suravee.suthikulpanit, santosh.shukla, sarunkod, Wei.Huang2,
clement.mathieu--drif, ethan.milon, joao.m.martins,
boris.ostrovsky
On 5/30/25 7:41 AM, Michael S. Tsirkin wrote:
> On Fri, May 02, 2025 at 02:15:45AM +0000, Alejandro Jimenez wrote:
>> This series adds support for guests using the AMD vIOMMU to enable DMA
>> remapping for VFIO devices. In addition to the currently supported
>> passthrough (PT) mode, guest kernels are now able to provide DMA
>> address translation and access permission checking to VFs attached to
>> paging domains, using the AMD v1 I/O page table format.
>>
>> Please see v1[0] cover letter for additional details such as example
>> QEMU command line parameters used in testing.
>
> are you working on v3?
Yes, there are suggestions from Sairaj that I will address in v3. I am
also planning to include two small patches from Joao Martins that add
support for the HATDis feature (this is something that Sairaj suggested
earlier). The Linux changes are being reviewed here:
https://lore.kernel.org/all/cover.1746613368.git.Ankit.Soni@amd.com/
I will be offline from 6/2 to 6/6, so I didn't want to send a new
revision and disappear. In general, the changes from v2->v3 are minor
and well contained, so any reviews I receive for v2 will be valid.
That being said, I can send v3 today if you'd prefer that. Please let me
know.
> there was a bug you wanted to fix.
>
I assume the bug is Sairaj's report of a dmesg warning with an NVME
passthrough on a 4.15 kernel, but unfortunately I have not been able to
reproduce that problem. We agreed that given the age of the kernel (and
reports of the same warning on NVME devices in unrelated scenarios),
this is likely a guest driver issue, and should not be a blocker.
More details:
I have tested an Ubuntu image with a 4.15 kernel, but I cannot hit any
issues when I pass through a CX-6 VF (I don't have access to an NVMe VF).
The kernel is old enough that I have to force bind the mlx5_core driver
to the VF on the guest, but once I do the VF comes up with no errors and
I can see DMA map/unmap activity in the traces.
Sairaj: Are you passing a full NVMe device to the guest (i.e. a PF)? I
ask because the BDF in '-device vfio-pci,host=0000:44:00.0' doesn't look
like a typical VF...
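For reference, a quick way to tell is the 'physfn' symlink the kernel
creates under a VF's sysfs node (a sketch based on the standard Linux
PCI sysfs layout; the BDF below is the one from your command line):

```shell
# Return success (0) if the given BDF is an SR-IOV VF: the kernel
# exposes a 'physfn' symlink from a VF back to its parent PF.
# $1 = sysfs PCI devices root (normally /sys/bus/pci/devices)
# $2 = device BDF
is_vf() {
    [ -e "$1/$2/physfn" ]
}

# Example against the real sysfs tree:
# is_vf /sys/bus/pci/devices 0000:44:00.0 && echo VF || echo "PF or non-SR-IOV"
```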
Thank you,
Alejandro
>> Changes since v1[0]:
>> - Added documentation entry for '-device amd-iommu'
>> - Code movement with no functional changes to avoid use of forward
>> declarations in later patches [Sairaj, mst]
>> - Moved addr_translation and dma-remap property to separate commits.
>> The dma-remap feature is only available for users to enable after
>> all required functionality is implemented [Sairaj]
>> - Explicit initialization of significant fields like addr_translation
>> and notifier_flags [Sairaj]
>> - Fixed bug in decoding of invalidation size [Sairaj]
>> - Changed fetch_pte() to use an out parameter for pte, and be able to
>> check for error conditions via negative return value [Clement]
>> - Removed UNMAP-only notifier optimization, leaving vhost support for
>> later series [Sairaj]
>> - Fixed ordering between address space unmap and memory region activation
>> on devtab invalidation [Sairaj]
>> - Fixed commit message with "V=1, TV=0" [Sairaj]
>> - Dropped patch removing the page_fault event. That area is better
>> addressed in separate series.
>> - Independent testing by Sairaj (thank you!)
>>
>> Thank you,
>> Alejandro
>>
>> [0] https://lore.kernel.org/all/20250414020253.443831-1-alejandro.j.jimenez@oracle.com/
>>
>> Alejandro Jimenez (20):
>> memory: Adjust event ranges to fit within notifier boundaries
>> amd_iommu: Document '-device amd-iommu' common options
>> amd_iommu: Reorder device and page table helpers
>> amd_iommu: Helper to decode size of page invalidation command
>> amd_iommu: Add helper function to extract the DTE
>> amd_iommu: Return an error when unable to read PTE from guest memory
>> amd_iommu: Add helpers to walk AMD v1 Page Table format
>> amd_iommu: Add a page walker to sync shadow page tables on
>> invalidation
>> amd_iommu: Add basic structure to support IOMMU notifier updates
>> amd_iommu: Sync shadow page tables on page invalidation
>> amd_iommu: Use iova_tree records to determine large page size on UNMAP
>> amd_iommu: Unmap all address spaces under the AMD IOMMU on reset
>> amd_iommu: Add replay callback
>> amd_iommu: Invalidate address translations on INVALIDATE_IOMMU_ALL
>> amd_iommu: Toggle memory regions based on address translation mode
>> amd_iommu: Set all address spaces to default translation mode on reset
>> amd_iommu: Add dma-remap property to AMD vIOMMU device
>> amd_iommu: Toggle address translation mode on devtab entry
>> invalidation
>> amd_iommu: Do not assume passthrough translation when DTE[TV]=0
>> amd_iommu: Refactor amdvi_page_walk() to use common code for page walk
>>
>> hw/i386/amd_iommu.c | 1005 ++++++++++++++++++++++++++++++++++++-------
>> hw/i386/amd_iommu.h | 52 +++
>> qemu-options.hx | 23 +
>> system/memory.c | 10 +-
>> 4 files changed, 934 insertions(+), 156 deletions(-)
>>
>>
>> base-commit: 5134cf9b5d3aee4475fe7e1c1c11b093731073cf
>> --
>> 2.43.5
>
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH v2 00/20] AMD vIOMMU: DMA remapping support for VFIO devices
2025-05-30 14:39 ` Alejandro Jimenez
@ 2025-06-02 4:49 ` Sairaj Kodilkar
0 siblings, 0 replies; 54+ messages in thread
From: Sairaj Kodilkar @ 2025-06-02 4:49 UTC (permalink / raw)
To: Alejandro Jimenez, Michael S. Tsirkin
Cc: qemu-devel, pbonzini, richard.henderson, eduardo, peterx, david,
philmd, marcel.apfelbaum, alex.williamson, vasant.hegde,
suravee.suthikulpanit, santosh.shukla, Wei.Huang2,
clement.mathieu--drif, ethan.milon, joao.m.martins,
boris.ostrovsky
> Sairaj: Are you passing a full NVME device to the guest (i.e. a PF)? I
> ask because the BDF in '-device vfio-pci,host=0000:44:00.0' doesn't look
> like a typical VF...
>
Hey Alejandro,
I am passing a full NVMe device to the guest (not just a VF).
Thanks
Sairaj