* [PATCH 1/2] intel_iommu: Only set dirty bit when PTE exposes write permission
@ 2026-03-10 6:57 CLEMENT MATHIEU--DRIF
2026-03-10 6:57 ` [PATCH 2/2] intel_iommu: Always write all the flags passed to vtd_set_flag_in_pte CLEMENT MATHIEU--DRIF
` (2 more replies)
0 siblings, 3 replies; 10+ messages in thread
From: CLEMENT MATHIEU--DRIF @ 2026-03-10 6:57 UTC (permalink / raw)
To: qemu-devel@nongnu.org
Cc: mst@redhat.com, jasowang@redhat.com, yi.l.liu@intel.com,
pbonzini@redhat.com, richard.henderson@linaro.org,
eduardo@habkost.net, marcel.apfelbaum@gmail.com,
zhenzhong.duan@intel.com, CLEMENT MATHIEU--DRIF
Avoid unintentionally flagging the entry as a shadow stack entry.
In the current implementation, the dirty bit is always set in the pte.
Hence, an ATS device requesting an RW translation for a copy-on-write
page is likely to trigger a PRI request for a region that has just been
marked as dirty by the IOMMU. However, CPUs that support shadow stacks
give special meaning to PTEs reporting W=0 and D=1. Setting these values
can cause PRI requests to complete successfully without granting the
expected write permission, which leads the device to enter an infinite
loop of ATS/PRI requests.
Fixes: 65c4f0999991 ("intel_iommu: Set accessed and dirty bits during stage-1 translation")
Signed-off-by: Clement Mathieu--Drif <clement.mathieu--drif@bull.com>
---
hw/i386/intel_iommu.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index d24ba989bf..56146aafc1 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -2066,7 +2066,7 @@ static int vtd_iova_to_fspte(IntelIOMMUState *s, VTDContextEntry *ce,
return -VTD_FR_FS_PAGING_ENTRY_RSVD;
}
- if (vtd_is_last_pte(fspte, *fspte_level) && is_write) {
+ if (vtd_is_last_pte(fspte, *fspte_level) && *writes && is_write) {
flag_ad |= VTD_FS_D;
}
--
2.53.0
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH 2/2] intel_iommu: Always write all the flags passed to vtd_set_flag_in_pte
2026-03-10 6:57 [PATCH 1/2] intel_iommu: Only set dirty bit when PTE exposes write permission CLEMENT MATHIEU--DRIF
@ 2026-03-10 6:57 ` CLEMENT MATHIEU--DRIF
2026-03-11 11:28 ` Duan, Zhenzhong
2026-03-12 3:31 ` Yi Liu
2026-03-12 2:32 ` [PATCH 1/2] intel_iommu: Only set dirty bit when PTE exposes write permission Duan, Zhenzhong
2026-03-12 3:30 ` Yi Liu
2 siblings, 2 replies; 10+ messages in thread
From: CLEMENT MATHIEU--DRIF @ 2026-03-10 6:57 UTC (permalink / raw)
To: qemu-devel@nongnu.org
Cc: mst@redhat.com, jasowang@redhat.com, yi.l.liu@intel.com,
pbonzini@redhat.com, richard.henderson@linaro.org,
eduardo@habkost.net, marcel.apfelbaum@gmail.com,
zhenzhong.duan@intel.com, CLEMENT MATHIEU--DRIF
This function was initially designed to write a single flag in a pte.
The current implementation expects it to set several flags.
Fixes: 65c4f0999991 ("intel_iommu: Set accessed and dirty bits during stage-1 translation")
Signed-off-by: Clement Mathieu--Drif <clement.mathieu--drif@bull.com>
---
hw/i386/intel_iommu.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 56146aafc1..441c96125b 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -1991,13 +1991,13 @@ static bool vtd_iova_fs_check_canonical(IntelIOMMUState *s, uint64_t iova,
}
}
-static MemTxResult vtd_set_flag_in_pte(dma_addr_t base_addr, uint32_t index,
- uint64_t pte, uint64_t flag)
+static MemTxResult vtd_set_flags_in_pte(dma_addr_t base_addr, uint32_t index,
+ uint64_t pte, uint64_t flags)
{
- if (pte & flag) {
+ if ((pte & flags) == flags) {
return MEMTX_OK;
}
- pte |= flag;
+ pte |= flags;
pte = cpu_to_le64(pte);
return dma_memory_write(&address_space_memory,
base_addr + index * sizeof(pte),
@@ -2070,7 +2070,7 @@ static int vtd_iova_to_fspte(IntelIOMMUState *s, VTDContextEntry *ce,
flag_ad |= VTD_FS_D;
}
- if (vtd_set_flag_in_pte(addr, offset, fspte, flag_ad) != MEMTX_OK) {
+ if (vtd_set_flags_in_pte(addr, offset, fspte, flag_ad) != MEMTX_OK) {
return -VTD_FR_FS_BIT_UPDATE_FAILED;
}
--
2.53.0
^ permalink raw reply related [flat|nested] 10+ messages in thread
* RE: [PATCH 2/2] intel_iommu: Always write all the flags passed to vtd_set_flag_in_pte
2026-03-10 6:57 ` [PATCH 2/2] intel_iommu: Always write all the flags passed to vtd_set_flag_in_pte CLEMENT MATHIEU--DRIF
@ 2026-03-11 11:28 ` Duan, Zhenzhong
2026-03-12 3:31 ` Yi Liu
1 sibling, 0 replies; 10+ messages in thread
From: Duan, Zhenzhong @ 2026-03-11 11:28 UTC (permalink / raw)
To: CLEMENT MATHIEU--DRIF, qemu-devel@nongnu.org
Cc: mst@redhat.com, jasowang@redhat.com, Liu, Yi L,
pbonzini@redhat.com, richard.henderson@linaro.org,
eduardo@habkost.net, marcel.apfelbaum@gmail.com
>-----Original Message-----
>From: CLEMENT MATHIEU--DRIF <clement.mathieu--drif@bull.com>
>Subject: [PATCH 2/2] intel_iommu: Always write all the flags passed to
>vtd_set_flag_in_pte
>
>This function was initially designed to write a single flag in a pte.
>The current implementation expects it to set several flags.
>
>Fixes: 65c4f0999991 ("intel_iommu: Set accessed and dirty bits during stage-1
>translation")
>Signed-off-by: Clement Mathieu--Drif <clement.mathieu--drif@bull.com>
Good catch.
Reviewed-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
BRs,
Zhenzhong
>---
> hw/i386/intel_iommu.c | 10 +++++-----
> 1 file changed, 5 insertions(+), 5 deletions(-)
>
>diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>index 56146aafc1..441c96125b 100644
>--- a/hw/i386/intel_iommu.c
>+++ b/hw/i386/intel_iommu.c
>@@ -1991,13 +1991,13 @@ static bool
>vtd_iova_fs_check_canonical(IntelIOMMUState *s, uint64_t iova,
> }
> }
>
>-static MemTxResult vtd_set_flag_in_pte(dma_addr_t base_addr, uint32_t index,
>- uint64_t pte, uint64_t flag)
>+static MemTxResult vtd_set_flags_in_pte(dma_addr_t base_addr, uint32_t index,
>+ uint64_t pte, uint64_t flags)
> {
>- if (pte & flag) {
>+ if ((pte & flags) == flags) {
> return MEMTX_OK;
> }
>- pte |= flag;
>+ pte |= flags;
> pte = cpu_to_le64(pte);
> return dma_memory_write(&address_space_memory,
> base_addr + index * sizeof(pte),
>@@ -2070,7 +2070,7 @@ static int vtd_iova_to_fspte(IntelIOMMUState *s,
>VTDContextEntry *ce,
> flag_ad |= VTD_FS_D;
> }
>
>- if (vtd_set_flag_in_pte(addr, offset, fspte, flag_ad) != MEMTX_OK) {
>+ if (vtd_set_flags_in_pte(addr, offset, fspte, flag_ad) != MEMTX_OK) {
> return -VTD_FR_FS_BIT_UPDATE_FAILED;
> }
>
>--
>2.53.0
^ permalink raw reply [flat|nested] 10+ messages in thread
* RE: [PATCH 1/2] intel_iommu: Only set dirty bit when PTE exposes write permission
2026-03-10 6:57 [PATCH 1/2] intel_iommu: Only set dirty bit when PTE exposes write permission CLEMENT MATHIEU--DRIF
2026-03-10 6:57 ` [PATCH 2/2] intel_iommu: Always write all the flags passed to vtd_set_flag_in_pte CLEMENT MATHIEU--DRIF
@ 2026-03-12 2:32 ` Duan, Zhenzhong
2026-03-12 3:30 ` Yi Liu
2 siblings, 0 replies; 10+ messages in thread
From: Duan, Zhenzhong @ 2026-03-12 2:32 UTC (permalink / raw)
To: CLEMENT MATHIEU--DRIF, qemu-devel@nongnu.org
Cc: mst@redhat.com, jasowang@redhat.com, Liu, Yi L,
pbonzini@redhat.com, richard.henderson@linaro.org,
eduardo@habkost.net, marcel.apfelbaum@gmail.com
Hi Clement,
>-----Original Message-----
>From: CLEMENT MATHIEU--DRIF <clement.mathieu--drif@bull.com>
>Subject: [PATCH 1/2] intel_iommu: Only set dirty bit when PTE exposes write
>permission
>
>Avoid unintentionally flagging the entry as a shadow stack entry.
Not sure if shadow stack here is what I understand about it, which holds return addresses.
What's the relationship between fspte entry and shadow stack entry which holds return address?
>In the current implementation, the dirty bit is always set in the pte.
>Hence, an ATS device requesting an RW translation for a copy-on-write
>page is likely to trigger a PRI request for a region that has just been
>marked as dirty by the IOMMU. However, CPUs that support shadow stacks
>give special meaning to PTEs reporting W=0 and D=1. Setting these values
Could you share a bit more details about the special meaning?
Thanks
Zhenzhong
>can cause PRI requests to complete successfully without granting the
>expected write permission, which leads the device to enter an infinite
>loop of ATS/PRI requests.
>
>Fixes: 65c4f0999991 ("intel_iommu: Set accessed and dirty bits during stage-1
>translation")
>Signed-off-by: Clement Mathieu--Drif <clement.mathieu--drif@bull.com>
>---
> hw/i386/intel_iommu.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
>diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>index d24ba989bf..56146aafc1 100644
>--- a/hw/i386/intel_iommu.c
>+++ b/hw/i386/intel_iommu.c
>@@ -2066,7 +2066,7 @@ static int vtd_iova_to_fspte(IntelIOMMUState *s,
>VTDContextEntry *ce,
> return -VTD_FR_FS_PAGING_ENTRY_RSVD;
> }
>
>- if (vtd_is_last_pte(fspte, *fspte_level) && is_write) {
>+ if (vtd_is_last_pte(fspte, *fspte_level) && *writes && is_write) {
> flag_ad |= VTD_FS_D;
> }
>
>--
>2.53.0
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 1/2] intel_iommu: Only set dirty bit when PTE exposes write permission
2026-03-10 6:57 [PATCH 1/2] intel_iommu: Only set dirty bit when PTE exposes write permission CLEMENT MATHIEU--DRIF
2026-03-10 6:57 ` [PATCH 2/2] intel_iommu: Always write all the flags passed to vtd_set_flag_in_pte CLEMENT MATHIEU--DRIF
2026-03-12 2:32 ` [PATCH 1/2] intel_iommu: Only set dirty bit when PTE exposes write permission Duan, Zhenzhong
@ 2026-03-12 3:30 ` Yi Liu
2026-03-12 6:20 ` CLEMENT MATHIEU--DRIF
2 siblings, 1 reply; 10+ messages in thread
From: Yi Liu @ 2026-03-12 3:30 UTC (permalink / raw)
To: CLEMENT MATHIEU--DRIF, qemu-devel@nongnu.org
Cc: mst@redhat.com, jasowang@redhat.com, pbonzini@redhat.com,
richard.henderson@linaro.org, eduardo@habkost.net,
marcel.apfelbaum@gmail.com, zhenzhong.duan@intel.com
On 3/10/26 14:57, CLEMENT MATHIEU--DRIF wrote:
> Avoid unintentionally flagging the entry as a shadow stack entry.
hmmm. not quite get shadow stack...
> In the current implementation, the dirty bit is always set in the pte.
> Hence, an ATS device requesting an RW translation for a copy-on-write
> page is likely to trigger a PRI request for a region that has just been
> marked as dirty by the IOMMU. However, CPUs that support shadow stacks
> give special meaning to PTEs reporting W=0 and D=1. Setting these values
> can cause PRI requests to complete successfully without granting the
> expected write permission, which leads the device to enter an infinite
> loop of ATS/PRI requests.
appreciate a clearer explanation here.
>
> Fixes: 65c4f0999991 ("intel_iommu: Set accessed and dirty bits during stage-1 translation")
> Signed-off-by: Clement Mathieu--Drif <clement.mathieu--drif@bull.com>
> ---
> hw/i386/intel_iommu.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index d24ba989bf..56146aafc1 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -2066,7 +2066,7 @@ static int vtd_iova_to_fspte(IntelIOMMUState *s, VTDContextEntry *ce,
> return -VTD_FR_FS_PAGING_ENTRY_RSVD;
> }
>
> - if (vtd_is_last_pte(fspte, *fspte_level) && is_write) {
> + if (vtd_is_last_pte(fspte, *fspte_level) && *writes && is_write) {
> flag_ad |= VTD_FS_D;
> }
>
the change looks reasonable.
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 2/2] intel_iommu: Always write all the flags passed to vtd_set_flag_in_pte
2026-03-10 6:57 ` [PATCH 2/2] intel_iommu: Always write all the flags passed to vtd_set_flag_in_pte CLEMENT MATHIEU--DRIF
2026-03-11 11:28 ` Duan, Zhenzhong
@ 2026-03-12 3:31 ` Yi Liu
1 sibling, 0 replies; 10+ messages in thread
From: Yi Liu @ 2026-03-12 3:31 UTC (permalink / raw)
To: CLEMENT MATHIEU--DRIF, qemu-devel@nongnu.org
Cc: mst@redhat.com, jasowang@redhat.com, pbonzini@redhat.com,
richard.henderson@linaro.org, eduardo@habkost.net,
marcel.apfelbaum@gmail.com, zhenzhong.duan@intel.com
On 3/10/26 14:57, CLEMENT MATHIEU--DRIF wrote:
> This function was initially designed to write a single flag in a pte.
> The current implementation expects it to set several flags.
>
> Fixes: 65c4f0999991 ("intel_iommu: Set accessed and dirty bits during stage-1 translation")
> Signed-off-by: Clement Mathieu--Drif <clement.mathieu--drif@bull.com>
> ---
> hw/i386/intel_iommu.c | 10 +++++-----
> 1 file changed, 5 insertions(+), 5 deletions(-)
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 1/2] intel_iommu: Only set dirty bit when PTE exposes write permission
2026-03-12 3:30 ` Yi Liu
@ 2026-03-12 6:20 ` CLEMENT MATHIEU--DRIF
2026-03-13 8:24 ` Yi Liu
0 siblings, 1 reply; 10+ messages in thread
From: CLEMENT MATHIEU--DRIF @ 2026-03-12 6:20 UTC (permalink / raw)
To: Yi Liu, CLEMENT MATHIEU--DRIF, qemu-devel@nongnu.org
Cc: mst@redhat.com, jasowang@redhat.com, pbonzini@redhat.com,
richard.henderson@linaro.org, eduardo@habkost.net,
marcel.apfelbaum@gmail.com, zhenzhong.duan@intel.com
Hi Yi, Zhenzhong,
More details here: https://www.kernel.org/doc/html/next/x86/shstk.html
During my debugging sessions, I got the following flow:
- Device sends an ATS request to the IOMMU without the no-write flag (so it requests RW permissions)
- IOMMU walks the first stage page table, finds the last pte and notices that the W permission is not set (I know this page not really RO, I prepared it to make sure it is COW)
- Because of the condition I update in the patch, the IOMMU sets the dirty bit despite the absence of the R/W bit
- The devices gets the translation completion and decides to issue a PRI request
- The kernel runs the handler, schedules the work, and ends up calling iommu_sva_handle_mm (with flag FAULT_FLAG_WRITE)
- It then calls handle_mm_fault, __handle_mm_fault and handle_pte_fault
- In handle_pte_fault, it runs the following fragment:
```
if (vmf->flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) { /* This is true */
if (!pte_write(entry))
return do_wp_page(vmf);
else if (likely(vmf->flags & FAULT_FLAG_WRITE))
entry = pte_mkdirty(entry);
}
```
- pte_write is define as follows
```
int pte_write(pte_t pte)
{
/*
* Shadow stack pages are logically writable, but do not have
* _PAGE_RW. Check for them separately from _PAGE_RW itself.
*/
return (pte_flags(pte) & _PAGE_RW) || pte_shstk(pte);
}
bool pte_shstk(pte_t pte)
{
return cpu_feature_enabled(X86_FEATURE_SHSTK) &&
(pte_flags(pte) & (_PAGE_RW | _PAGE_DIRTY)) == _PAGE_DIRTY;
}
```
- The pte has D=1 W=0 so the first half of the pte_write condition is false
- As the CPU I used for testing has the shstk feature, pte_shstk is true
- Hence, pte_write is true, which means we will run `entry = pte_mkdirty(entry);` in handle_pte_fault
- After the fault handler has completed, the write permission is still not granted
- The PRI request returns a success completion code
- Device requests the translation again to the IOMMU, and goes back to the first step, trapped forever.
Note that the second patch you reviewed was masking this behavior as it was preventing the dirty bit from being added to the pte most of the time.
This is the reason why I fix them in this order o.O
cmd
On Thu, 2026-03-12 at 11:30 +0800, Yi Liu wrote:
> On 3/10/26 14:57, CLEMENT MATHIEU--DRIF wrote:
>
> > Avoid unintentionally flagging the entry as a shadow stack entry.
>
>
> hmmm. not quite get shadow stack...
>
>
> > In the current implementation, the dirty bit is always set in the pte.
> > Hence, an ATS device requesting an RW translation for a copy-on-write
> > page is likely to trigger a PRI request for a region that has just been
> > marked as dirty by the IOMMU. However, CPUs that support shadow stacks
> > give special meaning to PTEs reporting W=0 and D=1. Setting these values
> > can cause PRI requests to complete successfully without granting the
> > expected write permission, which leads the device to enter an infinite
> > loop of ATS/PRI requests.
>
>
> appreciate a clearer explanation here.
>
>
>
> > Fixes: 65c4f0999991 ("intel_iommu: Set accessed and dirty bits during stage-1 translation")
> > Signed-off-by: Clement Mathieu--Drif <[clement.mathieu--drif@bull.com](mailto:clement.mathieu--drif@bull.com)>
> > ---
> > hw/i386/intel_iommu.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > index d24ba989bf..56146aafc1 100644
> > --- a/hw/i386/intel_iommu.c
> > +++ b/hw/i386/intel_iommu.c
> > @@ -2066,7 +2066,7 @@ static int vtd_iova_to_fspte(IntelIOMMUState *s, VTDContextEntry *ce,
> > return -VTD_FR_FS_PAGING_ENTRY_RSVD;
> > }
> >
> > - if (vtd_is_last_pte(fspte, *fspte_level) && is_write) {
> > + if (vtd_is_last_pte(fspte, *fspte_level) && *writes && is_write) {
> > flag_ad |= VTD_FS_D;
> > }
> >
>
>
> the change looks reasonable.
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 1/2] intel_iommu: Only set dirty bit when PTE exposes write permission
2026-03-12 6:20 ` CLEMENT MATHIEU--DRIF
@ 2026-03-13 8:24 ` Yi Liu
2026-03-13 14:46 ` CLEMENT MATHIEU--DRIF
0 siblings, 1 reply; 10+ messages in thread
From: Yi Liu @ 2026-03-13 8:24 UTC (permalink / raw)
To: CLEMENT MATHIEU--DRIF, qemu-devel@nongnu.org
Cc: mst@redhat.com, jasowang@redhat.com, pbonzini@redhat.com,
richard.henderson@linaro.org, eduardo@habkost.net,
marcel.apfelbaum@gmail.com, zhenzhong.duan@intel.com
On 3/12/26 14:20, CLEMENT MATHIEU--DRIF wrote:
> Hi Yi, Zhenzhong,
>
> More details here: https://www.kernel.org/doc/html/next/x86/shstk.html
>
> During my debugging sessions, I got the following flow:
>
> - Device sends an ATS request to the IOMMU without the no-write flag (so it requests RW permissions)
> - IOMMU walks the first stage page table, finds the last pte and notices that the W permission is not set (I know this page not really RO, I prepared it to make sure it is COW)
> - Because of the condition I update in the patch, the IOMMU sets the dirty bit despite the absence of the R/W bit
> - The devices gets the translation completion and decides to issue a PRI request
> - The kernel runs the handler, schedules the work, and ends up calling iommu_sva_handle_mm (with flag FAULT_FLAG_WRITE)
> - It then calls handle_mm_fault, __handle_mm_fault and handle_pte_fault
> - In handle_pte_fault, it runs the following fragment:
>
> ```
> if (vmf->flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) { /* This is true */
> if (!pte_write(entry))
> return do_wp_page(vmf);
> else if (likely(vmf->flags & FAULT_FLAG_WRITE))
> entry = pte_mkdirty(entry);
> }
> ```
>
> - pte_write is define as follows
>
> ```
> int pte_write(pte_t pte)
> {
> /*
> * Shadow stack pages are logically writable, but do not have
> * _PAGE_RW. Check for them separately from _PAGE_RW itself.
> */
> return (pte_flags(pte) & _PAGE_RW) || pte_shstk(pte);
> }
>
> bool pte_shstk(pte_t pte)
> {
> return cpu_feature_enabled(X86_FEATURE_SHSTK) &&
> (pte_flags(pte) & (_PAGE_RW | _PAGE_DIRTY)) == _PAGE_DIRTY;
> }
> ```
>
> - The pte has D=1 W=0 so the first half of the pte_write condition is false
> - As the CPU I used for testing has the shstk feature, pte_shstk is true
> - Hence, pte_write is true, which means we will run `entry = pte_mkdirty(entry);` in handle_pte_fault
> - After the fault handler has completed, the write permission is still not granted
> - The PRI request returns a success completion code
> - Device requests the translation again to the IOMMU, and goes back to the first step, trapped forever.
>
> Note that the second patch you reviewed was masking this behavior as it was preventing the dirty bit from being added to the pte most of the time.
> This is the reason why I fix them in this order o.O
Understood. You may want to refine the commit message slightly. In my
opinion, regardless of any bugs, it's essential to set the dirty bit
only when write permission is granted AND the device requests write access.
If you wish to highlight the specific issue: setting the dirty bit
without the write bit in vIOMMU creates a spoofed shadow stack PTE that
misleads the guest OS. This results in infinite PRI requests from the
device, as no actual write permission is ever granted - the guest OS
treats the PTE as a legitimate shadow stack entry and considers the
permissions already correct.
> On Thu, 2026-03-12 at 11:30 +0800, Yi Liu wrote:
>> On 3/10/26 14:57, CLEMENT MATHIEU--DRIF wrote:
>>
>>> Avoid unintentionally flagging the entry as a shadow stack entry.
>>
>>
>> hmmm. not quite get shadow stack...
>>
>>
>>> In the current implementation, the dirty bit is always set in the pte.
>>> Hence, an ATS device requesting an RW translation for a copy-on-write
>>> page is likely to trigger a PRI request for a region that has just been
>>> marked as dirty by the IOMMU. However, CPUs that support shadow stacks
>>> give special meaning to PTEs reporting W=0 and D=1. Setting these values
>>> can cause PRI requests to complete successfully without granting the
>>> expected write permission, which leads the device to enter an infinite
>>> loop of ATS/PRI requests.
>>
>>
>> appreciate a clearer explanation here.
>>
>>
>>
>>> Fixes: 65c4f0999991 ("intel_iommu: Set accessed and dirty bits during stage-1 translation")
>>> Signed-off-by: Clement Mathieu--Drif <[clement.mathieu--drif@bull.com](mailto:clement.mathieu--drif@bull.com)>
>>> ---
>>> hw/i386/intel_iommu.c | 2 +-
>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>>> index d24ba989bf..56146aafc1 100644
>>> --- a/hw/i386/intel_iommu.c
>>> +++ b/hw/i386/intel_iommu.c
>>> @@ -2066,7 +2066,7 @@ static int vtd_iova_to_fspte(IntelIOMMUState *s, VTDContextEntry *ce,
>>> return -VTD_FR_FS_PAGING_ENTRY_RSVD;
>>> }
>>>
>>> - if (vtd_is_last_pte(fspte, *fspte_level) && is_write) {
>>> + if (vtd_is_last_pte(fspte, *fspte_level) && *writes && is_write) {
>>> flag_ad |= VTD_FS_D;
>>> }
>>>
>>
>>
>> the change looks reasonable.
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 1/2] intel_iommu: Only set dirty bit when PTE exposes write permission
2026-03-13 8:24 ` Yi Liu
@ 2026-03-13 14:46 ` CLEMENT MATHIEU--DRIF
0 siblings, 0 replies; 10+ messages in thread
From: CLEMENT MATHIEU--DRIF @ 2026-03-13 14:46 UTC (permalink / raw)
To: Yi Liu, CLEMENT MATHIEU--DRIF, qemu-devel@nongnu.org
Cc: mst@redhat.com, jasowang@redhat.com, pbonzini@redhat.com,
richard.henderson@linaro.org, eduardo@habkost.net,
marcel.apfelbaum@gmail.com, zhenzhong.duan@intel.com
On Fri, 2026-03-13 at 16:24 +0800, Yi Liu wrote:
> On 3/12/26 14:20, CLEMENT MATHIEU--DRIF wrote:
>
> > Hi Yi, Zhenzhong,
> >
> > More details here: [https://www.kernel.org/doc/html/next/x86/shstk.html](https://www.kernel.org/doc/html/next/x86/shstk.html)
> >
> > During my debugging sessions, I got the following flow:
> >
> > - Device sends an ATS request to the IOMMU without the no-write flag (so it requests RW permissions)
> > - IOMMU walks the first stage page table, finds the last pte and notices that the W permission is not set (I know this page not really RO, I prepared it to make sure it is COW)
> > - Because of the condition I update in the patch, the IOMMU sets the dirty bit despite the absence of the R/W bit
> > - The devices gets the translation completion and decides to issue a PRI request
> > - The kernel runs the handler, schedules the work, and ends up calling iommu_sva_handle_mm (with flag FAULT_FLAG_WRITE)
> > - It then calls handle_mm_fault, __handle_mm_fault and handle_pte_fault
> > - In handle_pte_fault, it runs the following fragment:
> >
> > ```
> > if (vmf->flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) { /* This is true */
> > if (!pte_write(entry))
> > return do_wp_page(vmf);
> > else if (likely(vmf->flags & FAULT_FLAG_WRITE))
> > entry = pte_mkdirty(entry);
> > }
> > ```
> >
> > - pte_write is define as follows
> >
> > ```
> > int pte_write(pte_t pte)
> > {
> > /*
> > * Shadow stack pages are logically writable, but do not have
> > * _PAGE_RW. Check for them separately from _PAGE_RW itself.
> > */
> > return (pte_flags(pte) & _PAGE_RW) || pte_shstk(pte);
> > }
> >
> > bool pte_shstk(pte_t pte)
> > {
> > return cpu_feature_enabled(X86_FEATURE_SHSTK) &&
> > (pte_flags(pte) & (_PAGE_RW | _PAGE_DIRTY)) == _PAGE_DIRTY;
> > }
> > ```
> >
> > - The pte has D=1 W=0 so the first half of the pte_write condition is false
> > - As the CPU I used for testing has the shstk feature, pte_shstk is true
> > - Hence, pte_write is true, which means we will run `entry = pte_mkdirty(entry);` in handle_pte_fault
> > - After the fault handler has completed, the write permission is still not granted
> > - The PRI request returns a success completion code
> > - Device requests the translation again to the IOMMU, and goes back to the first step, trapped forever.
> >
> > Note that the second patch you reviewed was masking this behavior as it was preventing the dirty bit from being added to the pte most of the time.
> > This is the reason why I fix them in this order o.O
>
>
> Understood. You may want to refine the commit message slightly. In my
> opinion, regardless of any bugs, it's essential to set the dirty bit
> only when write permission is granted AND the device requests write access.
Yes, that's my understanding as well
>
> If you wish to highlight the specific issue: setting the dirty bit
> without the write bit in vIOMMU creates a spoofed shadow stack PTE that
> misleads the guest OS. This results in infinite PRI requests from the
> device, as no actual write permission is ever granted - the guest OS
> treats the PTE as a legitimate shadow stack entry and considers the
> permissions already correct.
Crystal clear, will repost with the updated message
>
>
> > On Thu, 2026-03-12 at 11:30 +0800, Yi Liu wrote:
> >
> > > On 3/10/26 14:57, CLEMENT MATHIEU--DRIF wrote:
> > >
> > >
> > > > Avoid unintentionally flagging the entry as a shadow stack entry.
> > >
> > >
> > >
> > > hmmm. not quite get shadow stack...
> > >
> > >
> > >
> > > > In the current implementation, the dirty bit is always set in the pte.
> > > > Hence, an ATS device requesting an RW translation for a copy-on-write
> > > > page is likely to trigger a PRI request for a region that has just been
> > > > marked as dirty by the IOMMU. However, CPUs that support shadow stacks
> > > > give special meaning to PTEs reporting W=0 and D=1. Setting these values
> > > > can cause PRI requests to complete successfully without granting the
> > > > expected write permission, which leads the device to enter an infinite
> > > > loop of ATS/PRI requests.
> > >
> > >
> > >
> > > appreciate a clearer explanation here.
> > >
> > >
> > >
> > >
> > > > Fixes: 65c4f0999991 ("intel_iommu: Set accessed and dirty bits during stage-1 translation")
> > > > Signed-off-by: Clement Mathieu--Drif <[[clement.mathieu--drif@bull.com](mailto:clement.mathieu--drif@bull.com)](mailto:[clement.mathieu--drif@bull.com](mailto:clement.mathieu--drif@bull.com))>
> > > > ---
> > > > hw/i386/intel_iommu.c | 2 +-
> > > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > > >
> > > > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > > > index d24ba989bf..56146aafc1 100644
> > > > --- a/hw/i386/intel_iommu.c
> > > > +++ b/hw/i386/intel_iommu.c
> > > > @@ -2066,7 +2066,7 @@ static int vtd_iova_to_fspte(IntelIOMMUState *s, VTDContextEntry *ce,
> > > > return -VTD_FR_FS_PAGING_ENTRY_RSVD;
> > > > }
> > > >
> > > > - if (vtd_is_last_pte(fspte, *fspte_level) && is_write) {
> > > > + if (vtd_is_last_pte(fspte, *fspte_level) && *writes && is_write) {
> > > > flag_ad |= VTD_FS_D;
> > > > }
> > > >
> > >
> > >
> > >
> > > the change looks reasonable.
> >
>
>
> Reviewed-by: Yi Liu <[yi.l.liu@intel.com](mailto:yi.l.liu@intel.com)>
>
>
^ permalink raw reply [flat|nested] 10+ messages in thread
* [PATCH 1/2] intel_iommu: Only set dirty bit when PTE exposes write permission
@ 2026-03-13 14:53 CLEMENT MATHIEU--DRIF
0 siblings, 0 replies; 10+ messages in thread
From: CLEMENT MATHIEU--DRIF @ 2026-03-13 14:53 UTC (permalink / raw)
To: qemu-devel@nongnu.org
Cc: philmd@linaro.org, thuth@redhat.com, pierrick.bouvier@linaro.org,
richard.henderson@linaro.org, clg@redhat.com,
CLEMENT MATHIEU--DRIF, Yi Liu
Setting the dirty bit without the write bit in vIOMMU creates a spoofed
shadow stack PTE that misleads the guest OS. This results in infinite
PRI requests from the device, as no actual write permission is ever
granted - the guest OS treats the PTE as a legitimate shadow stack entry
and considers the permissions already correct.
Fixes: 65c4f0999991 ("intel_iommu: Set accessed and dirty bits during stage-1 translation")
Signed-off-by: Clement Mathieu--Drif <clement.mathieu--drif@bull.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
---
hw/i386/intel_iommu.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index d24ba989bf..56146aafc1 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -2066,7 +2066,7 @@ static int vtd_iova_to_fspte(IntelIOMMUState *s, VTDContextEntry *ce,
return -VTD_FR_FS_PAGING_ENTRY_RSVD;
}
- if (vtd_is_last_pte(fspte, *fspte_level) && is_write) {
+ if (vtd_is_last_pte(fspte, *fspte_level) && *writes && is_write) {
flag_ad |= VTD_FS_D;
}
--
2.53.0
^ permalink raw reply related [flat|nested] 10+ messages in thread
end of thread, other threads:[~2026-03-13 14:54 UTC | newest]
Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-03-10 6:57 [PATCH 1/2] intel_iommu: Only set dirty bit when PTE exposes write permission CLEMENT MATHIEU--DRIF
2026-03-10 6:57 ` [PATCH 2/2] intel_iommu: Always write all the flags passed to vtd_set_flag_in_pte CLEMENT MATHIEU--DRIF
2026-03-11 11:28 ` Duan, Zhenzhong
2026-03-12 3:31 ` Yi Liu
2026-03-12 2:32 ` [PATCH 1/2] intel_iommu: Only set dirty bit when PTE exposes write permission Duan, Zhenzhong
2026-03-12 3:30 ` Yi Liu
2026-03-12 6:20 ` CLEMENT MATHIEU--DRIF
2026-03-13 8:24 ` Yi Liu
2026-03-13 14:46 ` CLEMENT MATHIEU--DRIF
-- strict thread matches above, loose matches on Subject: below --
2026-03-13 14:53 CLEMENT MATHIEU--DRIF
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox