Linux kernel -stable discussions
* [PATCH] iommu/amd/pgtbl: Fix possible race while increase page table level
@ 2025-09-11 12:14 Vasant Hegde
  2025-09-13  6:10 ` Jörg Rödel
  2025-09-18 14:17 ` Jason Gunthorpe
  0 siblings, 2 replies; 6+ messages in thread
From: Vasant Hegde @ 2025-09-11 12:14 UTC (permalink / raw)
  To: iommu, joro
  Cc: will, robin.murphy, suravee.suthikulpanit, Vasant Hegde,
	Alejandro Jimenez, stable, Joao Martins

The AMD IOMMU host page table implementation supports dynamic page table levels
(up to 6 levels), starting with a 3-level configuration that expands based on
IOVA address. The kernel maintains a root pointer and current page table level
to enable proper page table walks in alloc_pte()/fetch_pte() operations.

The IOMMU IOVA allocator initially starts with a 32-bit address space, and
once it is exhausted it switches to 64-bit addresses (the maximum address is
determined by the IOMMU and device DMA capability). To support larger IOVAs,
the AMD IOMMU driver increases the page table level.

But in the unmap path (iommu_v1_unmap_pages()), fetch_pte() reads
pgtable->[root/mode] without taking the lock. So it is possible that, in an
extreme corner case, while increase_address_space() is updating
pgtable->[root/mode], fetch_pte() reads the wrong page table level
(pgtable->mode). It then compares that value with the level encoded in the
page table and returns NULL. This results in the iommu_unmap op failing, and
the upper layer may retry/log a WARN_ON.

CPU 0                                         CPU 1
------                                       ------
map pages                                    unmap pages
alloc_pte() -> increase_address_space()      iommu_v1_unmap_pages() -> fetch_pte()
  pgtable->root = pte (new root value)
                                             READ pgtable->[mode/root]
					       Reads new root, old mode
  Updates mode (pgtable->mode += 1)

Since page table level updates are infrequent and already synchronized with a
spinlock, use a seqcount to enable lock-free reads on the read path.

Reported-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Cc: stable@vger.kernel.org
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
---
 drivers/iommu/amd/amd_iommu_types.h |  1 +
 drivers/iommu/amd/io_pgtable.c      | 25 +++++++++++++++++++++----
 2 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/amd/amd_iommu_types.h b/drivers/iommu/amd/amd_iommu_types.h
index 5219d7ddfdaa..95f63c5f6159 100644
--- a/drivers/iommu/amd/amd_iommu_types.h
+++ b/drivers/iommu/amd/amd_iommu_types.h
@@ -555,6 +555,7 @@ struct gcr3_tbl_info {
 };
 
 struct amd_io_pgtable {
+	seqcount_t		seqcount;	/* Protects root/mode update */
 	struct io_pgtable	pgtbl;
 	int			mode;
 	u64			*root;
diff --git a/drivers/iommu/amd/io_pgtable.c b/drivers/iommu/amd/io_pgtable.c
index a91e71f981ef..70c2f5b1631b 100644
--- a/drivers/iommu/amd/io_pgtable.c
+++ b/drivers/iommu/amd/io_pgtable.c
@@ -17,6 +17,7 @@
 #include <linux/slab.h>
 #include <linux/types.h>
 #include <linux/dma-mapping.h>
+#include <linux/seqlock.h>
 
 #include <asm/barrier.h>
 
@@ -130,8 +131,11 @@ static bool increase_address_space(struct amd_io_pgtable *pgtable,
 
 	*pte = PM_LEVEL_PDE(pgtable->mode, iommu_virt_to_phys(pgtable->root));
 
+	write_seqcount_begin(&pgtable->seqcount);
 	pgtable->root  = pte;
 	pgtable->mode += 1;
+	write_seqcount_end(&pgtable->seqcount);
+
 	amd_iommu_update_and_flush_device_table(domain);
 
 	pte = NULL;
@@ -153,6 +157,7 @@ static u64 *alloc_pte(struct amd_io_pgtable *pgtable,
 {
 	unsigned long last_addr = address + (page_size - 1);
 	struct io_pgtable_cfg *cfg = &pgtable->pgtbl.cfg;
+	unsigned int seqcount;
 	int level, end_lvl;
 	u64 *pte, *page;
 
@@ -170,8 +175,14 @@ static u64 *alloc_pte(struct amd_io_pgtable *pgtable,
 	}
 
 
-	level   = pgtable->mode - 1;
-	pte     = &pgtable->root[PM_LEVEL_INDEX(level, address)];
+	do {
+		seqcount = read_seqcount_begin(&pgtable->seqcount);
+
+		level   = pgtable->mode - 1;
+		pte     = &pgtable->root[PM_LEVEL_INDEX(level, address)];
+	} while (read_seqcount_retry(&pgtable->seqcount, seqcount));
+
+
 	address = PAGE_SIZE_ALIGN(address, page_size);
 	end_lvl = PAGE_SIZE_LEVEL(page_size);
 
@@ -249,6 +260,7 @@ static u64 *fetch_pte(struct amd_io_pgtable *pgtable,
 		      unsigned long *page_size)
 {
 	int level;
+	unsigned int seqcount;
 	u64 *pte;
 
 	*page_size = 0;
@@ -256,8 +268,12 @@ static u64 *fetch_pte(struct amd_io_pgtable *pgtable,
 	if (address > PM_LEVEL_SIZE(pgtable->mode))
 		return NULL;
 
-	level	   =  pgtable->mode - 1;
-	pte	   = &pgtable->root[PM_LEVEL_INDEX(level, address)];
+	do {
+		seqcount = read_seqcount_begin(&pgtable->seqcount);
+		level	   =  pgtable->mode - 1;
+		pte	   = &pgtable->root[PM_LEVEL_INDEX(level, address)];
+	} while (read_seqcount_retry(&pgtable->seqcount, seqcount));
+
 	*page_size =  PTE_LEVEL_PAGE_SIZE(level);
 
 	while (level > 0) {
@@ -541,6 +557,7 @@ static struct io_pgtable *v1_alloc_pgtable(struct io_pgtable_cfg *cfg, void *coo
 	if (!pgtable->root)
 		return NULL;
 	pgtable->mode = PAGE_MODE_3_LEVEL;
+	seqcount_init(&pgtable->seqcount);
 
 	cfg->pgsize_bitmap  = amd_iommu_pgsize_bitmap;
 	cfg->ias            = IOMMU_IN_ADDR_BIT_SIZE;
-- 
2.31.1



* Re: [PATCH] iommu/amd/pgtbl: Fix possible race while increase page table level
  2025-09-11 12:14 [PATCH] iommu/amd/pgtbl: Fix possible race while increase page table level Vasant Hegde
@ 2025-09-13  6:10 ` Jörg Rödel
  2025-09-13  6:18   ` Vasant Hegde
  2025-09-18 14:17 ` Jason Gunthorpe
  1 sibling, 1 reply; 6+ messages in thread
From: Jörg Rödel @ 2025-09-13  6:10 UTC (permalink / raw)
  To: Vasant Hegde
  Cc: iommu, will, robin.murphy, suravee.suthikulpanit,
	Alejandro Jimenez, stable, Joao Martins

Hey Vasant,

On Thu, Sep 11, 2025 at 12:14:15PM +0000, Vasant Hegde wrote:
> Reported-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
> Cc: stable@vger.kernel.org
> Cc: Joao Martins <joao.m.martins@oracle.com>
> Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
> Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>

Can you please send again with a Fixes tag?


	Joerg


* Re: [PATCH] iommu/amd/pgtbl: Fix possible race while increase page table level
  2025-09-13  6:10 ` Jörg Rödel
@ 2025-09-13  6:18   ` Vasant Hegde
  0 siblings, 0 replies; 6+ messages in thread
From: Vasant Hegde @ 2025-09-13  6:18 UTC (permalink / raw)
  To: Jörg Rödel
  Cc: iommu, will, robin.murphy, suravee.suthikulpanit,
	Alejandro Jimenez, stable, Joao Martins

Joerg,

On 9/13/2025 11:40 AM, Jörg Rödel wrote:
> Hey Vasant,
> 
> On Thu, Sep 11, 2025 at 12:14:15PM +0000, Vasant Hegde wrote:
>> Reported-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
>> Cc: stable@vger.kernel.org
>> Cc: Joao Martins <joao.m.martins@oracle.com>
>> Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
>> Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
> 
> Can you please send again with a Fixes tag?

I was not sure which commit to pick for the "Fixes" tag, as this issue has
probably existed since increase_address_space() was introduced, but a lot has
changed since then. So I just Cc'ed stable.

Looking into the git history, I think the commit below is the most
appropriate. I will update and resend the patch.

commit 754265bcab78a9014f0f99cd35e0d610fcd7dfa7
Author: Joerg Roedel <jroedel@suse.de>
Date:   Fri Sep 6 10:39:54 2019 +0200

    iommu/amd: Fix race in increase_address_space()

-Vasant



* Re: [PATCH] iommu/amd/pgtbl: Fix possible race while increase page table level
  2025-09-11 12:14 [PATCH] iommu/amd/pgtbl: Fix possible race while increase page table level Vasant Hegde
  2025-09-13  6:10 ` Jörg Rödel
@ 2025-09-18 14:17 ` Jason Gunthorpe
  2025-09-22 10:05   ` Vasant Hegde
  1 sibling, 1 reply; 6+ messages in thread
From: Jason Gunthorpe @ 2025-09-18 14:17 UTC (permalink / raw)
  To: Vasant Hegde
  Cc: iommu, joro, will, robin.murphy, suravee.suthikulpanit,
	Alejandro Jimenez, stable, Joao Martins

On Thu, Sep 11, 2025 at 12:14:15PM +0000, Vasant Hegde wrote:

> The IOMMU IOVA allocator initially starts with a 32-bit address space, and
> once it is exhausted it switches to 64-bit addresses (the maximum address is
> determined by the IOMMU and device DMA capability). To support larger IOVAs,
> the AMD IOMMU driver increases the page table level.

Is this the case? I thought I saw something that the allocator is
starting from high addresses?
 
> But in the unmap path (iommu_v1_unmap_pages()), fetch_pte() reads
> pgtable->[root/mode] without taking the lock. So it is possible that, in an
> extreme corner case, while increase_address_space() is updating
> pgtable->[root/mode], fetch_pte() reads the wrong page table level
> (pgtable->mode). It then compares that value with the level encoded in the
> page table and returns NULL. This results in the iommu_unmap op failing, and
> the upper layer may retry/log a WARN_ON.

Yep, definitely a bug. I spotted it already and fixed it in iommupt;
you can read about it here:

https://lore.kernel.org/linux-iommu/13-v5-116c4948af3d+68091-iommu_pt_jgg@nvidia.com/

> CPU 0                                         CPU 1
> ------                                       ------
> map pages                                    unmap pages
> alloc_pte() -> increase_address_space()      iommu_v1_unmap_pages() -> fetch_pte()
>   pgtable->root = pte (new root value)
>                                              READ pgtable->[mode/root]
> 					       Reads new root, old mode
>   Updates mode (pgtable->mode += 1)

This doesn't solve the whole problem. Yes, reading the two values
coherently is important, but we must also serialize parallel maps such
that map only returns once the IOMMU is actually programmed with the new
root.

I don't see that in this fix.

IMHO, unless someone is actually hitting this, I'd leave it and focus on
merging iommupt, which fully fixes this without adding any locks to the
fast path.

Jason


* Re: [PATCH] iommu/amd/pgtbl: Fix possible race while increase page table level
  2025-09-18 14:17 ` Jason Gunthorpe
@ 2025-09-22 10:05   ` Vasant Hegde
  2025-09-22 13:02     ` Jason Gunthorpe
  0 siblings, 1 reply; 6+ messages in thread
From: Vasant Hegde @ 2025-09-22 10:05 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: iommu, joro, will, robin.murphy, suravee.suthikulpanit,
	Alejandro Jimenez, stable, Joao Martins

Jason,


On 9/18/2025 7:47 PM, Jason Gunthorpe wrote:
> On Thu, Sep 11, 2025 at 12:14:15PM +0000, Vasant Hegde wrote:
> 
>> The IOMMU IOVA allocator initially starts with a 32-bit address space, and
>> once it is exhausted it switches to 64-bit addresses (the maximum address is
>> determined by the IOMMU and device DMA capability). To support larger IOVAs,
>> the AMD IOMMU driver increases the page table level.
> 
> Is this the case? I thought I saw something that the allocator is
> starting from high addresses?

Right. By default we start with 32-bit addresses (allocating from 0xffff_ffff
down to 0x0), and once that space is full we go to 64-bit (assuming the device
supports 64-bit). At that point increase_address_space() gets called.


>  
>> But in the unmap path (iommu_v1_unmap_pages()), fetch_pte() reads
>> pgtable->[root/mode] without taking the lock. So it is possible that, in an
>> extreme corner case, while increase_address_space() is updating
>> pgtable->[root/mode], fetch_pte() reads the wrong page table level
>> (pgtable->mode). It then compares that value with the level encoded in the
>> page table and returns NULL. This results in the iommu_unmap op failing, and
>> the upper layer may retry/log a WARN_ON.
> 
> Yep, definately a bug, I spotted it already and fixed it in iommupt,
> you can read about it here:
> 
> https://lore.kernel.org/linux-iommu/13-v5-116c4948af3d+68091-iommu_pt_jgg@nvidia.com/

Nice. Will take a look this week.

> 
>> CPU 0                                         CPU 1
>> ------                                       ------
>> map pages                                    unmap pages
>> alloc_pte() -> increase_address_space()      iommu_v1_unmap_pages() -> fetch_pte()
>>   pgtable->root = pte (new root value)
>>                                              READ pgtable->[mode/root]
>> 					       Reads new root, old mode
>>   Updates mode (pgtable->mode += 1)
> 
> This doesn't solve the whole problem, yes reading the two values
> coherently is important but we must also serialize parallel map such
> that map only returns if the IOMMU is actually programmed with the new
> roots.
> 
> I don't see that in this fix.
> 
> IMHO, unless someone is actually hitting this, I'd leave it and focus on
> merging iommupt, which fully fixes this without adding any locks to the
> fast path.

Unfortunately, yes. We had a customer reporting this issue.

-Vasant



* Re: [PATCH] iommu/amd/pgtbl: Fix possible race while increase page table level
  2025-09-22 10:05   ` Vasant Hegde
@ 2025-09-22 13:02     ` Jason Gunthorpe
  0 siblings, 0 replies; 6+ messages in thread
From: Jason Gunthorpe @ 2025-09-22 13:02 UTC (permalink / raw)
  To: Vasant Hegde
  Cc: iommu, joro, will, robin.murphy, suravee.suthikulpanit,
	Alejandro Jimenez, stable, Joao Martins

On Mon, Sep 22, 2025 at 03:35:07PM +0530, Vasant Hegde wrote:
> > IMHO, unless someone is actually hitting this, I'd leave it and focus on
> > merging iommupt, which fully fixes this without adding any locks to the
> > fast path.
> 
> Unfortunately, yes. We had a customer reporting this issue.

If a customer is actually hitting this then you definitely need to
solve the whole problem, including the map race.

I guess adding a lock makes it easier to deal with, but this:

+	write_seqcount_begin(&pgtable->seqcount);
 	pgtable->root  = pte;
 	pgtable->mode += 1;
+	write_seqcount_end(&pgtable->seqcount);
+
 	amd_iommu_update_and_flush_device_table(domain);

Is out of order.

The DTE has to be updated and caches flushed *before* writing the new
information to the pgtable for other threads to observe.

You can look at what I did, but broadly you have to feed the new top
as a function argument through to set_dte:
 
 static void set_dte_entry(struct amd_iommu *iommu,
-			  struct iommu_dev_data *dev_data)
+			  struct iommu_dev_data *dev_data,
+			  phys_addr_t top_paddr, unsigned int top_level)
 {
 	u16 domid;

So the above can be ordered correctly.

You may also want to think about just disabling this optimization and
always starting with a 6-level table.

Jason


end of thread, other threads:[~2025-09-22 13:02 UTC | newest]

Thread overview: 6+ messages
2025-09-11 12:14 [PATCH] iommu/amd/pgtbl: Fix possible race while increase page table level Vasant Hegde
2025-09-13  6:10 ` Jörg Rödel
2025-09-13  6:18   ` Vasant Hegde
2025-09-18 14:17 ` Jason Gunthorpe
2025-09-22 10:05   ` Vasant Hegde
2025-09-22 13:02     ` Jason Gunthorpe
