public inbox for linux-arm-kernel@lists.infradead.org
 help / color / mirror / Atom feed
* [PATCH] mm/arm: pgtable: remove young bit check for pte_valid_user
@ 2026-04-09 12:54 Brian Ruley
  2026-04-09 13:56 ` Will Deacon
  2026-04-09 14:15 ` [PATCH] " Russell King (Oracle)
  0 siblings, 2 replies; 10+ messages in thread
From: Brian Ruley @ 2026-04-09 12:54 UTC (permalink / raw)
  To: Russell King, Steve Capper, Will Deacon
  Cc: Brian Ruley, Russell King, linux-arm-kernel, linux-kernel

Fixes cache desync, which can cause undefined instruction,
translation and permission faults under heavy memory use.

This is an old bug introduced in commit 1971188aa196 ("ARM: 7985/1: mm:
implement pte_accessible for faulting mappings"), which included a check
for the young bit of a PTE. The underlying assumption was that old pages
are not cached, therefore, `__sync_icache_dcache' could be skipped
entirely.

However, under extreme memory pressure, page migrations happen
frequently and the assumption of uncached "old" pages does not hold.
Especially for systems that do not have swap, the migrated pages are
unequivocally marked old. This presents a problem, as it is possible
for the original page to be immediately mapped to another VA that
happens to share the same cache index in VIPT I-cache (we found this
bug on Cortex-A9). Without cache invalidation, the CPU will see the
old mapping whose physical page can now be used for a different
purpose, as illustrated below:

                Core                      Physical Memory
  +-------------------------------+     +------------------+
  | TLB                           |     |                  |
  |  VA_A 0xb6e6f -> pfn_q        |     | pfn_q: code      |
  +-------------------------------+     +------------------+
  | I-cache                       |
  |  set[VA_A bits] | tag=pfn_q   |
  +-------------------------------+

migrate (kcompactd):
  1. copy pfn_q --> pfn_r
  2. free pfn_q
  3. pte: VA_a -> pfn_r
  4. pte_mkold(pte) --> !young
  5. ICIALLUIS skipped (because !young)

pfn_src reused (OOM pressure):
  pte: VA_B -> pfn_q (different code)

bug:
                Core                      Physical Memory
  +-------------------------------+     +------------------+
  | TLB (empty)                   |     | pfn_r: old code  |
  +-------------------------------+     | pfn_q: new code  |
  | I-cache                       |     +------------------+
  |  set[VA_A bits] | tag=pfn_q   |<--- wrong instructions
  +-------------------------------+

This was verified on ba16-based board (i.MX6Quad/Dual, Cortex-A9) by
instrumenting the migration code to track recently migrated pages in a
ring buffer and then dumping them in the undefined instruction fault
handler. The bug can be triggered with `stress-ng':

  stress-ng --vm 4 --vm-bytes 2G --vm-method zero-one --verify

Note that the system we tested on has only 2G of memory, so the test
triggered the OOM-killer in our case.

Fixes: 1971188aa196 ("ARM: 7985/1: mm: implement pte_accessible for faulting mappings")
Signed-off-by: Brian Ruley <brian.ruley@gehealthcare.com>
---
 arch/arm/include/asm/pgtable.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/arm/include/asm/pgtable.h b/arch/arm/include/asm/pgtable.h
index 6fa9acd6a7f5..e3a5b4a9a65f 100644
--- a/arch/arm/include/asm/pgtable.h
+++ b/arch/arm/include/asm/pgtable.h
@@ -185,7 +185,7 @@ static inline pte_t *pmd_page_vaddr(pmd_t pmd)
 #define pte_exec(pte)		(pte_isclear((pte), L_PTE_XN))
 
 #define pte_valid_user(pte)	\
-	(pte_valid(pte) && pte_isset((pte), L_PTE_USER) && pte_young(pte))
+	(pte_valid(pte) && pte_isset((pte), L_PTE_USER))
 
 static inline bool pte_access_permitted(pte_t pte, bool write)
 {
-- 
2.47.3



^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH] mm/arm: pgtable: remove young bit check for pte_valid_user
  2026-04-09 12:54 [PATCH] mm/arm: pgtable: remove young bit check for pte_valid_user Brian Ruley
@ 2026-04-09 13:56 ` Will Deacon
  2026-04-09 14:21   ` Russell King (Oracle)
                     ` (2 more replies)
  2026-04-09 14:15 ` [PATCH] " Russell King (Oracle)
  1 sibling, 3 replies; 10+ messages in thread
From: Will Deacon @ 2026-04-09 13:56 UTC (permalink / raw)
  To: Brian Ruley
  Cc: Russell King, Steve Capper, Russell King, linux-arm-kernel,
	linux-kernel

On Thu, Apr 09, 2026 at 03:54:45PM +0300, Brian Ruley wrote:
> Fixes cache desync, which can cause undefined instruction,
> translation and permission faults under heavy memory use.
> 
> This is an old bug introduced in commit 1971188aa196 ("ARM: 7985/1: mm:
> implement pte_accessible for faulting mappings"), which included a check
> for the young bit of a PTE. The underlying assumption was that old pages
> are not cached, therefore, `__sync_icache_dcache' could be skipped
> entirely.
> 
> However, under extreme memory pressure, page migrations happen
> frequently and the assumption of uncached "old" pages does not hold.
> Especially for systems that do not have swap, the migrated pages are
> unequivocally marked old. This presents a problem, as it is possible
> for the original page to be immediately mapped to another VA that
> happens to share the same cache index in VIPT I-cache (we found this
> bug on Cortex-A9). Without cache invalidation, the CPU will see the
> old mapping whose physical page can now be used for a different
> purpose, as illustrated below:
> 
>                 Core                      Physical Memory
>   +-------------------------------+     +------------------+
>   | TLB                           |     |                  |
>   |  VA_A 0xb6e6f -> pfn_q        |     | pfn_q: code      |
>   +-------------------------------+     +------------------+
>   | I-cache                       |
>   |  set[VA_A bits] | tag=pfn_q   |
>   +-------------------------------+
> 
> migrate (kcompactd):
>   1. copy pfn_q --> pfn_r
>   2. free pfn_q
>   3. pte: VA_a -> pfn_r
>   4. pte_mkold(pte) --> !young
>   5. ICIALLUIS skipped (because !young)
> 
> pfn_src reused (OOM pressure):
>   pte: VA_B -> pfn_q (different code)
> 
> bug:
>                 Core                      Physical Memory
>   +-------------------------------+     +------------------+
>   | TLB (empty)                   |     | pfn_r: old code  |
>   +-------------------------------+     | pfn_q: new code  |
>   | I-cache                       |     +------------------+
>   |  set[VA_A bits] | tag=pfn_q   |<--- wrong instructions
>   +-------------------------------+

(nit: Do you have pfn_r and pfn_q mixed up in the "Physical Memory" box?)

> This was verified on ba16-based board (i.MX6Quad/Dual, Cortex-A9) by
> instrumenting the migration code to track recently migrated pages in a
> ring buffer and then dumping them in the undefined instruction fault
> handler. The bug can be triggered with `stress-ng':
> 
>   stress-ng --vm 4 --vm-bytes 2G --vm-method zero-one --verify
> 
> Note that the system we tested on has only 2G of memory, so the test
> triggered the OOM-killer in our case.
> 
> Fixes: 1971188aa196 ("ARM: 7985/1: mm: implement pte_accessible for faulting mappings")
> Signed-off-by: Brian Ruley <brian.ruley@gehealthcare.com>
> ---
>  arch/arm/include/asm/pgtable.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/arm/include/asm/pgtable.h b/arch/arm/include/asm/pgtable.h
> index 6fa9acd6a7f5..e3a5b4a9a65f 100644
> --- a/arch/arm/include/asm/pgtable.h
> +++ b/arch/arm/include/asm/pgtable.h
> @@ -185,7 +185,7 @@ static inline pte_t *pmd_page_vaddr(pmd_t pmd)
>  #define pte_exec(pte)		(pte_isclear((pte), L_PTE_XN))
>  
>  #define pte_valid_user(pte)	\
> -	(pte_valid(pte) && pte_isset((pte), L_PTE_USER) && pte_young(pte))
> +	(pte_valid(pte) && pte_isset((pte), L_PTE_USER))

This patch is from twelve years ago, so please forgive me for having
forgotten all of the details. However, my recollection is that when using
the classic/!lpae format (as you will be on Cortex-A9), page aging is
implemented by using invalid (translation faulting) ptes for 'old'
mappings.

So in the case you describe, we may well elide the I-cache maintenance,
but won't we also put down an invalid pte? If we later take a fault
on that, we should then perform the cache maintenance when installing
the young entry (via ptep_set_access_flags()). The more interesting part
is probably when the mapping for 'VA_B' is installed to map 'pfn_q' but,
again, I would've expected the cache maintenance to happen just prior to
installing the valid (young) mapping.

Please can you help me to understand the problem better?

Will


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] mm/arm: pgtable: remove young bit check for pte_valid_user
  2026-04-09 12:54 [PATCH] mm/arm: pgtable: remove young bit check for pte_valid_user Brian Ruley
  2026-04-09 13:56 ` Will Deacon
@ 2026-04-09 14:15 ` Russell King (Oracle)
  1 sibling, 0 replies; 10+ messages in thread
From: Russell King (Oracle) @ 2026-04-09 14:15 UTC (permalink / raw)
  To: Brian Ruley; +Cc: Steve Capper, Will Deacon, linux-arm-kernel, linux-kernel

On Thu, Apr 09, 2026 at 03:54:45PM +0300, Brian Ruley wrote:
> Fixes cache desync, which can cause undefined instruction,
> translation and permission faults under heavy memory use.
> 
> This is an old bug introduced in commit 1971188aa196 ("ARM: 7985/1: mm:
> implement pte_accessible for faulting mappings"), which included a check
> for the young bit of a PTE. The underlying assumption was that old pages
> are not cached, therefore, `__sync_icache_dcache' could be skipped
> entirely.
> 
> However, under extreme memory pressure, page migrations happen
> frequently and the assumption of uncached "old" pages does not hold.

The first thing to point out is that PTEs that are marked as "old" are
not mapped into userspace. They need to take a fault to be marked
young, which will involve another call to set_pte(), at which point
pte_valid_user() should return true. Your assumption that this is
about "old" pages being uncached is totally incorrect - there has
never been such an assumption.

> Especially for systems that do not have swap, the migrated pages are
> unequivocally marked old. This presents a problem, as it is possible
> for the original page to be immediately mapped to another VA that
> happens to share the same cache index in VIPT I-cache (we found this
> bug on Cortex-A9). Without cache invalidation, the CPU will see the
> old mapping whose physical page can now be used for a different
> purpose, as illustrated below:



> 
>                 Core                      Physical Memory
>   +-------------------------------+     +------------------+
>   | TLB                           |     |                  |
>   |  VA_A 0xb6e6f -> pfn_q        |     | pfn_q: code      |
>   +-------------------------------+     +------------------+
>   | I-cache                       |
>   |  set[VA_A bits] | tag=pfn_q   |
>   +-------------------------------+
> 
> migrate (kcompactd):
>   1. copy pfn_q --> pfn_r
>   2. free pfn_q
>   3. pte: VA_a -> pfn_r
>   4. pte_mkold(pte) --> !young
>   5. ICIALLUIS skipped (because !young)

At this point, the hardware PTE will be set to zero and the TLB
invalidated. This _should_ mean that any future access should result
in a page permission fault being raised. That will then provoke the
MM to mark the PTE young, which will then result in set_ptes()
being called, and thus __sync_icache_dcache() will be called for
the _neew_ pte (which will be for pfn_r.)

> 
> pfn_src reused (OOM pressure):
>   pte: VA_B -> pfn_q (different code)
> 
> bug:
>                 Core                      Physical Memory
>   +-------------------------------+     +------------------+
>   | TLB (empty)                   |     | pfn_r: old code  |
>   +-------------------------------+     | pfn_q: new code  |
>   | I-cache                       |     +------------------+
>   |  set[VA_A bits] | tag=pfn_q   |<--- wrong instructions
>   +-------------------------------+
> 
> This was verified on ba16-based board (i.MX6Quad/Dual, Cortex-A9) by
> instrumenting the migration code to track recently migrated pages in a
> ring buffer and then dumping them in the undefined instruction fault
> handler. The bug can be triggered with `stress-ng':
> 
>   stress-ng --vm 4 --vm-bytes 2G --vm-method zero-one --verify
> 
> Note that the system we tested on has only 2G of memory, so the test
> triggered the OOM-killer in our case.

So you're saying that stress-ng doesn't reproduce this bug but triggers
the OOM-killer... confused.

Cortex-A9 has been around for a long time - I have systems that still
use Cortex-A9 every day without swap, and they have been rock solid.

If there was a bug like this, I would've expected to see problems, but
I'm not... so, I'm not convinced there's a problem here.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] mm/arm: pgtable: remove young bit check for pte_valid_user
  2026-04-09 13:56 ` Will Deacon
@ 2026-04-09 14:21   ` Russell King (Oracle)
  2026-04-09 14:43   ` Russell King (Oracle)
  2026-04-09 15:17   ` Brian Ruley
  2 siblings, 0 replies; 10+ messages in thread
From: Russell King (Oracle) @ 2026-04-09 14:21 UTC (permalink / raw)
  To: Will Deacon; +Cc: Brian Ruley, Steve Capper, linux-arm-kernel, linux-kernel

On Thu, Apr 09, 2026 at 02:56:53PM +0100, Will Deacon wrote:
> On Thu, Apr 09, 2026 at 03:54:45PM +0300, Brian Ruley wrote:
> > Fixes cache desync, which can cause undefined instruction,
> > translation and permission faults under heavy memory use.
> > 
> > This is an old bug introduced in commit 1971188aa196 ("ARM: 7985/1: mm:
> > implement pte_accessible for faulting mappings"), which included a check
> > for the young bit of a PTE. The underlying assumption was that old pages
> > are not cached, therefore, `__sync_icache_dcache' could be skipped
> > entirely.
> > 
> > However, under extreme memory pressure, page migrations happen
> > frequently and the assumption of uncached "old" pages does not hold.
> > Especially for systems that do not have swap, the migrated pages are
> > unequivocally marked old. This presents a problem, as it is possible
> > for the original page to be immediately mapped to another VA that
> > happens to share the same cache index in VIPT I-cache (we found this
> > bug on Cortex-A9). Without cache invalidation, the CPU will see the
> > old mapping whose physical page can now be used for a different
> > purpose, as illustrated below:
> > 
> >                 Core                      Physical Memory
> >   +-------------------------------+     +------------------+
> >   | TLB                           |     |                  |
> >   |  VA_A 0xb6e6f -> pfn_q        |     | pfn_q: code      |
> >   +-------------------------------+     +------------------+
> >   | I-cache                       |
> >   |  set[VA_A bits] | tag=pfn_q   |
> >   +-------------------------------+
> > 
> > migrate (kcompactd):
> >   1. copy pfn_q --> pfn_r
> >   2. free pfn_q
> >   3. pte: VA_a -> pfn_r
> >   4. pte_mkold(pte) --> !young
> >   5. ICIALLUIS skipped (because !young)
> > 
> > pfn_src reused (OOM pressure):
> >   pte: VA_B -> pfn_q (different code)
> > 
> > bug:
> >                 Core                      Physical Memory
> >   +-------------------------------+     +------------------+
> >   | TLB (empty)                   |     | pfn_r: old code  |
> >   +-------------------------------+     | pfn_q: new code  |
> >   | I-cache                       |     +------------------+
> >   |  set[VA_A bits] | tag=pfn_q   |<--- wrong instructions
> >   +-------------------------------+
> 
> (nit: Do you have pfn_r and pfn_q mixed up in the "Physical Memory" box?)
> 
> > This was verified on ba16-based board (i.MX6Quad/Dual, Cortex-A9) by
> > instrumenting the migration code to track recently migrated pages in a
> > ring buffer and then dumping them in the undefined instruction fault
> > handler. The bug can be triggered with `stress-ng':
> > 
> >   stress-ng --vm 4 --vm-bytes 2G --vm-method zero-one --verify
> > 
> > Note that the system we tested on has only 2G of memory, so the test
> > triggered the OOM-killer in our case.
> > 
> > Fixes: 1971188aa196 ("ARM: 7985/1: mm: implement pte_accessible for faulting mappings")
> > Signed-off-by: Brian Ruley <brian.ruley@gehealthcare.com>
> > ---
> >  arch/arm/include/asm/pgtable.h | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/arch/arm/include/asm/pgtable.h b/arch/arm/include/asm/pgtable.h
> > index 6fa9acd6a7f5..e3a5b4a9a65f 100644
> > --- a/arch/arm/include/asm/pgtable.h
> > +++ b/arch/arm/include/asm/pgtable.h
> > @@ -185,7 +185,7 @@ static inline pte_t *pmd_page_vaddr(pmd_t pmd)
> >  #define pte_exec(pte)		(pte_isclear((pte), L_PTE_XN))
> >  
> >  #define pte_valid_user(pte)	\
> > -	(pte_valid(pte) && pte_isset((pte), L_PTE_USER) && pte_young(pte))
> > +	(pte_valid(pte) && pte_isset((pte), L_PTE_USER))
> 
> This patch is from twelve years ago, so please forgive me for having
> forgotten all of the details. However, my recollection is that when using
> the classic/!lpae format (as you will be on Cortex-A9), page aging is
> implemented by using invalid (translation faulting) ptes for 'old'
> mappings.

It is.

> So in the case you describe, we may well elide the I-cache maintenance,
> but won't we also put down an invalid pte?

Correct.

> If we later take a fault
> on that, we should then perform the cache maintenance when installing
> the young entry (via ptep_set_access_flags()).

Correct again.

> The more interesting part
> is probably when the mapping for 'VA_B' is installed to map 'pfn_q' but,
> again, I would've expected the cache maintenance to happen just prior to
> installing the valid (young) mapping.

Also correct - for the new PTE to become accessible in userspace, we
would need to establish a young PTE, which will result in set_ptes()
being called, and that should trigger __flush_icache_all() which will
flush the _entire_ instruction cache, which will remove any stale
entries for the old mapping that is no longer accessible.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] mm/arm: pgtable: remove young bit check for pte_valid_user
  2026-04-09 13:56 ` Will Deacon
  2026-04-09 14:21   ` Russell King (Oracle)
@ 2026-04-09 14:43   ` Russell King (Oracle)
  2026-04-09 15:17   ` Brian Ruley
  2 siblings, 0 replies; 10+ messages in thread
From: Russell King (Oracle) @ 2026-04-09 14:43 UTC (permalink / raw)
  To: Will Deacon; +Cc: Brian Ruley, Steve Capper, linux-arm-kernel, linux-kernel

On Thu, Apr 09, 2026 at 02:56:53PM +0100, Will Deacon wrote:
> On Thu, Apr 09, 2026 at 03:54:45PM +0300, Brian Ruley wrote:
> > Fixes cache desync, which can cause undefined instruction,
> > translation and permission faults under heavy memory use.
> > 
> > This is an old bug introduced in commit 1971188aa196 ("ARM: 7985/1: mm:
> > implement pte_accessible for faulting mappings"), which included a check
> > for the young bit of a PTE. The underlying assumption was that old pages
> > are not cached, therefore, `__sync_icache_dcache' could be skipped
> > entirely.
> > 
> > However, under extreme memory pressure, page migrations happen
> > frequently and the assumption of uncached "old" pages does not hold.
> > Especially for systems that do not have swap, the migrated pages are
> > unequivocally marked old. This presents a problem, as it is possible
> > for the original page to be immediately mapped to another VA that
> > happens to share the same cache index in VIPT I-cache (we found this
> > bug on Cortex-A9). Without cache invalidation, the CPU will see the
> > old mapping whose physical page can now be used for a different
> > purpose, as illustrated below:
> > 
> >                 Core                      Physical Memory
> >   +-------------------------------+     +------------------+
> >   | TLB                           |     |                  |
> >   |  VA_A 0xb6e6f -> pfn_q        |     | pfn_q: code      |
> >   +-------------------------------+     +------------------+
> >   | I-cache                       |
> >   |  set[VA_A bits] | tag=pfn_q   |
> >   +-------------------------------+
> > 
> > migrate (kcompactd):
> >   1. copy pfn_q --> pfn_r
> >   2. free pfn_q
> >   3. pte: VA_a -> pfn_r
> >   4. pte_mkold(pte) --> !young
> >   5. ICIALLUIS skipped (because !young)
> > 
> > pfn_src reused (OOM pressure):
> >   pte: VA_B -> pfn_q (different code)
> > 
> > bug:
> >                 Core                      Physical Memory
> >   +-------------------------------+     +------------------+
> >   | TLB (empty)                   |     | pfn_r: old code  |
> >   +-------------------------------+     | pfn_q: new code  |
> >   | I-cache                       |     +------------------+
> >   |  set[VA_A bits] | tag=pfn_q   |<--- wrong instructions
> >   +-------------------------------+
> 
> (nit: Do you have pfn_r and pfn_q mixed up in the "Physical Memory" box?)

I don't think so. pfn_r contains the code that _was_ in pfn_q before
the migration happened (the migration copied pfn_q to pfn_r).

Then, a short time later, the page for pfn_r is reallocated, with new
code placed into it, and then a new PTE is established for pfn_q -
however, this should be a young PTE (which will be necessary for the
mapping to be visible to userspace), and thus should cause
__sync_icache_dcache() to be called, resulting in __flush_icache_all()
if it is an executable mapping.

If it isn't an executable mapping (because pfn_q isn't code) then we
will skip the __flush_icache_all() for the new mapping.

However, the I-cache for the old PTE that was pfn_r and now is pfn_q
will not be present in the physical page tables. An attempt to execute
code from that mapping should fault, causing the MM to mark that PTE
young, and, as it's executable, it should result in
__flush_icache_all() being called.

Like you, I can't see any issue here.

I think we need to know exactly what happened to the old PTE entry
and the newer PTE entry that was subsequently established on the
lead-up to the undefined instruction exception.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] mm/arm: pgtable: remove young bit check for pte_valid_user
  2026-04-09 13:56 ` Will Deacon
  2026-04-09 14:21   ` Russell King (Oracle)
  2026-04-09 14:43   ` Russell King (Oracle)
@ 2026-04-09 15:17   ` Brian Ruley
  2026-04-09 16:00     ` Russell King (Oracle)
  2 siblings, 1 reply; 10+ messages in thread
From: Brian Ruley @ 2026-04-09 15:17 UTC (permalink / raw)
  To: Will Deacon
  Cc: Russell King, Steve Capper, Russell King, linux-arm-kernel,
	linux-kernel

On Apr 09, Will Deacon wrote:
> 
> On Thu, Apr 09, 2026 at 03:54:45PM +0300, Brian Ruley wrote:
> > Fixes cache desync, which can cause undefined instruction,
> > translation and permission faults under heavy memory use.
> >
> > This is an old bug introduced in commit 1971188aa196 ("ARM: 7985/1: mm:
> > implement pte_accessible for faulting mappings"), which included a check
> > for the young bit of a PTE. The underlying assumption was that old pages
> > are not cached, therefore, `__sync_icache_dcache' could be skipped
> > entirely.
> >
> > However, under extreme memory pressure, page migrations happen
> > frequently and the assumption of uncached "old" pages does not hold.
> > Especially for systems that do not have swap, the migrated pages are
> > unequivocally marked old. This presents a problem, as it is possible
> > for the original page to be immediately mapped to another VA that
> > happens to share the same cache index in VIPT I-cache (we found this
> > bug on Cortex-A9). Without cache invalidation, the CPU will see the
> > old mapping whose physical page can now be used for a different
> > purpose, as illustrated below:
> >
> >                 Core                      Physical Memory
> >   +-------------------------------+     +------------------+
> >   | TLB                           |     |                  |
> >   |  VA_A 0xb6e6f -> pfn_q        |     | pfn_q: code      |
> >   +-------------------------------+     +------------------+
> >   | I-cache                       |
> >   |  set[VA_A bits] | tag=pfn_q   |
> >   +-------------------------------+
> >
> > migrate (kcompactd):
> >   1. copy pfn_q --> pfn_r
> >   2. free pfn_q
> >   3. pte: VA_a -> pfn_r
> >   4. pte_mkold(pte) --> !young
> >   5. ICIALLUIS skipped (because !young)
> >
> > pfn_src reused (OOM pressure):
> >   pte: VA_B -> pfn_q (different code)
> >
> > bug:
> >                 Core                      Physical Memory
> >   +-------------------------------+     +------------------+
> >   | TLB (empty)                   |     | pfn_r: old code  |
> >   +-------------------------------+     | pfn_q: new code  |
> >   | I-cache                       |     +------------------+
> >   |  set[VA_A bits] | tag=pfn_q   |<--- wrong instructions
> >   +-------------------------------+
> 
> (nit: Do you have pfn_r and pfn_q mixed up in the "Physical Memory" box?)

No, I don't think so. The intent was to show that whatever was copied
from pfn_q is now in pfn_r while the old page (pfn_q) is now mapped to
VA_B with new code/data. Maybe a classic case of poor naming on my part
here. :-)

> 
> > This was verified on ba16-based board (i.MX6Quad/Dual, Cortex-A9) by
> > instrumenting the migration code to track recently migrated pages in a
> > ring buffer and then dumping them in the undefined instruction fault
> > handler. The bug can be triggered with `stress-ng':
> >
> >   stress-ng --vm 4 --vm-bytes 2G --vm-method zero-one --verify
> >
> > Note that the system we tested on has only 2G of memory, so the test
> > triggered the OOM-killer in our case.
> >
> > Fixes: 1971188aa196 ("ARM: 7985/1: mm: implement pte_accessible for faulting mappings")
> > Signed-off-by: Brian Ruley <brian.ruley@gehealthcare.com>
> > ---
> >  arch/arm/include/asm/pgtable.h | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/arch/arm/include/asm/pgtable.h b/arch/arm/include/asm/pgtable.h
> > index 6fa9acd6a7f5..e3a5b4a9a65f 100644
> > --- a/arch/arm/include/asm/pgtable.h
> > +++ b/arch/arm/include/asm/pgtable.h
> > @@ -185,7 +185,7 @@ static inline pte_t *pmd_page_vaddr(pmd_t pmd)
> >  #define pte_exec(pte)                (pte_isclear((pte), L_PTE_XN))
> >
> >  #define pte_valid_user(pte)  \
> > -     (pte_valid(pte) && pte_isset((pte), L_PTE_USER) && pte_young(pte))
> > +     (pte_valid(pte) && pte_isset((pte), L_PTE_USER))
> 
> This patch is from twelve years ago, so please forgive me for having
> forgotten all of the details. However, my recollection is that when using
> the classic/!lpae format (as you will be on Cortex-A9), page aging is
> implemented by using invalid (translation faulting) ptes for 'old'
> mappings.
> 
> So in the case you describe, we may well elide the I-cache maintenance,
> but won't we also put down an invalid pte? If we later take a fault
> on that, we should then perform the cache maintenance when installing
> the young entry (via ptep_set_access_flags()). The more interesting part
> is probably when the mapping for 'VA_B' is installed to map 'pfn_q' but,
> again, I would've expected the cache maintenance to happen just prior to
> installing the valid (young) mapping.
> 
> Please can you help me to understand the problem better?
> 
> Will

Hi,

I am, by no means, a domain expert either so I'll be deferring to your
judgement. That said, I believe what you said is correct and the
expectation is that we will later fault and then flush the cache and
fault.

However, in the case I describe, if VA_B is mapped immediately to pfn_q
after it been has unmapped and freed for VA_A, then it's quite possible
that the page is still indexed in the cache. The hypothesis is that if
VA_A and VA_B land in the same I-cache set and VA_A old cache entry
still exists (tagged with pfn_q), then the CPU can fetch stale
instructions because the tag will match. That's one reason why we need
to invalidate the cache, but that will be skipped in the path:

    migrate_pages
     migrate_pages_batch
      migrate_folio_move
       remove_migration_ptes
        remove_migration_pte
         set_pte_at
          set_ptes
           __sync_icache_dcache  (skipped if !young)
            set_pte_ext

And migrated pages are always marked old:

mm/migrate.c=static bool remove_migration_pte(struct folio *folio,
mm/migrate.c:           if (!softleaf_is_migration_young(entry))
mm/migrate.c:                   pte = pte_mkold(pte);

include/linux/leafops.h:
static inline bool softleaf_is_migration_young(softleaf_t entry)
{
        VM_WARN_ON_ONCE(!softleaf_is_migration(entry));

        if (migration_entry_supports_ad())
                return swp_offset(entry) & SWP_MIG_YOUNG;
        /* Keep the old behavior of aging page after migration */
        return false;
}

I might be misunderstanding something, this took us a while to figure
out. But the patch seems to work for us. I hope I explained it a bit
better now.

Best regards,
Brian


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] mm/arm: pgtable: remove young bit check for pte_valid_user
  2026-04-09 15:17   ` Brian Ruley
@ 2026-04-09 16:00     ` Russell King (Oracle)
  2026-04-10 11:01       ` Brian Ruley
  0 siblings, 1 reply; 10+ messages in thread
From: Russell King (Oracle) @ 2026-04-09 16:00 UTC (permalink / raw)
  To: Brian Ruley; +Cc: Will Deacon, Steve Capper, linux-arm-kernel, linux-kernel

On Thu, Apr 09, 2026 at 06:17:36PM +0300, Brian Ruley wrote:
> However, in the case I describe, if VA_B is mapped immediately to pfn_q
> after it been has unmapped and freed for VA_A, then it's quite possible
> that the page is still indexed in the cache.

True.

> The hypothesis is that if
> VA_A and VA_B land in the same I-cache set and VA_A old cache entry
> still exists (tagged with pfn_q), then the CPU can fetch stale
> instructions because the tag will match. That's one reason why we need
> to invalidate the cache, but that will be skipped in the path:
> 
>     migrate_pages
>      migrate_pages_batch
>       migrate_folio_move
>        remove_migration_ptes
>         remove_migration_pte
>          set_pte_at
>           set_ptes
>            __sync_icache_dcache  (skipped if !young)
>             set_pte_ext

In this case, if the old PTE was marked !young, then the new PTE will
have:
	pte = pte_mkold(pte);

on it, which marks it !young. As you say, __sync_icache_dcache() will
be skipped. While a PTE entry will be set for the kernel, the code in
set_pte_ext() will *not* establish a hardware PTE entry. For the
2-level pte code:

        tst     r1, #L_PTE_YOUNG	@ <- results in Z being set
        tstne   r1, #L_PTE_VALID	@ <- not executed
        eorne   r1, r1, #L_PTE_NONE	@ <- not executed
        tstne   r1, #L_PTE_NONE		@ <- not executed
        moveq   r3, #0			@ <- hardware PTE value
 ARM(   str     r3, [r0, #2048]! )	@ <- writes hardware PTE

So, for a !young PTE, the hardware PTE entry is written as zero,
which means accesses should fault, which will then cause the PTE to
be marked young.

For the 3-level case, the L_PTE_YOUNG bit corresponds with the AF bit
in the PTE, and there aren't split Linux / hardware PTE entries. AF
being clear should result in a page fault being generated for the
kernel to handle making the PTE young.

In both of these cases, set_ptes() will need to be called with the
updated PTE which will now be marked young, and that will result in
the I-cache being flushed.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] mm/arm: pgtable: remove young bit check for pte_valid_user
  2026-04-09 16:00     ` Russell King (Oracle)
@ 2026-04-10 11:01       ` Brian Ruley
  2026-04-10 11:18         ` Russell King (Oracle)
  0 siblings, 1 reply; 10+ messages in thread
From: Brian Ruley @ 2026-04-10 11:01 UTC (permalink / raw)
  To: Russell King (Oracle)
  Cc: Will Deacon, Steve Capper, linux-arm-kernel, linux-kernel

On Apr 09, Russell King (Oracle) wrote:
> 
> On Thu, Apr 09, 2026 at 06:17:36PM +0300, Brian Ruley wrote:
> > However, in the case I describe, if VA_B is mapped immediately to pfn_q
> > after it been has unmapped and freed for VA_A, then it's quite possible
> > that the page is still indexed in the cache.
> 
> True.
> 
> > The hypothesis is that if
> > VA_A and VA_B land in the same I-cache set and VA_A old cache entry
> > still exists (tagged with pfn_q), then the CPU can fetch stale
> > instructions because the tag will match. That's one reason why we need
> > to invalidate the cache, but that will be skipped in the path:
> >
> >     migrate_pages
> >      migrate_pages_batch
> >       migrate_folio_move
> >        remove_migration_ptes
> >         remove_migration_pte
> >          set_pte_at
> >           set_ptes
> >            __sync_icache_dcache  (skipped if !young)
> >             set_pte_ext
> 
> In this case, if the old PTE was marked !young, then the new PTE will
> have:
>         pte = pte_mkold(pte);
> 
> on it, which marks it !young. As you say, __sync_icache_dcache() will
> be skipped. While a PTE entry will be set for the kernel, the code in
> set_pte_ext() will *not* establish a hardware PTE entry. For the
> 2-level pte code:
> 
>         tst     r1, #L_PTE_YOUNG        @ <- results in Z being set
>         tstne   r1, #L_PTE_VALID        @ <- not executed
>         eorne   r1, r1, #L_PTE_NONE     @ <- not executed
>         tstne   r1, #L_PTE_NONE         @ <- not executed
>         moveq   r3, #0                  @ <- hardware PTE value
>  ARM(   str     r3, [r0, #2048]! )      @ <- writes hardware PTE
> 
> So, for a !young PTE, the hardware PTE entry is written as zero,
> which means accesses should fault, which will then cause the PTE to
> be marked young.
> 
> For the 3-level case, the L_PTE_YOUNG bit corresponds with the AF bit
> in the PTE, and there aren't split Linux / hardware PTE entries. AF
> being clear should result in a page fault being generated for the
> kernel to handle making the PTE young.
> 
> In both of these cases, set_ptes() will need to be called with the
> updated PTE which will now be marked young, and that will result in
> the I-cache being flushed.

Hi Russell,

Thank you for the clarification, this is very educational for me.
I understand your scepticism, and I can't explain what's going on based
on what you replied. However, I do honestly believe there is a problem
here. I'll share the exact testing details and the instrumentation
we added that convinced us to reach out at the end. One idea we also
had was that could cache aliasing be happening here.

To clarify any potential misunderstanding, we've observed the
following:

- Sporadic SIGILL and SIGSEGV under memory pressure
- Scales with core count, i.e., quad core more likely to reproduce
  than dual core. We haven't observed an issue on single core.
- Coredumps show valid instructions at the faulting PC.
  The CPU executed something different from what's in memory.
  This pointed us to stale I-cache.
- Instrumentation indicates a correlation.
  A per-CPU ring buffer tracking exec page migrations was dumped on
  SIGILL. The faulting PC matched a recently migrated pages.
- We started seeing this after upgrade 6.1->6.12->6.18. We bisected
  two commits which had an impact, but we weren't convinced that
  either was the root cause: 5dfab109d5193e6c224d96cabf90e9cc2c039884
  and 6faea3422e3b4e8de44a55aa3e6e843320da66d2.
- Failed processes include systemd, tar, bash, ...
- Debug options, e.g., page poisoning, seems to hide the bug


> So you're saying that stress-ng doesn't reproduce this bug but
triggers the OOM-killer... confused.

Apologies for the confusion. I meant that with `stress-ng' we created
the memory pressure and OOM might have played a role in exposing the
"bug" as we (at the time) believed that anything that would trigger
memory free/reclaims and page migration was the key. One note I'll add
is that in our test we invoked stress-ng for 2 minutes (--timeout 2m)
and after each we would reboot the device. We had observed that reboots
seemed to have a discernible effect on the occurence in earlier testing
so we kept that in. I'm beginning to doubt if it had an effect now,
and unfortunately it's all anecdotal.

One more thing, even if you don't accept the patch, is this patch
harmful in any way or is it just sub-optimal?

I'll send the instrumentation patch as a follow-up, migh be there's a
flaw in it.

Best regards,
Brian

###TESTING###

1. stress-ng --vm 4 --vm-bytes 2G --vm-method zero-one --verify \
             --timeout 2m
2. reboot
3. repeat

Cleaned up logs of instrumentation:
```
kernel: [  104.610248] SIGILL at b6e6f1c0 pid 896, recent exec migrations:
kernel: [  104.610313]   cpu0: addr=b6e6f000 old_pfn=467d3 new_pfn=577fe pid=34 flushed=0
[...]
kernel: [  456.066661] SIGILL at b6d99f40 pid 455, recent exec migrations:
[...]
kernel: [  456.066963]   cpu0: addr=b6d99000 old_pfn=44270 new_pfn=7c9ea pid=34 flushed=0
[...] 
```


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] mm/arm: pgtable: remove young bit check for pte_valid_user
  2026-04-10 11:01       ` Brian Ruley
@ 2026-04-10 11:18         ` Russell King (Oracle)
  2026-04-10 11:43           ` [RFC PATCH] test: " Brian Ruley
  0 siblings, 1 reply; 10+ messages in thread
From: Russell King (Oracle) @ 2026-04-10 11:18 UTC (permalink / raw)
  To: Brian Ruley; +Cc: Will Deacon, Steve Capper, linux-arm-kernel, linux-kernel

On Fri, Apr 10, 2026 at 02:01:41PM +0300, Brian Ruley wrote:
> Thank you for the clarification, this is very educational for me.
> I understand your scepticism, and I can't explain what's going on based
> on what you replied. However, I do honestly believe there is a problem
> here. I'll share the exact testing details and the instrumentation
> we added that convinced us to reach out at the end. One idea we also
> had was that could cache aliasing be happening here.
> 
> To clarify any potential misunderstanding, we've observed the
> following:
> 
> - Sporadic SIGILL and SIGSEGV under memory pressure
> - Scales with core count, i.e., quad core more likely to reproduce
>   than dual core. We haven't observed an issue on single core.
> - Coredumps show valid instructions at the faulting PC.
>   The CPU executed something different from what's in memory.
>   This pointed us to stale I-cache.
> - Instrumentation indicates a correlation.
>   A per-CPU ring buffer tracking exec page migrations was dumped on
>   SIGILL. The faulting PC matched a recently migrated pages.
> - We started seeing this after upgrade 6.1->6.12->6.18. We bisected
>   two commits which had an impact, but we weren't convinced that
>   either was the root cause: 5dfab109d5193e6c224d96cabf90e9cc2c039884
>   and 6faea3422e3b4e8de44a55aa3e6e843320da66d2.
> - Failed processes include systemd, tar, bash, ...
> - Debug options, e.g., page poisoning, seems to hide the bug
> 
> 
> > So you're saying that stress-ng doesn't reproduce this bug but
> triggers the OOM-killer... confused.
> 
> Apologies for the confusion. I meant that with `stress-ng' we created
> the memory pressure and OOM might have played a role in exposing the
> "bug" as we (at the time) believed that anything that would trigger
> memory free/reclaims and page migration was the key. One note I'll add
> is that in our test we invoked stress-ng for 2 minutes (--timeout 2m)
> and after each we would reboot the device. We had observed that reboots
> seemed to have a discernible effect on the occurence in earlier testing
> so we kept that in. I'm beginning to doubt if it had an effect now,
> and unfortunately it's all anecdotal.
> 
> One more thing, even if you don't accept the patch, is this patch
> harmful in any way or is it just sub-optimal?
> 
> I'll send the instrumentation patch as a follow-up, migh be there's a
> flaw in it.

I'll try it - I have Cortex A9 systems (some which I rely on...)

Please can you also try to track the history of what happens for
the PTEs corresponding to the old and new PFN?

Thanks.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [RFC PATCH] test: mm/arm: pgtable: remove young bit check for pte_valid_user
  2026-04-10 11:18         ` Russell King (Oracle)
@ 2026-04-10 11:43           ` Brian Ruley
  0 siblings, 0 replies; 10+ messages in thread
From: Brian Ruley @ 2026-04-10 11:43 UTC (permalink / raw)
  To: Russell King
  Cc: Russell King, Will Deacon, Steve Capper, linux-arm-kernel,
	linux-kernel, Brian Ruley

Instrumentation to print recently migrated pages in undefined
instruction handler. This was used to determine if the faulting address
was migrated earlier.

Signed-off-by: Brian Ruley <brian.ruley@gehealthcare.com>
---
Not intended for integration. This is just to share the testing details.
---
 arch/arm/kernel/traps.c |  5 ++++
 mm/migrate.c            | 53 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 58 insertions(+)

diff --git a/arch/arm/kernel/traps.c b/arch/arm/kernel/traps.c
index afbd2ebe5c39..64ef872f1555 100644
--- a/arch/arm/kernel/traps.c
+++ b/arch/arm/kernel/traps.c
@@ -28,6 +28,8 @@
 #include <linux/irq.h>
 #include <linux/vmalloc.h>
 
+void migrate_exec_log_dump(unsigned long fault_addr);
+
 #include <linux/atomic.h>
 #include <asm/cacheflush.h>
 #include <asm/exception.h>
@@ -490,6 +492,9 @@ asmlinkage void do_undefinstr(struct pt_regs *regs)
 		dump_instr(KERN_INFO, regs);
 	}
 #endif
+	if (user_mode(regs))
+		migrate_exec_log_dump((unsigned long)pc);
+
 	arm_notify_die("Oops - undefined instruction", regs,
 		       SIGILL, ILL_ILLOPC, pc, 0, 6);
 }
diff --git a/mm/migrate.c b/mm/migrate.c
index 2c3d489ecf51..987d0376b433 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -48,6 +48,54 @@
 
 #include <trace/events/migrate.h>
 
+
+/* Debug: track recent exec page migrations */
+#define MIGRATE_EXEC_LOG_SIZE 8
+struct migrate_exec_entry {
+	unsigned long addr;
+	unsigned long old_pfn;
+	unsigned long new_pfn;
+	unsigned int pid;
+	bool flushed;
+};
+static DEFINE_PER_CPU(struct migrate_exec_entry[MIGRATE_EXEC_LOG_SIZE], migrate_exec_log);
+static DEFINE_PER_CPU(unsigned int, migrate_exec_idx);
+
+void migrate_exec_log_add(unsigned long addr, unsigned long old_pfn,
+			  unsigned long new_pfn, bool flushed)
+{
+	unsigned int idx = __this_cpu_read(migrate_exec_idx);
+	struct migrate_exec_entry *log = this_cpu_ptr(migrate_exec_log);
+	struct migrate_exec_entry *e = &log[idx];
+
+	e->addr = addr;
+	e->old_pfn = old_pfn;
+	e->new_pfn = new_pfn;
+	e->pid = current->pid;
+	e->flushed = flushed;
+	__this_cpu_write(migrate_exec_idx, (idx + 1) % MIGRATE_EXEC_LOG_SIZE);
+}
+
+void migrate_exec_log_dump(unsigned long fault_addr)
+{
+	int cpu;
+
+	pr_err("SIGILL at %lx pid %d, recent exec migrations:\n",
+	       fault_addr, current->pid);
+	for_each_online_cpu(cpu) {
+		struct migrate_exec_entry *log = per_cpu(migrate_exec_log, cpu);
+		int i;
+		for (i = 0; i < MIGRATE_EXEC_LOG_SIZE; i++) {
+			if (log[i].addr == 0)
+				continue;
+			pr_err("  cpu%d: addr=%lx old_pfn=%lx new_pfn=%lx pid=%d flushed=%d%s\n",
+			       cpu, log[i].addr, log[i].old_pfn, log[i].new_pfn,
+			       log[i].pid, log[i].flushed,
+			       (PAGE_ALIGN(fault_addr) == PAGE_ALIGN(log[i].addr)) ?
+			       " *** MATCH ***" : "");
+		}
+	}
+}
 #include "internal.h"
 #include "swap.h"
 
@@ -434,6 +482,11 @@ static bool remove_migration_pte(struct folio *folio,
 			else
 				folio_add_file_rmap_pte(folio, new, vma);
 			set_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte);
+
+			if (vma->vm_flags & VM_EXEC)
+				migrate_exec_log_add(pvmw.address,
+					swp_offset(entry), page_to_pfn(new),
+					pte_young(pte));
 		}
 		if (READ_ONCE(vma->vm_flags) & VM_LOCKED)
 			mlock_drain_local();
-- 
2.47.3



^ permalink raw reply related	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2026-04-10 11:44 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-04-09 12:54 [PATCH] mm/arm: pgtable: remove young bit check for pte_valid_user Brian Ruley
2026-04-09 13:56 ` Will Deacon
2026-04-09 14:21   ` Russell King (Oracle)
2026-04-09 14:43   ` Russell King (Oracle)
2026-04-09 15:17   ` Brian Ruley
2026-04-09 16:00     ` Russell King (Oracle)
2026-04-10 11:01       ` Brian Ruley
2026-04-10 11:18         ` Russell King (Oracle)
2026-04-10 11:43           ` [RFC PATCH] test: " Brian Ruley
2026-04-09 14:15 ` [PATCH] " Russell King (Oracle)

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox