* [PATCH] mm/damon: fix stale TLB young-state handling on arm64
@ 2026-05-25 14:48 Kunwu Chan
2026-05-25 17:46 ` SeongJae Park
0 siblings, 1 reply; 3+ messages in thread
From: Kunwu Chan @ 2026-05-25 14:48 UTC (permalink / raw)
To: sj, akpm; +Cc: damon, linux-mm, linux-kernel, Kunwu Chan, Wang Lian
From: Kunwu Chan <chentao@kylinos.cn>
damon_ptep_mkold() clears the PTE Access Flag so that a later
access will set it again and damon_folio_young() can observe it
via pte_young().
On arm64, however, ptep_test_and_clear_young() clears AF in the
page tables without invalidating the corresponding TLB entry.
Subsequent accesses can therefore continue hitting a stale TLB
entry without a page table walk. The PTE AF bit stays clear,
pte_young() reports false, and DAMON treats the region as
unaccessed.
folio_set_idle() does not help here. It updates only software
state, and accesses through a stale TLB entry do not clear the
idle flag.
As a result, nr_accesses stays low regardless of the real access
pattern. DAMOS schemes fail to match, WSS estimation reports
zero, and actions like pageout never trigger.
Fix this by switching to ptep_clear_flush_young() and
pmdp_clear_flush_young().
On arm64 these perform the required TLB invalidation after
clearing AF. The invalidation is deferred, but still sufficient
for DAMON's sampling granularity.
On x86, ptep_clear_flush_young() is equivalent to
ptep_test_and_clear_young() for base pages, so there is no
behavioral change. pmdp_clear_flush_young() additionally performs
a flush at PMD level, matching the existing x86 implementation.
On powerpc, riscv, and s390, the clear_flush variants currently
map back to test_and_clear implementations, so this patch does not
change their behavior.
Reproduced on arm64 (128 CPUs, 7.1.0-rc4):
before:
WSS estimation: 50th percentile error 100% (reported as zero)
apply_interval: schemes never tried
after:
WSS estimation: 50th percentile error 0.08%
apply_interval: passes
Co-developed-by: Wang Lian <lianux.mm@gmail.com>
Signed-off-by: Wang Lian <lianux.mm@gmail.com>
Signed-off-by: Kunwu Chan <chentao@kylinos.cn>
Tested-by: Kunwu Chan <chentao@kylinos.cn>
---
mm/damon/ops-common.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/mm/damon/ops-common.c b/mm/damon/ops-common.c
index 8c6d61342..33d689c14 100644
--- a/mm/damon/ops-common.c
+++ b/mm/damon/ops-common.c
@@ -63,7 +63,7 @@ void damon_ptep_mkold(pte_t *pte, struct vm_area_struct *vma, unsigned long addr
* device aspects.
*/
if (likely(pte_present(pteval)))
- young |= ptep_test_and_clear_young(vma, addr, pte);
+ young |= ptep_clear_flush_young(vma, addr, pte);
young |= mmu_notifier_clear_young(vma->vm_mm, addr, addr + PAGE_SIZE);
if (young)
folio_set_young(folio);
@@ -90,7 +90,7 @@ void damon_pmdp_mkold(pmd_t *pmd, struct vm_area_struct *vma, unsigned long addr
return;
if (likely(pmd_present(pmdval)))
- young |= pmdp_test_and_clear_young(vma, addr, pmd);
+ young |= pmdp_clear_flush_young(vma, addr, pmd);
young |= mmu_notifier_clear_young(vma->vm_mm, addr, addr + HPAGE_PMD_SIZE);
if (young)
folio_set_young(folio);
--
2.43.0
^ permalink raw reply related [flat|nested] 3+ messages in thread
* Re: [PATCH] mm/damon: fix stale TLB young-state handling on arm64
2026-05-25 14:48 [PATCH] mm/damon: fix stale TLB young-state handling on arm64 Kunwu Chan
@ 2026-05-25 17:46 ` SeongJae Park
2026-05-26 8:57 ` Kunwu Chan
0 siblings, 1 reply; 3+ messages in thread
From: SeongJae Park @ 2026-05-25 17:46 UTC (permalink / raw)
To: Kunwu Chan
Cc: SeongJae Park, akpm, damon, linux-mm, linux-kernel, Kunwu Chan,
Wang Lian
On Mon, 25 May 2026 22:48:46 +0800 Kunwu Chan <kunwu.chan@linux.dev> wrote:
> From: Kunwu Chan <chentao@kylinos.cn>
Thank you for this great patch!
>
> damon_ptep_mkold() clears the PTE Access Flag so that a later
> access will set it again and damon_folio_young() can observe it
> via pte_young().
>
> On arm64, however, ptep_test_and_clear_young() clears AF in the
> page tables without invalidating the corresponding TLB entry.
> Subsequent accesses can therefore continue hitting a stale TLB
> entry without a page table walk. The PTE AF bit stays clear,
> pte_young() reports false, and DAMON treats the region as
> unaccessed.
>
> folio_set_idle() does not help here. It updates only software
> state, and accesses through a stale TLB entry do not clear the
> idle flag.
>
> As a result, nr_accesses stays low regardless of the real access
> pattern. DAMOS schemes fail to match, WSS estimation reports
> zero, and actions like pageout never trigger.
You are correct. Nonetheless, we intentionally designed DAMON in this way, to
avoid the performance overhead from the added TLB flushes. Meanwhile, we
believed this is ok in terms of the monitoring results accuracy on real world
production environment, becasue such environments would have large amount of
working set that flushes TLB buffers anyway. We decided to take this way after
a measurement [1].
And apparently you found this behavior as problematic on a test environment
that having small size of working set. We found a similar issue in DAMON
selftest, and we updated the test [2] to simulate the expected real world
production environments, rather than changing DAMON.
But, this kind of question is recurring. In addition to the previous
discussion, there were a few private inqueries for this issue. And though the
real world production environment is the priamry target of DAMON, I understand
it is better to support testing environment, too. So, I think it is better to
make some changes for this issue, if it doesn't make other problems.
>
> Fix this by switching to ptep_clear_flush_young() and
> pmdp_clear_flush_young().
>
> On arm64 these perform the required TLB invalidation after
> clearing AF. The invalidation is deferred, but still sufficient
> for DAMON's sampling granularity.
>
> On x86, ptep_clear_flush_young() is equivalent to
> ptep_test_and_clear_young() for base pages, so there is no
> behavioral change. pmdp_clear_flush_young() additionally performs
> a flush at PMD level, matching the existing x86 implementation.
>
> On powerpc, riscv, and s390, the clear_flush variants currently
> map back to test_and_clear implementations, so this patch does not
> change their behavior.
This change seems much nicer and might be more optimized than my simple
implementation of tlb flush [1] that I tested before.
>
> Reproduced on arm64 (128 CPUs, 7.1.0-rc4):
>
> before:
> WSS estimation: 50th percentile error 100% (reported as zero)
> apply_interval: schemes never tried
>
> after:
> WSS estimation: 50th percentile error 0.08%
> apply_interval: passes
And nice test results. I guess you are referring to the tests in damon-tests?
Clarifying the context would be nice.
Also, have you had a chance to measure the performance impact?
So, I'd like to have this change. But, unless we have very clear evidence
showing this change is not increasing the performance overhead, I'd prefer
making this as an optional feature.
For the user interface, we could add a new sysfs file for the option, say,
'flush_sample_tlb' under 'monitoring_attrs' directory.
For long term, I'm planning [1] to extend the data attributes monitoring
feature so that data access becomes just one of the attributes. Once it is
done, we could control this tlb flush option using the probes interface.
I was initially thinking about asking Kunwu to wait until the data attributes
monitoring extension is done, and add this tlb flush option on top of that.
Because, otherwise, we may need to deprecate 'flush_sample_tlb' after the
extension is done. But, we will anyway need to deprecate a few interfaces
including 'nr_accesses'. Doing the deprecation of 'flush_sample_tlb' together
with it shouldn't be huge amount of overhead.
So, unless Kunwu and Lian has other concerns, I'd suggest the
'flush_sample_tlb' path.
[...]
[1] https://lore.kernel.org/20200403103059.12762-1-sjpark@amazon.com/
[2] https://lore.kernel.org/20260117020731.226785-3-sj@kernel.org/
Thanks,
SJ
^ permalink raw reply [flat|nested] 3+ messages in thread
* Re: [PATCH] mm/damon: fix stale TLB young-state handling on arm64
2026-05-25 17:46 ` SeongJae Park
@ 2026-05-26 8:57 ` Kunwu Chan
0 siblings, 0 replies; 3+ messages in thread
From: Kunwu Chan @ 2026-05-26 8:57 UTC (permalink / raw)
To: SeongJae Park, Kunwu Chan, Wang Lian; +Cc: akpm, damon, linux-mm, linux-kernel
May 26, 2026 at 1:46 AM, "SeongJae Park" <sj@kernel.org mailto:sj@kernel.org?to=%22SeongJae%20Park%22%20%3Csj%40kernel.org%3E > wrote:
>
> On Mon, 25 May 2026 22:48:46 +0800 Kunwu Chan <kunwu.chan@linux.dev> wrote:
>
> >
> > From: Kunwu Chan <chentao@kylinos.cn>
> >
> Thank you for this great patch!
Thanks for the reply.
>
> >
> > damon_ptep_mkold() clears the PTE Access Flag so that a later
> > access will set it again and damon_folio_young() can observe it
> > via pte_young().
> >
> > On arm64, however, ptep_test_and_clear_young() clears AF in the
> > page tables without invalidating the corresponding TLB entry.
> > Subsequent accesses can therefore continue hitting a stale TLB
> > entry without a page table walk. The PTE AF bit stays clear,
> > pte_young() reports false, and DAMON treats the region as
> > unaccessed.
> >
> > folio_set_idle() does not help here. It updates only software
> > state, and accesses through a stale TLB entry do not clear the
> > idle flag.
> >
> > As a result, nr_accesses stays low regardless of the real access
> > pattern. DAMOS schemes fail to match, WSS estimation reports
> > zero, and actions like pageout never trigger.
> >
> You are correct. Nonetheless, we intentionally designed DAMON in this way, to
> avoid the performance overhead from the added TLB flushes. Meanwhile, we
> believed this is ok in terms of the monitoring results accuracy on real world
> production environment, becasue such environments would have large amount of
> working set that flushes TLB buffers anyway. We decided to take this way after
> a measurement [1].
>
Understood, thanks for the detailed explanation and context. We were
too focused on the testing issue and overlooked the original tradeoff
between monitoring accuracy and the overhead of TLB flushing.
> And apparently you found this behavior as problematic on a test environment
> that having small size of working set. We found a similar issue in DAMON
> selftest, and we updated the test [2] to simulate the expected real world
> production environments, rather than changing DAMON.
>
That makes sense.
The selftest change in [2] also clarified why DAMON normally works fine
in production workloads. Large and active workloads naturally exceed
TLB coverage, so accessed-bit updates eventually become visible again.
So I agree that adjusting the test instead of forcing TLB flushes in
DAMON was the right tradeoff there.
> But, this kind of question is recurring. In addition to the previous
> discussion, there were a few private inqueries for this issue. And though the
> real world production environment is the priamry target of DAMON, I understand
> it is better to support testing environment, too. So, I think it is better to
> make some changes for this issue, if it doesn't make other problems.
>
At the same time, I also understand why this keeps coming up. Test
environments can differ a lot in workload size, THP usage, VM setup,
and hardware configuration, so stable and reproducible testing is not
always easy across different systems.
So I agree it makes sense to improve this area as long as it does not
add noticeable overhead or cause problems for the normal production
case.
> >
> > Fix this by switching to ptep_clear_flush_young() and
> > pmdp_clear_flush_young().
> >
> > On arm64 these perform the required TLB invalidation after
> > clearing AF. The invalidation is deferred, but still sufficient
> > for DAMON's sampling granularity.
> >
> > On x86, ptep_clear_flush_young() is equivalent to
> > ptep_test_and_clear_young() for base pages, so there is no
> > behavioral change. pmdp_clear_flush_young() additionally performs
> > a flush at PMD level, matching the existing x86 implementation.
> >
> > On powerpc, riscv, and s390, the clear_flush variants currently
> > map back to test_and_clear implementations, so this patch does not
> > change their behavior.
> >
> This change seems much nicer and might be more optimized than my simple
> implementation of tlb flush [1] that I tested before.
>
Thanks. Yes, we were mainly trying to reuse the existing
arch-optimized helpers as much as possible.
> >
> > Reproduced on arm64 (128 CPUs, 7.1.0-rc4):
> >
> > before:
> > WSS estimation: 50th percentile error 100% (reported as zero)
> > apply_interval: schemes never tried
> >
> > after:
> > WSS estimation: 50th percentile error 0.08%
> > apply_interval: passes
> >
> And nice test results. I guess you are referring to the tests in damon-tests?
> Clarifying the context would be nice.
>
Yes, those results are from: make -C tools/testing/selftests/damon run_tests
on the arm64 test machine mentioned above.
The before/after summary was extracted from the relevant failing tests
(sysfs_update_schemes_tried_regions_wss_estimation.py and
damos_apply_interval.py) for brevity.
> Also, have you had a chance to measure the performance impact?
We haven't done detailed performance measurements yet, but we can try to
collect some numbers for the flush overhead on a few different setups.
> So, I'd like to have this change. But, unless we have very clear evidence
> showing this change is not increasing the performance overhead, I'd prefer
> making this as an optional feature.
>
We agree that making it optional sounds safer unless we have solid
evidence showing the overhead is negligible. Keeping the current
default behavior for production workloads also makes sense to me.
> For the user interface, we could add a new sysfs file for the option, say,
> 'flush_sample_tlb' under 'monitoring_attrs' directory.
>
The proposed 'flush_sample_tlb' interface under monitoring_attrs sounds
reasonable to me as well.
> For long term, I'm planning [1] to extend the data attributes monitoring
> feature so that data access becomes just one of the attributes. Once it is
> done, we could control this tlb flush option using the probes interface.
>
> I was initially thinking about asking Kunwu to wait until the data attributes
> monitoring extension is done, and add this tlb flush option on top of that.
> Because, otherwise, we may need to deprecate 'flush_sample_tlb' after the
> extension is done. But, we will anyway need to deprecate a few interfaces
> including 'nr_accesses'. Doing the deprecation of 'flush_sample_tlb' together
> with it shouldn't be huge amount of overhead.
>
> So, unless Kunwu and Lian has other concerns, I'd suggest the
> 'flush_sample_tlb' path.
Yes, we also think deprecating it later together with interfaces such as
'nr_accesses' should be manageable once the data attributes monitoring
extension is ready.
Thanks, Kunwu
>
> [...]
>
> [1] https://lore.kernel.org/20200403103059.12762-1-sjpark@amazon.com/
> [2] https://lore.kernel.org/20260117020731.226785-3-sj@kernel.org/
>
> Thanks,
> SJ
>
^ permalink raw reply [flat|nested] 3+ messages in thread
end of thread, other threads:[~2026-05-26 8:58 UTC | newest]
Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-25 14:48 [PATCH] mm/damon: fix stale TLB young-state handling on arm64 Kunwu Chan
2026-05-25 17:46 ` SeongJae Park
2026-05-26 8:57 ` Kunwu Chan
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox