* [PATCH v1 0/2] Optimize S2 page splitting
@ 2026-06-10 20:21 Leonardo Bras
2026-06-10 20:21 ` [PATCH v1 1/2] KVM: arm64: Introduce KVM_PGTABLE_WALK_SKIP_LEVEL* walk flags Leonardo Bras
2026-06-10 20:21 ` [PATCH v1 2/2] KVM: arm64: Make stage2_split_walker() skip unnecessary walks Leonardo Bras
0 siblings, 2 replies; 3+ messages in thread
From: Leonardo Bras @ 2026-06-10 20:21 UTC (permalink / raw)
To: Marc Zyngier, Oliver Upton, Joey Gouly, Steffen Eiden,
Suzuki K Poulose, Zenghui Yu, Catalin Marinas, Will Deacon,
Fuad Tabba, Leonardo Bras, Raghavendra Rao Ananta
Cc: linux-arm-kernel, kvmarm, linux-kernel
While playing with dirty-bit tracking, I decided to take a look at how page
splitting works. I found that all entries are walked, even though we can
infer, for instance, that:
- If a level-3 entry is walked, it means the parent level-2 entry is split
- If a split just succeeded on a table entry, it means all child nodes
are already split
The idea of this series is to introduce new walk flags that skip pagetable
levels 0-3.
The idea of skipping child nodes was also tested, but it was marginally
slower than just skipping levels, so it was discarded.
Optimization measured on two scenarios involving eager-splitting on a
VM with 1 memslot of 64GB:
- Scenario 1: No manual protect, whole memslot split at dirty-track enable
(KVM_SET_USER_MEMORY_REGION2 ioctl with KVM_MEM_LOG_DIRTY_PAGES)
- Split happens only once, whole region
- Evaluates improved batch performance of splitting
- Scenario 2: Manual protect, split happens during every dirty-bit clean
(KVM_CLEAR_DIRTY_LOG ioctl), average for 2 iterations.
- Split called multiple times, for smaller 64-page sections.
- Evaluates improved performance for multiple calls
Scenario 1, improvement on dirty-track enable ioctl for the memslot:
- Memory was already split (4k pages): -35.47% runtime (stdev 5.63%)
- THP backed memory: -11.94% runtime (stdev 2.55%)
- 64x1GB hugetlb memory: -14.46% runtime (stdev 2.68%)
Scenario 2, improvement on dirty-log clean ioctl for the memslot:
- Memory was already split (4k pages): -26.36% runtime (stdev 3.32%)
- THP backed memory: -12.05% runtime (stdev 0.37%)
- 64x1GB hugetlb memory: -13.87% runtime (stdev 0.86%)
To collect the numbers above, the following script was run on both vanilla
and patched kernels, with kernel parameter 'default_hugepagesz=1G', on an
AmpereOne with 256GB RAM.
--- dirty_test.sh
#!/bin/bash
# Log file is named after the kernel release suffix (fields 4+ of `uname -r`).
filename=$(uname -r | cut -d'-' -f 4-)
run_test() {
    uname -a
    cat /proc/cmdline
    # Prepare: reserve 64 hugepages of the default size (1G with default_hugepagesz=1G)
    sudo bash -c 'echo 64 > /proc/sys/vm/nr_hugepages'
    ./dirty_log_perf_test -g -b 64G
    ./dirty_log_perf_test -g -b 64G -s anonymous_thp
    ./dirty_log_perf_test -g -b 64G -s shared_hugetlb
    ./dirty_log_perf_test -b 64G
    ./dirty_log_perf_test -b 64G -s anonymous_thp
    ./dirty_log_perf_test -b 64G -s shared_hugetlb
}
run_test 2>&1 | tee "${filename}"
---
The dirty_log_perf_test command above is the standard KVM selftest found in
the kernel tree. It tested the following guest modes:
Testing guest mode: PA-bits:48, VA-bits:48, 4K pages
Testing guest mode: PA-bits:48, VA-bits:48, 16K pages
Testing guest mode: PA-bits:48, VA-bits:48, 64K pages
Testing guest mode: PA-bits:40, VA-bits:48, 4K pages
Testing guest mode: PA-bits:40, VA-bits:48, 16K pages
Testing guest mode: PA-bits:40, VA-bits:48, 64K pages
Performance numbers from the modes above were used to calculate the average
and stdev shown in the optimization results.
Changes since v1:
- Changed approach from return value to walk flags (Will Deacon)
- Discarded skip_child approach (Oliver Upton)
- Measured on real hardware, and from userspace perspective (Marc Zyngier)
- Better explanation of what and how numbers were collected
v1 Link: https://lore.kernel.org/all/20260515195904.2466381-1-leo.bras@arm.com/
Thanks!
Leo
Leonardo Bras (2):
KVM: arm64: Introduce KVM_PGTABLE_WALK_SKIP_LEVEL* walk flags
KVM: arm64: Make stage2_split_walker() skip unnecessary walks
arch/arm64/include/asm/kvm_pgtable.h | 13 +++++++++++++
arch/arm64/kvm/hyp/pgtable.c | 18 ++++++++++++++++--
2 files changed, 29 insertions(+), 2 deletions(-)
base-commit: acb7500801e98639f6d8c2d796ed9f64cba83d3a
--
2.54.0
* [PATCH v1 1/2] KVM: arm64: Introduce KVM_PGTABLE_WALK_SKIP_LEVEL* walk flags
2026-06-10 20:21 [PATCH v1 0/2] Optimize S2 page splitting Leonardo Bras
@ 2026-06-10 20:21 ` Leonardo Bras
2026-06-10 20:21 ` [PATCH v1 2/2] KVM: arm64: Make stage2_split_walker() skip unnecessary walks Leonardo Bras
1 sibling, 0 replies; 3+ messages in thread
From: Leonardo Bras @ 2026-06-10 20:21 UTC (permalink / raw)
To: Marc Zyngier, Oliver Upton, Joey Gouly, Steffen Eiden,
Suzuki K Poulose, Zenghui Yu, Catalin Marinas, Will Deacon,
Fuad Tabba, Leonardo Bras, Raghavendra Rao Ananta
Cc: linux-arm-kernel, kvmarm, linux-kernel
Add the new walking flags that tell kvm_pgtable_walk() to skip lower levels
when walking the pagetables.
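
To make the fls()/ffs() arithmetic in kvm_pgtable_skip_level() easier to
follow, here is a minimal userspace sketch of the same computation. The
model_* helpers and the unprefixed SKIP_LEVEL* macros are stand-ins invented
for this illustration (the bit positions mirror the enum added by this patch);
it is a model under those assumptions, not the kernel code itself.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace stand-ins for the flag bits added by the patch. */
#define BIT(n)       (1u << (n))
#define SKIP_LEVEL0  BIT(7)
#define SKIP_LEVEL1  BIT(8)
#define SKIP_LEVEL2  BIT(9)
#define SKIP_LEVEL3  BIT(10)
#define SKIP_LEVELS  (SKIP_LEVEL0 | SKIP_LEVEL1 | SKIP_LEVEL2 | SKIP_LEVEL3)

/* 1-based index of the lowest set bit, 0 if no bit is set. */
static int model_ffs(unsigned int x)
{
	int i = 1;

	if (!x)
		return 0;
	while (!(x & 1u)) {
		x >>= 1;
		i++;
	}
	return i;
}

/* 1-based index of the highest set bit, 0 if no bit is set. */
static int model_fls(unsigned int x)
{
	int i = 0;

	while (x) {
		x >>= 1;
		i++;
	}
	return i;
}

/* Returns true if entries at 'level' should not be visited. */
static bool model_skip_level(int level, unsigned int flags)
{
	/* Levels outside -1..3 are not meaningful in this model. */
	assert(level >= -1 && level <= 3);

	flags &= SKIP_LEVELS;
	if (!flags)
		return false;

	/*
	 * model_ffs(SKIP_LEVELS) is 8 (bit 7 is the lowest skip bit), so
	 * subtracting it maps SKIP_LEVEL0..3 (fls 8..11) to thresholds 0..3.
	 */
	return level >= model_fls(flags) - model_ffs(SKIP_LEVELS);
}
```

In this model, SKIP_LEVEL3 gives a threshold of 11 - 8 = 3, so only level 3 is
skipped, while SKIP_LEVEL2 skips levels 2 and 3. When several flags are
combined, fls() picks the highest set bit, so the highest-numbered level
decides the threshold.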
Signed-off-by: Leonardo Bras <leo.bras@arm.com>
---
arch/arm64/include/asm/kvm_pgtable.h | 13 +++++++++++++
arch/arm64/kvm/hyp/pgtable.c | 15 ++++++++++++++-
2 files changed, 27 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index 41a8687938eb..20c7c12e0e76 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -311,31 +311,44 @@ typedef bool (*kvm_pgtable_force_pte_cb_t)(u64 addr, u64 end,
* @KVM_PGTABLE_WALK_SHARED: Indicates the page-tables may be shared
* with other software walkers.
* @KVM_PGTABLE_WALK_IGNORE_EAGAIN: Don't terminate the walk early if
* the walker returns -EAGAIN.
* @KVM_PGTABLE_WALK_SKIP_BBM_TLBI: Visit and update table entries
* without Break-before-make's
* TLB invalidation.
* @KVM_PGTABLE_WALK_SKIP_CMO: Visit and update table entries
* without Cache maintenance
* operations required.
+ * @KVM_PGTABLE_WALK_SKIP_LEVEL0: Skip visiting level-0+ entries
+ * @KVM_PGTABLE_WALK_SKIP_LEVEL1: Skip visiting level-1+ entries
+ * @KVM_PGTABLE_WALK_SKIP_LEVEL2: Skip visiting level-2+ entries
+ * @KVM_PGTABLE_WALK_SKIP_LEVEL3: Skip visiting level-3 entries
*/
enum kvm_pgtable_walk_flags {
KVM_PGTABLE_WALK_LEAF = BIT(0),
KVM_PGTABLE_WALK_TABLE_PRE = BIT(1),
KVM_PGTABLE_WALK_TABLE_POST = BIT(2),
KVM_PGTABLE_WALK_SHARED = BIT(3),
KVM_PGTABLE_WALK_IGNORE_EAGAIN = BIT(4),
KVM_PGTABLE_WALK_SKIP_BBM_TLBI = BIT(5),
KVM_PGTABLE_WALK_SKIP_CMO = BIT(6),
+ KVM_PGTABLE_WALK_SKIP_LEVEL0 = BIT(7),
+ KVM_PGTABLE_WALK_SKIP_LEVEL1 = BIT(8),
+ KVM_PGTABLE_WALK_SKIP_LEVEL2 = BIT(9),
+ KVM_PGTABLE_WALK_SKIP_LEVEL3 = BIT(10),
};
+#define KVM_PGTABLE_WALK_SKIP_LEVELS (KVM_PGTABLE_WALK_SKIP_LEVEL0 | \
+ KVM_PGTABLE_WALK_SKIP_LEVEL1 | \
+ KVM_PGTABLE_WALK_SKIP_LEVEL2 | \
+ KVM_PGTABLE_WALK_SKIP_LEVEL3 )
+
struct kvm_pgtable_visit_ctx {
kvm_pte_t *ptep;
kvm_pte_t old;
void *arg;
struct kvm_pgtable_mm_ops *mm_ops;
u64 start;
u64 addr;
u64 end;
s8 level;
enum kvm_pgtable_walk_flags flags;
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 91a7dfad6686..48d88a290a53 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -137,20 +137,33 @@ static bool kvm_pgtable_walk_continue(const struct kvm_pgtable_walker *walker,
* Ignore the return code altogether for walkers outside a fault handler
* (e.g. write protecting a range of memory) and chug along with the
* page table walk.
*/
if (r == -EAGAIN)
return walker->flags & KVM_PGTABLE_WALK_IGNORE_EAGAIN;
return !r;
}
+static __always_inline bool kvm_pgtable_skip_level(s8 level, enum kvm_pgtable_walk_flags flags)
+{
+ flags &= KVM_PGTABLE_WALK_SKIP_LEVELS;
+
+ if (likely(!flags))
+ return false;
+
+ if (level >= (fls(flags) - ffs(KVM_PGTABLE_WALK_SKIP_LEVELS)))
+ return true;
+
+ return false;
+}
+
static int __kvm_pgtable_walk(struct kvm_pgtable_walk_data *data,
struct kvm_pgtable_mm_ops *mm_ops, kvm_pteref_t pgtable, s8 level);
static inline int __kvm_pgtable_visit(struct kvm_pgtable_walk_data *data,
struct kvm_pgtable_mm_ops *mm_ops,
kvm_pteref_t pteref, s8 level)
{
enum kvm_pgtable_walk_flags flags = data->walker->flags;
kvm_pte_t *ptep = kvm_dereference_pteref(data->walker, pteref);
struct kvm_pgtable_visit_ctx ctx = {
@@ -185,21 +198,21 @@ static inline int __kvm_pgtable_visit(struct kvm_pgtable_walk_data *data,
* into a newly installed or replaced table.
*/
if (reload) {
ctx.old = READ_ONCE(*ptep);
table = kvm_pte_table(ctx.old, level);
}
if (!kvm_pgtable_walk_continue(data->walker, ret))
goto out;
- if (!table) {
+ if (!table || kvm_pgtable_skip_level(level + 1, ctx.flags)) {
data->addr = ALIGN_DOWN(data->addr, kvm_granule_size(level));
data->addr += kvm_granule_size(level);
goto out;
}
childp = (kvm_pteref_t)kvm_pte_follow(ctx.old, mm_ops);
ret = __kvm_pgtable_walk(data, mm_ops, childp, level + 1);
if (!kvm_pgtable_walk_continue(data->walker, ret))
goto out;
--
2.54.0
* [PATCH v1 2/2] KVM: arm64: Make stage2_split_walker() skip unnecessary walks
2026-06-10 20:21 [PATCH v1 0/2] Optimize S2 page splitting Leonardo Bras
2026-06-10 20:21 ` [PATCH v1 1/2] KVM: arm64: Introduce KVM_PGTABLE_WALK_SKIP_LEVEL* walk flags Leonardo Bras
@ 2026-06-10 20:21 ` Leonardo Bras
1 sibling, 0 replies; 3+ messages in thread
From: Leonardo Bras @ 2026-06-10 20:21 UTC (permalink / raw)
To: Marc Zyngier, Oliver Upton, Joey Gouly, Steffen Eiden,
Suzuki K Poulose, Zenghui Yu, Catalin Marinas, Will Deacon,
Fuad Tabba, Leonardo Bras, Raghavendra Rao Ananta
Cc: linux-arm-kernel, kvmarm, linux-kernel
Currently, when splitting a hugepage, all its child and sibling nodes are
walked, with the walker just returning early if there is nothing to do.
This means every pagetable entry in the splitting range gets a callback
from the walker function, even if it is a level-3 entry.
Optimize splitting by skipping all level-3 entries, as they are already the
smallest block size and can't be split any further.
(i.e. set flag KVM_PGTABLE_WALK_SKIP_LEVEL3)
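
As a rough sense of scale for the 64GB memslot used in the measurements, the
sketch below counts how many entries each level needs to cover a region. It
assumes a 4K granule (9 address bits resolved per level, level-3 entries
mapping 4K); 16K and 64K granules use different shifts. level_shift() and
entries_at_level() are hypothetical helpers written for this example, not
kernel functions.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4K granule, 512 entries per table. */
#define PAGE_SHIFT_4K   12
#define BITS_PER_LEVEL  9
#define LAST_LEVEL      3

/* log2 of the region mapped by a single entry at 'level' (0..3). */
static unsigned int level_shift(int level)
{
	assert(level >= 0 && level <= LAST_LEVEL);
	return PAGE_SHIFT_4K + BITS_PER_LEVEL * (LAST_LEVEL - level);
}

/* Number of entries at 'level' needed to cover 'size' bytes, rounded up. */
static uint64_t entries_at_level(uint64_t size, int level)
{
	uint64_t granule = 1ULL << level_shift(level);

	return (size + granule - 1) / granule;
}
```

For 64GB this gives 16777216 level-3 entries against 32768 level-2 entries, so
in a fully split memslot almost all leaf callbacks land on level-3 entries,
which cannot be split any further.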
Optimization measured on two scenarios involving eager-splitting on a
VM with 1 memslot of 64GB:
- Scenario 1: No manual protect, whole memslot split at dirty-track enable
(KVM_SET_USER_MEMORY_REGION2 ioctl with KVM_MEM_LOG_DIRTY_PAGES)
- Scenario 2: Manual protect, split happens during dirty-bit clean
(KVM_CLEAR_DIRTY_LOG ioctl), average for 2 iterations.
Scenario 1, improvement on dirty-track enable for the memslot:
- Memory was already split (4k pages): -35.47% runtime
- THP backed memory: -11.94% runtime
- 64x1GB hugetlb memory: -14.46% runtime
Scenario 2, improvement on dirty-log clean for the memslot:
- Memory was already split (4k pages): -26.36% runtime
- THP backed memory: -12.05% runtime
- 64x1GB hugetlb memory: -13.87% runtime
Signed-off-by: Leonardo Bras <leo.bras@arm.com>
---
arch/arm64/kvm/hyp/pgtable.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 48d88a290a53..70103934a04a 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -1565,21 +1565,22 @@ static int stage2_split_walker(const struct kvm_pgtable_visit_ctx *ctx,
new = kvm_init_table_pte(childp, mm_ops);
stage2_make_pte(ctx, new);
return 0;
}
int kvm_pgtable_stage2_split(struct kvm_pgtable *pgt, u64 addr, u64 size,
struct kvm_mmu_memory_cache *mc)
{
struct kvm_pgtable_walker walker = {
.cb = stage2_split_walker,
- .flags = KVM_PGTABLE_WALK_LEAF,
+ .flags = KVM_PGTABLE_WALK_LEAF |
+ KVM_PGTABLE_WALK_SKIP_LEVEL3,
.arg = mc,
};
int ret;
ret = kvm_pgtable_walk(pgt, addr, size, &walker);
dsb(ishst);
return ret;
}
int __kvm_pgtable_stage2_init(struct kvm_pgtable *pgt, struct kvm_s2_mmu *mmu,
--
2.54.0