Kernel KVM virtualization development
 help / color / mirror / Atom feed
From: Oliver Upton <oupton@kernel.org>
To: Leonardo Bras <leo.bras@arm.com>
Cc: sashiko-reviews@lists.linux.dev, Marc Zyngier <maz@kernel.org>,
	kvmarm@lists.linux.dev, kvm@vger.kernel.org,
	Wei-Lin Chang <weilin.chang@arm.com>
Subject: Re: [PATCH v2 02/13] KVM: arm64: Enable eager hugepage splitting if HDBSS is available
Date: Mon, 29 Jun 2026 10:06:38 -0700	[thread overview]
Message-ID: <akKmHqRaZEUjN3zY@kernel.org> (raw)
In-Reply-To: <akKFlzAdB43lRpi1@LeoBrasDK>

On Mon, Jun 29, 2026 at 03:47:51PM +0100, Leonardo Bras wrote:
> On Mon, Jun 29, 2026 at 11:36:45AM +0000, sashiko-bot@kernel.org wrote:
> > Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
> > 
> > New issues:
> > - [High] The default eager page splitting chunk size violates the KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE UAPI contract by using a value that is not a valid block size for 16KB and 64KB page kernels.
> > 
> > Pre-existing issues:
> > - [Critical] Dropping mmu_lock between primary and nested stage-2 write-protection creates a window where nested guests can bypass dirty logging.
> > --
> > 
> > commit 09e1d250ed331b9dc20b42b7f9315892a86a5d50
> > Author: Leonardo Bras <leo.bras@arm.com>
> > 
> > KVM: arm64: Enable eager hugepage splitting if HDBSS is available
> > 
> > This commit enables eager hugepage splitting by default for systems
> > supporting HDBSS, setting the default chunk size to 256K pages at
> > stage2 initialization.
> > 
> > > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > > index 6c941aaa10c63..e086c01a9325e 100644
> > > --- a/arch/arm64/kvm/mmu.c
> > > +++ b/arch/arm64/kvm/mmu.c
> > > @@ -1027,8 +1027,12 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long t
> > >  	for_each_possible_cpu(cpu)
> > >  		*per_cpu_ptr(mmu->last_vcpu_ran, cpu) = -1;
> > >  
> > > -	 /* The eager page splitting is disabled by default */
> > > -	mmu->split_page_chunk_size = KVM_ARM_EAGER_SPLIT_CHUNK_SIZE_DEFAULT;
> > > +	 /* The eager page splitting is disabled by default if system has no HDBSS */
> > > +	if (system_supports_hdbss())
> > > +		mmu->split_page_chunk_size = 256 * 1024 * PAGE_SIZE;
> > 
> > [Severity: High]
> > Does this hardcoded chunk size violate the KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE
> > UAPI contract on kernels with 16KB or 64KB pages?
> > 
> > By setting the default chunk size to 256 * 1024 * PAGE_SIZE, this evaluates to
> > 4GB on 16KB page systems and 16GB on 64KB page systems.
> > 
> > However, in arch/arm64/kvm/arm.c:kvm_vm_ioctl_enable_cap(), the chunk size is
> > validated against valid architectural block sizes:
> > 
> >     if (!new_cap || kvm_is_block_size_supported(new_cap)) {
> >     ...
> > 
> > Since 4GB and 16GB are not valid block sizes on those configurations, a VMM
> > like QEMU attempting to read the default capability via KVM_CHECK_EXTENSION
> > and set it back via KVM_ENABLE_CAP will be rejected with -EINVAL.
> 
> Sure, changing this to something like PMD_SIZE or PUD_SIZE should be no issue.
>  
> But this raises a topic I would like to understand:
> - Do we actually need this to be a block_size to assure correctness? or is 
>   it just about efficiency?

What value is there in having a chunk size larger than the largest
possible block mapping? The whole UAPI is deliberately tied up with page
table geometry.

Overall, I'm not buying the argument for changing the behavior of
KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE. There are very good reasons for
*not* eagerly splitting the entire address space, especially if you know
the working set of the VM is small.

You can still use HDBSS without eagerly splitting, so long as block
mappings are {DBM, S2AP_W} = {0, 0} and leaf mappings (which have
a writable PFN) are {1, 0}.

> > [Severity: Critical]
> > This is a pre-existing issue, but by enabling eager hugepage splitting by
> > default for HDBSS, does this exacerbate a race condition with nested guests
> > bypassing dirty logging?
> > 
> > In arch/arm64/kvm/mmu.c:kvm_arch_mmu_enable_log_dirty_pt_masked(), KVM enables
> > dirty logging by write-protecting the primary stage-2 and then calling
> > kvm_mmu_split_huge_pages():
> > 
> >     if (kvm_dirty_log_manual_protect_and_init_set(kvm))
> >         kvm_mmu_split_huge_pages(kvm, start, end);
> > 
> >     kvm_nested_s2_wp(kvm);
> > 
> > Because kvm_mmu_split_huge_pages() can drop the mmu_lock to top-up the memory
> > cache or reschedule, a race window is created before kvm_nested_s2_wp()
> > protects the nested stage-2 tables.
> > 
> > While the lock is dropped, a nested VCPU can fault, map the page via its
> > separate shadow stage-2 (which hasn't been write-protected yet), and write to
> > the page without setting the dirty bitmap, leading to silent memory corruption
> > after live migration.
> > 
> 
> But if it's already write-protected in S2, if the nested VCPU faults 
> something, would not that be handled by non-nested faulting routine?
> 
> I understand very little of nested, and I can't understand how does that 
> work properly. Maybe Marc or Wei Lin may have a better answer for that.

The only time this would be an issue is if we allowed userspace to
observe a clean dirty bitmap with a stale, writable shadow stage-2. All
roads go through the slots_lock, so I don't believe this is actually an
issue.

The order of events Sashiko mentions makes no sense. If the shadow
stage-2 translation is already writable there would be no intervening
fault. Since the dirty bitmap is already cleared for these GFNs, a
racing write permission fault would actually set the dirty bitmap.

With that being said, I'd be happy with kvm_nested_s2_wp() getting moved
up earlier in the function anyway as it is easy to forget that the
mmu_lock could be dropped in between.

Thanks,
Oliver

  reply	other threads:[~2026-06-29 17:06 UTC|newest]

Thread overview: 47+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-29 11:17 [PATCH v2 00/13] KVM Dirty-bit cleaning hw accelerator (HACDBS) Leonardo Bras
2026-06-29 11:17 ` [PATCH v2 01/13] KVM: arm64: HDBSS bits Leonardo Bras
2026-06-29 11:34   ` sashiko-bot
2026-06-29 12:57     ` Leonardo Bras
2026-06-29 11:17 ` [PATCH v2 02/13] KVM: arm64: Enable eager hugepage splitting if HDBSS is available Leonardo Bras
2026-06-29 11:36   ` sashiko-bot
2026-06-29 14:47     ` Leonardo Bras
2026-06-29 17:06       ` Oliver Upton [this message]
2026-06-30 12:58         ` Leonardo Bras
2026-06-30 15:44           ` Oliver Upton
2026-06-30 17:09             ` Leonardo Bras
2026-06-30 18:43               ` Oliver Upton
2026-06-29 11:17 ` [PATCH v2 03/13] arm64/cpufeature: Add system-wide FEAT_HACDBS detection Leonardo Bras
2026-06-29 11:17 ` [PATCH v2 04/13] arm64/sysreg: Add HACDBS consumer and base registers Leonardo Bras
2026-06-29 11:17 ` [PATCH v2 05/13] KVM: arm64: Detect (via ACPI) and initialize HACDBSIRQ Leonardo Bras
2026-06-29 11:32   ` sashiko-bot
2026-06-29 15:43     ` Leonardo Bras
2026-06-29 16:52       ` Vladimir Murzin
2026-06-30 14:52         ` Leonardo Bras
2026-06-29 17:22   ` Oliver Upton
2026-06-30 14:50     ` Leonardo Bras
2026-06-30 16:03       ` Oliver Upton
2026-06-30 17:19         ` Leonardo Bras
2026-06-29 11:17 ` [PATCH v2 06/13] KVM: arm64: dirty_bit: Add base FEAT_HACDBS cleaning routine Leonardo Bras
2026-06-29 11:29   ` sashiko-bot
2026-06-29 15:54     ` Leonardo Bras
2026-06-29 17:36   ` Oliver Upton
2026-06-30 14:59     ` Leonardo Bras
2026-06-30 19:06       ` Oliver Upton
2026-06-29 11:17 ` [PATCH v2 07/13] kvm: Add arch-generic interface for hw-accelerated dirty-bitmap cleaning Leonardo Bras
2026-06-29 11:38   ` sashiko-bot
2026-06-29 16:07     ` Leonardo Bras
2026-06-29 11:17 ` [PATCH v2 08/13] KVM: arm64: Add hardware-accelerated dirty-bitmap cleaning routine Leonardo Bras
2026-06-29 11:45   ` sashiko-bot
2026-06-29 16:49     ` Leonardo Bras
2026-06-29 11:17 ` [PATCH v2 09/13] KVM: arm64: Dirty-bitmap: avoid splitting previously split blocks Leonardo Bras
2026-06-29 11:39   ` sashiko-bot
2026-06-29 17:07     ` Leonardo Bras
2026-06-29 11:17 ` [PATCH v2 10/13] kvm/dirty_ring: Introduce get_memslot and move helpers to header Leonardo Bras
2026-06-29 11:17 ` [PATCH v2 11/13] kvm/dirty_ring: Add arch-generic interface for hw-accelerated dirty-ring cleaning Leonardo Bras
2026-06-29 11:49   ` sashiko-bot
2026-06-29 17:09     ` Leonardo Bras
2026-06-29 11:18 ` [PATCH v2 12/13] KVM: arm64: Add hardware-accelerated dirty-ring cleaning routine Leonardo Bras
2026-06-29 11:49   ` sashiko-bot
2026-06-29 17:26     ` Leonardo Bras
2026-06-29 11:18 ` [PATCH v2 13/13] KVM: arm64: Enable KVM_HW_DIRTY_BIT Leonardo Bras
2026-06-29 11:52   ` sashiko-bot

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=akKmHqRaZEUjN3zY@kernel.org \
    --to=oupton@kernel.org \
    --cc=kvm@vger.kernel.org \
    --cc=kvmarm@lists.linux.dev \
    --cc=leo.bras@arm.com \
    --cc=maz@kernel.org \
    --cc=sashiko-reviews@lists.linux.dev \
    --cc=weilin.chang@arm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox