From: Paolo Bonzini <pbonzini@redhat.com>
To: shaikh kamaluddin <shaikhkamal2012@gmail.com>,
Sean Christopherson <seanjc@google.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>,
kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-rt-devel@lists.linux.dev
Subject: Re: [PATCH] KVM: mmu_notifier: make mn_invalidate_lock non-sleeping for non-blocking invalidations
Date: Fri, 6 Mar 2026 19:14:40 +0100
Message-ID: <ae2143b3-ce28-4c87-afcf-1505694246d8@redhat.com>
In-Reply-To: <aactOOfirdVRYfNS@acer-nitro-anv15-41>
On 3/3/26 19:49, shaikh kamaluddin wrote:
> On Wed, Feb 11, 2026 at 07:34:22AM -0800, Sean Christopherson wrote:
>> On Wed, Feb 11, 2026, Sebastian Andrzej Siewior wrote:
>>> On 2026-02-09 21:45:27 [+0530], shaikh.kamal wrote:
>>>> mmu_notifier_invalidate_range_start() may be invoked via
>>>> mmu_notifier_invalidate_range_start_nonblock(), e.g. from oom_reaper(),
>>>> where sleeping is explicitly forbidden.
>>>>
>>>> KVM's mmu_notifier invalidate_range_start currently takes
>>>> mn_invalidate_lock using spin_lock(). On PREEMPT_RT, spin_lock() maps
>>>> to rt_mutex and may sleep, triggering:
>>>>
>>>> BUG: sleeping function called from invalid context
>>>>
>>>> This violates the MMU notifier contract regardless of PREEMPT_RT;
>>
>> I highly doubt that. kvm.mmu_lock is also a spinlock, and KVM has been taking
>> that in invalidate_range_start() since
>>
>> e930bffe95e1 ("KVM: Synchronize guest physical memory map to host virtual memory map")
>>
>> which was a full decade before mmu_notifiers even added the blockable concept in
>>
>> 93065ac753e4 ("mm, oom: distinguish blockable mode for mmu notifiers")
>>
>> and even predates the current concept of a "raw" spinlock introduced by
>>
>> c2f21ce2e312 ("locking: Implement new raw_spinlock")
>>
>>>> RT kernels merely make the issue deterministic.
>>
>> No, RT kernels change the rules, because suddenly a non-sleeping lock becomes
>> sleepable.
>>
>>>> Fix by converting mn_invalidate_lock to a raw spinlock so that
>>>> invalidate_range_start() remains non-sleeping while preserving the
>>>> existing serialization between invalidate_range_start() and
>>>> invalidate_range_end().
>>
>> This is insufficient. To actually "fix" this in KVM, mmu_lock would need to be
>> turned into a raw lock on all KVM architectures. I suspect the only reason there
>> haven't been bug reports is because no one trips an OOM kill on a VM while running
>> with CONFIG_DEBUG_ATOMIC_SLEEP=y.
>>
>> That combination is required because since commit
>>
>> 8931a454aea0 ("KVM: Take mmu_lock when handling MMU notifier iff the hva hits a memslot")
>>
>> KVM only acquires mmu_lock if the to-be-invalidated range overlaps a memslot,
>> i.e. affects memory that may be mapped into the guest.
>>
>> E.g. this hack to simulate a non-blockable invalidation
>>
>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>> index 7015edce5bd8..7a35a83420ec 100644
>> --- a/virt/kvm/kvm_main.c
>> +++ b/virt/kvm/kvm_main.c
>> @@ -739,7 +739,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>> .handler = kvm_mmu_unmap_gfn_range,
>> .on_lock = kvm_mmu_invalidate_begin,
>> .flush_on_ret = true,
>> - .may_block = mmu_notifier_range_blockable(range),
>> + .may_block = false,//mmu_notifier_range_blockable(range),
>> };
>>
>> trace_kvm_unmap_hva_range(range->start, range->end);
>> @@ -768,6 +768,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>> */
>> gfn_to_pfn_cache_invalidate_start(kvm, range->start, range->end);
>>
>> + non_block_start();
>> /*
>> * If one or more memslots were found and thus zapped, notify arch code
>> * that guest memory has been reclaimed. This needs to be done *after*
>> @@ -775,6 +776,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>> */
>> if (kvm_handle_hva_range(kvm, &hva_range).found_memslot)
>> kvm_arch_guest_memory_reclaimed(kvm);
>> + non_block_end();
>>
>> return 0;
>> }
>>
>> immediately triggers
>>
>> BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:241
>> in_atomic(): 0, irqs_disabled(): 0, non_block: 1, pid: 4992, name: qemu
>> preempt_count: 0, expected: 0
>> RCU nest depth: 0, expected: 0
>> CPU: 6 UID: 1000 PID: 4992 Comm: qemu Not tainted 6.19.0-rc6-4d0917ffc392-x86_enter_mmio_stack_uaf_no_null-rt #1 PREEMPT_RT
>> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
>> Call Trace:
>> <TASK>
>> dump_stack_lvl+0x51/0x60
>> __might_resched+0x10e/0x160
>> rt_write_lock+0x49/0x310
>> kvm_mmu_notifier_invalidate_range_start+0x10b/0x390 [kvm]
>> __mmu_notifier_invalidate_range_start+0x9b/0x230
>> do_wp_page+0xce1/0xf30
>> __handle_mm_fault+0x380/0x3a0
>> handle_mm_fault+0xde/0x290
>> __get_user_pages+0x20d/0xbe0
>> get_user_pages_unlocked+0xf6/0x340
>> hva_to_pfn+0x295/0x420 [kvm]
>> __kvm_faultin_pfn+0x5d/0x90 [kvm]
>> kvm_mmu_faultin_pfn+0x31b/0x6e0 [kvm]
>> kvm_tdp_page_fault+0xb6/0x160 [kvm]
>> kvm_mmu_do_page_fault+0xee/0x1f0 [kvm]
>> kvm_mmu_page_fault+0x8d/0x600 [kvm]
>> vmx_handle_exit+0x18c/0x5a0 [kvm_intel]
>> kvm_arch_vcpu_ioctl_run+0xc70/0x1c90 [kvm]
>> kvm_vcpu_ioctl+0x2d7/0x9a0 [kvm]
>> __x64_sys_ioctl+0x8a/0xd0
>> do_syscall_64+0x5e/0x11b0
>> entry_SYSCALL_64_after_hwframe+0x4b/0x53
>> </TASK>
>> kvm: emulating exchange as write
>>
>>
>> It's not at all clear to me that switching mmu_lock to a raw lock would be a net
>> positive for PREEMPT_RT. OOM-killing a KVM guest on a PREEMPT_RT kernel seems
>> like a comically rare scenario, whereas contending mmu_lock in normal operation
>> is relatively common (assuming there are even use cases for running VMs with a
>> PREEMPT_RT host kernel).
>>
>> In fact, the only reason the splat happens is because mmu_notifiers somewhat
>> artificially force an atomic context via non_block_start() since commit
>>
>> ba170f76b69d ("mm, notifier: Catch sleeping/blocking for !blockable")
>>
>> Given the massive amount of churn in KVM that would be required to fully eliminate
>> the splat, and that it's not at all obvious that it would be a good change overall,
>> at least for now:
>>
>> NAK
>>
>> I'm not fundamentally opposed to such a change, but there needs to be a _lot_
>> more analysis and justification beyond "fix CONFIG_DEBUG_ATOMIC_SLEEP=y".
>>
> Hi Sean,
> Thanks for the detailed explanation and for spelling out the broader
> issue.
> Understood on both points:
> 1. The changelog wording was too strong; PREEMPT_RT changes
>    spin_lock() semantics, and the splat is fundamentally due to
>    spinlocks becoming sleepable there.
> 2. Converting only mn_invalidate_lock to raw is insufficient,
>    since KVM can still take mmu_lock (and other locks that sleep
>    on RT) in invalidate_range_start() when the invalidation hits
>    a memslot.
> Given the above, it sounds like "convert locks to raw" is not the
> right direction without significant rework and justification.
> Would an acceptable direction be to handle the !blockable notifier
> case by deferring the heavyweight invalidation work (anything that
> takes mmu_lock or may sleep on RT) to a context that may block
> (e.g. queued work), while keeping start()/end() accounting
> consistent with memslot changes? If so, I can prototype a patch
> along those lines and share it for feedback.
>
> Alternatively, if you think this needs to be addressed in
> mmu_notifiers (e.g. how non_block_start() is applied), I'm happy to
> redirect my efforts there. Please advise.
Have you considered an "OOM entered" callback for MMU notifiers? KVM's
MMU notifier could just remove itself, for example; in fact there is
already code in kvm_destroy_vm() to cope with that even if
invalidations are unbalanced.
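
Purely as a sketch of the idea (the .oom_entered hook is hypothetical,
it does not exist in mmu_notifier_ops today, and whether unregistering
from OOM context is safe would need auditing):

    struct mmu_notifier_ops {
            ...
            /* Hypothetical: invoked once when the OOM killer selects
             * this mm, before any non-blockable invalidations are
             * issued against it. */
            void (*oom_entered)(struct mmu_notifier *mn,
                                struct mm_struct *mm);
    };

    static void kvm_mmu_notifier_oom_entered(struct mmu_notifier *mn,
                                             struct mm_struct *mm)
    {
            /* KVM already tolerates unbalanced invalidations at VM
             * teardown (see kvm_destroy_vm()), so detaching the
             * notifier here would lean on the same reasoning. */
            mmu_notifier_unregister(mn, mm);
    }

With the notifier gone, the non-blockable invalidate_range_start()
path is never entered for that VM, sidestepping the locking question
entirely.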
Paolo
Thread overview: 12+ messages
2026-02-09 16:15 [PATCH] KVM: mmu_notifier: make mn_invalidate_lock non-sleeping for non-blocking invalidations shaikh.kamal
2026-02-11 12:09 ` Sebastian Andrzej Siewior
2026-02-11 15:34 ` Sean Christopherson
2026-03-03 18:49 ` shaikh kamaluddin
2026-03-06 16:42 ` Sean Christopherson
2026-03-06 18:14 ` Paolo Bonzini [this message]
2026-03-12 19:24 ` shaikh kamaluddin
2026-03-14 7:47 ` Paolo Bonzini
2026-03-25 5:19 ` shaikh kamaluddin
2026-03-26 18:23 ` Paolo Bonzini
2026-03-28 14:50 ` shaikh kamaluddin
2026-03-30 11:24 ` Paolo Bonzini