From: Paolo Bonzini <pbonzini@redhat.com>
To: shaikh kamaluddin <shaikhkamal2012@gmail.com>,
Sean Christopherson <seanjc@google.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>,
kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-rt-devel@lists.linux.dev
Subject: Re: [PATCH] KVM: mmu_notifier: make mn_invalidate_lock non-sleeping for non-blocking invalidations
Date: Fri, 6 Mar 2026 19:14:40 +0100
Message-ID: <ae2143b3-ce28-4c87-afcf-1505694246d8@redhat.com>
In-Reply-To: <aactOOfirdVRYfNS@acer-nitro-anv15-41>
On 3/3/26 19:49, shaikh kamaluddin wrote:
> On Wed, Feb 11, 2026 at 07:34:22AM -0800, Sean Christopherson wrote:
>> On Wed, Feb 11, 2026, Sebastian Andrzej Siewior wrote:
>>> On 2026-02-09 21:45:27 [+0530], shaikh.kamal wrote:
>>>> mmu_notifier_invalidate_range_start() may be invoked via
>>>> mmu_notifier_invalidate_range_start_nonblock(), e.g. from oom_reaper(),
>>>> where sleeping is explicitly forbidden.
>>>>
>>>> KVM's mmu_notifier invalidate_range_start currently takes
>>>> mn_invalidate_lock using spin_lock(). On PREEMPT_RT, spin_lock() maps
>>>> to rt_mutex and may sleep, triggering:
>>>>
>>>> BUG: sleeping function called from invalid context
>>>>
>>>> This violates the MMU notifier contract regardless of PREEMPT_RT;
>>
>> I highly doubt that. kvm.mmu_lock is also a spinlock, and KVM has been taking
>> that in invalidate_range_start() since
>>
>> e930bffe95e1 ("KVM: Synchronize guest physical memory map to host virtual memory map")
>>
>> which was a full decade before mmu_notifiers even added the blockable concept in
>>
>> 93065ac753e4 ("mm, oom: distinguish blockable mode for mmu notifiers")
>>
>> and even predates the current concept of a "raw" spinlock introduced by
>>
>> c2f21ce2e312 ("locking: Implement new raw_spinlock")
>>
>>>> RT kernels merely make the issue deterministic.
>>
>> No, RT kernels change the rules, because suddenly a non-sleeping lock becomes
>> sleepable.
>>
>>>> Fix by converting mn_invalidate_lock to a raw spinlock so that
>>>> invalidate_range_start() remains non-sleeping while preserving the
>>>> existing serialization between invalidate_range_start() and
>>>> invalidate_range_end().
>>
>> This is insufficient. To actually "fix" this in KVM, mmu_lock would need to be
>> turned into a raw lock on all KVM architectures. I suspect the only reason there
>> haven't been bug reports is because no one trips an OOM kill on a VM while running
>> with CONFIG_DEBUG_ATOMIC_SLEEP=y.
>>
>> That combination is required because since commit
>>
>> 8931a454aea0 ("KVM: Take mmu_lock when handling MMU notifier iff the hva hits a memslot")
>>
>> KVM only acquires mmu_lock if the to-be-invalidated range overlaps a memslot,
>> i.e. affects memory that may be mapped into the guest.
>>
>> E.g. this hack to simulate a non-blockable invalidation
>>
>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>> index 7015edce5bd8..7a35a83420ec 100644
>> --- a/virt/kvm/kvm_main.c
>> +++ b/virt/kvm/kvm_main.c
>> @@ -739,7 +739,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>> .handler = kvm_mmu_unmap_gfn_range,
>> .on_lock = kvm_mmu_invalidate_begin,
>> .flush_on_ret = true,
>> - .may_block = mmu_notifier_range_blockable(range),
>> + .may_block = false,//mmu_notifier_range_blockable(range),
>> };
>>
>> trace_kvm_unmap_hva_range(range->start, range->end);
>> @@ -768,6 +768,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>> */
>> gfn_to_pfn_cache_invalidate_start(kvm, range->start, range->end);
>>
>> + non_block_start();
>> /*
>> * If one or more memslots were found and thus zapped, notify arch code
>> * that guest memory has been reclaimed. This needs to be done *after*
>> @@ -775,6 +776,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>> */
>> if (kvm_handle_hva_range(kvm, &hva_range).found_memslot)
>> kvm_arch_guest_memory_reclaimed(kvm);
>> + non_block_end();
>>
>> return 0;
>> }
>>
>> immediately triggers
>>
>> BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:241
>> in_atomic(): 0, irqs_disabled(): 0, non_block: 1, pid: 4992, name: qemu
>> preempt_count: 0, expected: 0
>> RCU nest depth: 0, expected: 0
>> CPU: 6 UID: 1000 PID: 4992 Comm: qemu Not tainted 6.19.0-rc6-4d0917ffc392-x86_enter_mmio_stack_uaf_no_null-rt #1 PREEMPT_RT
>> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
>> Call Trace:
>> <TASK>
>> dump_stack_lvl+0x51/0x60
>> __might_resched+0x10e/0x160
>> rt_write_lock+0x49/0x310
>> kvm_mmu_notifier_invalidate_range_start+0x10b/0x390 [kvm]
>> __mmu_notifier_invalidate_range_start+0x9b/0x230
>> do_wp_page+0xce1/0xf30
>> __handle_mm_fault+0x380/0x3a0
>> handle_mm_fault+0xde/0x290
>> __get_user_pages+0x20d/0xbe0
>> get_user_pages_unlocked+0xf6/0x340
>> hva_to_pfn+0x295/0x420 [kvm]
>> __kvm_faultin_pfn+0x5d/0x90 [kvm]
>> kvm_mmu_faultin_pfn+0x31b/0x6e0 [kvm]
>> kvm_tdp_page_fault+0xb6/0x160 [kvm]
>> kvm_mmu_do_page_fault+0xee/0x1f0 [kvm]
>> kvm_mmu_page_fault+0x8d/0x600 [kvm]
>> vmx_handle_exit+0x18c/0x5a0 [kvm_intel]
>> kvm_arch_vcpu_ioctl_run+0xc70/0x1c90 [kvm]
>> kvm_vcpu_ioctl+0x2d7/0x9a0 [kvm]
>> __x64_sys_ioctl+0x8a/0xd0
>> do_syscall_64+0x5e/0x11b0
>> entry_SYSCALL_64_after_hwframe+0x4b/0x53
>> </TASK>
>> kvm: emulating exchange as write
>>
>>
>> It's not at all clear to me that switching mmu_lock to a raw lock would be a net
>> positive for PREEMPT_RT. OOM-killing a KVM guest on a PREEMPT_RT kernel seems
>> like a comically rare scenario, whereas contending mmu_lock in normal operation
>> is relatively common (assuming there are even use cases for running VMs with a
>> PREEMPT_RT host kernel).
>>
>> In fact, the only reason the splat happens is because mmu_notifiers somewhat
>> artificially force an atomic context via non_block_start() since commit
>>
>> ba170f76b69d ("mm, notifier: Catch sleeping/blocking for !blockable")
>>
>> Given the massive amount of churn in KVM that would be required to fully eliminate
>> the splat, and that it's not at all obvious that it would be a good change overall,
>> at least for now:
>>
>> NAK
>>
>> I'm not fundamentally opposed to such a change, but there needs to be a _lot_
>> more analysis and justification beyond "fix CONFIG_DEBUG_ATOMIC_SLEEP=y".
>>
> Hi Sean,
> Thanks for the detailed explanation and for spelling out the broader
> issue.
> Understood on both points:
> 1. The changelog wording was too strong; PREEMPT_RT changes
>    spin_lock() semantics, and the splat is fundamentally due to
>    spinlocks becoming sleepable there.
> 2. Converting only mn_invalidate_lock to raw is insufficient,
>    since KVM can still take mmu_lock (and other locks that sleep
>    on RT) in invalidate_range_start() when the invalidation hits
>    a memslot.
> Given the above, it sounds like "convert locks to raw" is not the
> right direction without significant rework and justification.
> Would an acceptable direction be to handle the !blockable notifier
> case by deferring the heavyweight invalidation work (anything that
> takes mmu_lock or may sleep on RT) to a context that may block
> (e.g. queued work), while keeping start()/end() accounting
> consistent with memslot changes? If so, I can prototype a patch
> along those lines and share it for feedback.
>
> Alternatively, if you think this needs to be addressed in
> mmu_notifiers (e.g. how non_block_start() is applied), I'm happy to
> redirect my efforts there. Please advise.
Have you considered an "OOM entered" callback for MMU notifiers? KVM's
MMU notifier could just remove itself, for example; in fact there is
already code in kvm_destroy_vm() to cope with that even if
invalidations are unbalanced.
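
Purely as a sketch of the idea (the .oom_entered hook is hypothetical,
it does not exist in mmu_notifier_ops today, and whether unregistering
from OOM context is safe would need auditing):

    struct mmu_notifier_ops {
            ...
            /* Hypothetical: invoked once when the OOM killer selects
             * this mm, before any non-blockable invalidations are
             * issued against it. */
            void (*oom_entered)(struct mmu_notifier *mn,
                                struct mm_struct *mm);
    };

    static void kvm_mmu_notifier_oom_entered(struct mmu_notifier *mn,
                                             struct mm_struct *mm)
    {
            /* KVM already tolerates unbalanced invalidations at VM
             * teardown (see kvm_destroy_vm()), so detaching the
             * notifier here would lean on the same reasoning. */
            mmu_notifier_unregister(mn, mm);
    }

With the notifier gone, the non-blockable invalidate_range_start()
path is never entered for that VM, sidestepping the locking question
entirely.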
Paolo
Thread overview: 12+ messages
2026-02-09 16:15 [PATCH] KVM: mmu_notifier: make mn_invalidate_lock non-sleeping for non-blocking invalidations shaikh.kamal
2026-02-11 12:09 ` Sebastian Andrzej Siewior
2026-02-11 15:34 ` Sean Christopherson
2026-03-03 18:49 ` shaikh kamaluddin
2026-03-06 16:42 ` Sean Christopherson
2026-03-06 18:14 ` Paolo Bonzini [this message]
2026-03-12 19:24 ` shaikh kamaluddin
2026-03-14 7:47 ` Paolo Bonzini
2026-03-25 5:19 ` shaikh kamaluddin
2026-03-26 18:23 ` Paolo Bonzini
2026-03-28 14:50 ` shaikh kamaluddin
2026-03-30 11:24 ` Paolo Bonzini