Date: Wed, 11 Feb 2026 07:34:22 -0800
From: Sean Christopherson
To: Sebastian Andrzej Siewior
Cc: "shaikh.kamal", kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-rt-devel@lists.linux.dev
Subject: Re: [PATCH] KVM: mmu_notifier: make mn_invalidate_lock non-sleeping for non-blocking invalidations
In-Reply-To: <20260211120944.-eZhmdo7@linutronix.de>
References: <20260209161527.31978-1-shaikhkamal2012@gmail.com> <20260211120944.-eZhmdo7@linutronix.de>

On Wed, Feb 11, 2026, Sebastian Andrzej Siewior wrote:
> On 2026-02-09 21:45:27 [+0530], shaikh.kamal wrote:
> > mmu_notifier_invalidate_range_start() may be invoked via
> > mmu_notifier_invalidate_range_start_nonblock(), e.g. from oom_reaper(),
> > where sleeping is explicitly forbidden.
> > 
> > KVM's mmu_notifier invalidate_range_start currently takes
> > mn_invalidate_lock using spin_lock().  On PREEMPT_RT, spin_lock() maps
> > to rt_mutex and may sleep, triggering:
> > 
> >   BUG: sleeping function called from invalid context
> > 
> > This violates the MMU notifier contract regardless of PREEMPT_RT;

I highly doubt that.  kvm.mmu_lock is also a spinlock, and KVM has been taking
that in invalidate_range_start() since e930bffe95e1 ("KVM: Synchronize guest
physical memory map to host virtual memory map"), which was a full decade before
mmu_notifiers even added the blockable concept in 93065ac753e4 ("mm, oom:
distinguish blockable mode for mmu notifiers"), and which even predates the
current concept of a "raw" spinlock introduced by c2f21ce2e312 ("locking:
Implement new raw_spinlock").

> > RT kernels merely make the issue deterministic.

No, RT kernels change the rules, because suddenly a non-sleeping lock becomes
sleepable.

> > Fix by converting mn_invalidate_lock to a raw spinlock so that
> > invalidate_range_start() remains non-sleeping while preserving the
> > existing serialization between invalidate_range_start() and
> > invalidate_range_end().

This is insufficient.  To actually "fix" this in KVM, mmu_lock would need to be
turned into a raw lock on all KVM architectures.

I suspect the only reason there haven't been bug reports is because no one trips
an OOM kill on a VM while running with CONFIG_DEBUG_ATOMIC_SLEEP=y.  That
combination is required because, since commit 8931a454aea0 ("KVM: Take mmu_lock
when handling MMU notifier iff the hva hits a memslot"), KVM only acquires
mmu_lock if the to-be-invalidated range overlaps a memslot, i.e. affects memory
that may be mapped into the guest.
E.g. this hack to simulate a non-blockable invalidation

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 7015edce5bd8..7a35a83420ec 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -739,7 +739,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 		.handler	= kvm_mmu_unmap_gfn_range,
 		.on_lock	= kvm_mmu_invalidate_begin,
 		.flush_on_ret	= true,
-		.may_block	= mmu_notifier_range_blockable(range),
+		.may_block	= false,//mmu_notifier_range_blockable(range),
 	};
 
 	trace_kvm_unmap_hva_range(range->start, range->end);
@@ -768,6 +768,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 	 */
 	gfn_to_pfn_cache_invalidate_start(kvm, range->start, range->end);
 
+	non_block_start();
 	/*
 	 * If one or more memslots were found and thus zapped, notify arch code
 	 * that guest memory has been reclaimed.  This needs to be done *after*
@@ -775,6 +776,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 	 */
 	if (kvm_handle_hva_range(kvm, &hva_range).found_memslot)
 		kvm_arch_guest_memory_reclaimed(kvm);
+	non_block_end();
 
 	return 0;
 }

immediately triggers

  BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:241
  in_atomic(): 0, irqs_disabled(): 0, non_block: 1, pid: 4992, name: qemu
  preempt_count: 0, expected: 0
  RCU nest depth: 0, expected: 0
  CPU: 6 UID: 1000 PID: 4992 Comm: qemu Not tainted 6.19.0-rc6-4d0917ffc392-x86_enter_mmio_stack_uaf_no_null-rt #1 PREEMPT_RT
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  Call Trace:
   dump_stack_lvl+0x51/0x60
   __might_resched+0x10e/0x160
   rt_write_lock+0x49/0x310
   kvm_mmu_notifier_invalidate_range_start+0x10b/0x390 [kvm]
   __mmu_notifier_invalidate_range_start+0x9b/0x230
   do_wp_page+0xce1/0xf30
   __handle_mm_fault+0x380/0x3a0
   handle_mm_fault+0xde/0x290
   __get_user_pages+0x20d/0xbe0
   get_user_pages_unlocked+0xf6/0x340
   hva_to_pfn+0x295/0x420 [kvm]
   __kvm_faultin_pfn+0x5d/0x90 [kvm]
   kvm_mmu_faultin_pfn+0x31b/0x6e0 [kvm]
   kvm_tdp_page_fault+0xb6/0x160 [kvm]
   kvm_mmu_do_page_fault+0xee/0x1f0 [kvm]
   kvm_mmu_page_fault+0x8d/0x600 [kvm]
   vmx_handle_exit+0x18c/0x5a0 [kvm_intel]
   kvm_arch_vcpu_ioctl_run+0xc70/0x1c90 [kvm]
   kvm_vcpu_ioctl+0x2d7/0x9a0 [kvm]
   __x64_sys_ioctl+0x8a/0xd0
   do_syscall_64+0x5e/0x11b0
   entry_SYSCALL_64_after_hwframe+0x4b/0x53
  kvm: emulating exchange as write

It's not at all clear to me that switching mmu_lock to a raw lock would be a net
positive for PREEMPT_RT.  OOM-killing a KVM guest on a PREEMPT_RT kernel seems
like a comically rare scenario, whereas contending mmu_lock in normal operation
is relatively common (assuming there are even use cases for running VMs with a
PREEMPT_RT host kernel).

In fact, the only reason the splat happens is because mmu_notifiers somewhat
artificially force an atomic context via non_block_start(), since commit
ba170f76b69d ("mm, notifier: Catch sleeping/blocking for !blockable").

Given the massive amount of churn in KVM that would be required to fully
eliminate the splat, and that it's not at all obvious that it would be a good
change overall, at least for now: NAK.

I'm not fundamentally opposed to such a change, but there needs to be a _lot_
more analysis and justification beyond "fix CONFIG_DEBUG_ATOMIC_SLEEP=y".
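Side note, for anyone wondering where the forced atomic context actually comes
from: it's the mmu_notifier core, not KVM.  Roughly (a paraphrased sketch of
what mm/mmu_notifier.c has done since ba170f76b69d, not the verbatim upstream
code):

	/*
	 * Sketch: for a !blockable invalidation the core flags the task as
	 * non-blocking around the notifier callback, so any might_sleep()
	 * reached from the callback (e.g. an RT-sleeping lock) splats when
	 * CONFIG_DEBUG_ATOMIC_SLEEP=y.
	 */
	if (!mmu_notifier_range_blockable(range))
		non_block_start();
	ret = subscription->ops->invalidate_range_start(subscription, range);
	if (!mmu_notifier_range_blockable(range))
		non_block_end();

non_block_start() just bumps current->non_block_count (that's the "non_block: 1"
in the splat above); there is no hard atomic section, which is why nothing fires
without CONFIG_DEBUG_ATOMIC_SLEEP=y.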
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 5fcd401a5897..7a9c33f01a37 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -747,9 +747,9 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> >  	 *
> >  	 * Pairs with the decrement in range_end().
> >  	 */
> > -	spin_lock(&kvm->mn_invalidate_lock);
> > +	raw_spin_lock(&kvm->mn_invalidate_lock);
> >  	kvm->mn_active_invalidate_count++;
> > -	spin_unlock(&kvm->mn_invalidate_lock);
> > +	raw_spin_unlock(&kvm->mn_invalidate_lock);
> 
> 	atomic_inc(mn_active_invalidate_count)
> 
> >  
> >  	/*
> >  	 * Invalidate pfn caches _before_ invalidating the secondary MMUs, i.e.
> > @@ -817,11 +817,11 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
> >  	kvm_handle_hva_range(kvm, &hva_range);
> >  
> >  	/* Pairs with the increment in range_start(). */
> > -	spin_lock(&kvm->mn_invalidate_lock);
> > +	raw_spin_lock(&kvm->mn_invalidate_lock);
> >  	if (!WARN_ON_ONCE(!kvm->mn_active_invalidate_count))
> >  		--kvm->mn_active_invalidate_count;
> >  	wake = !kvm->mn_active_invalidate_count;
> 
> 	wake = atomic_dec_return_safe(mn_active_invalidate_count);
> 	WARN_ON_ONCE(wake < 0);
> 	wake = !wake;
> 
> > -	spin_unlock(&kvm->mn_invalidate_lock);
> > +	raw_spin_unlock(&kvm->mn_invalidate_lock);
> >  
> >  	/*
> >  	 * There can only be one waiter, since the wait happens under
> > @@ -1129,7 +1129,7 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
> > @@ -1635,17 +1635,17 @@ static void kvm_swap_active_memslots(struct kvm *kvm, int as_id)
> >  	 * progress, otherwise the locking in invalidate_range_start and
> >  	 * invalidate_range_end will be unbalanced.
> >  	 */
> > -	spin_lock(&kvm->mn_invalidate_lock);
> > +	raw_spin_lock(&kvm->mn_invalidate_lock);
> >  	prepare_to_rcuwait(&kvm->mn_memslots_update_rcuwait);
> >  	while (kvm->mn_active_invalidate_count) {
> >  		set_current_state(TASK_UNINTERRUPTIBLE);
> > -		spin_unlock(&kvm->mn_invalidate_lock);
> > +		raw_spin_unlock(&kvm->mn_invalidate_lock);
> >  		schedule();
> 
> And this I don't understand.  The lock protects the rcuwait assignment
> which would be needed if multiple waiters are possible.  But this goes
> away after the unlock and schedule() here.  So these things could be
> moved outside of the locked section which limits it only to the
> mn_active_invalidate_count value.

The implementation is essentially a deliberately unfair rwsem.  The "write" side
in kvm_swap_active_memslots() subtly protects this code:

	rcu_assign_pointer(kvm->memslots[as_id], slots);

and the "read" side protects the kvm->memslot lookups in kvm_handle_hva_range().

KVM optimizes its mmu_notifier invalidation path to only take action if the
to-be-invalidated range overlaps one or more memslots, i.e. affects memory that
can be mapped into the guest.  The wrinkle with those optimizations is that KVM
needs to prevent changes to the memslots between invalidation start() and end(),
otherwise the accounting can become imbalanced, e.g. mmu_invalidate_in_progress
will underflow or be left elevated and essentially hang the VM (among other bad
things).

So simply making mn_active_invalidate_count an atomic won't suffice, because KVM
needs to block start() to ensure that start()+end() see the exact same set of
memslots.
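To spell out the "unfair rwsem" shape (purely an illustrative condensation of
the existing kvm_main.c pattern, with made-up helper names, not proposed code):
start()/end() are the readers, kvm_swap_active_memslots() is the lone writer,
and mn_invalidate_lock is what lets the writer drain the in-flight count and
then publish the new memslots without racing a new start().

	/* "read lock", what invalidate_range_start() does */
	static void mn_invalidate_begin(struct kvm *kvm)
	{
		spin_lock(&kvm->mn_invalidate_lock);
		kvm->mn_active_invalidate_count++;
		spin_unlock(&kvm->mn_invalidate_lock);
	}

	/* "read unlock", what invalidate_range_end() does */
	static void mn_invalidate_end(struct kvm *kvm)
	{
		bool wake;

		spin_lock(&kvm->mn_invalidate_lock);
		wake = !--kvm->mn_active_invalidate_count;
		spin_unlock(&kvm->mn_invalidate_lock);

		/* Only one possible waiter, as the wait is under slots_lock. */
		if (wake)
			rcuwait_wake_up(&kvm->mn_memslots_update_rcuwait);
	}

	/* "write lock + publish", what kvm_swap_active_memslots() does */
	static void mn_swap_memslots(struct kvm *kvm, int as_id,
				     struct kvm_memslots *slots)
	{
		spin_lock(&kvm->mn_invalidate_lock);
		prepare_to_rcuwait(&kvm->mn_memslots_update_rcuwait);
		while (kvm->mn_active_invalidate_count) {
			set_current_state(TASK_UNINTERRUPTIBLE);
			spin_unlock(&kvm->mn_invalidate_lock);
			schedule();
			spin_lock(&kvm->mn_invalidate_lock);
		}
		finish_rcuwait(&kvm->mn_memslots_update_rcuwait);
		/* Publish under the lock so a new start() can't sneak in. */
		rcu_assign_pointer(kvm->memslots[as_id], slots);
		spin_unlock(&kvm->mn_invalidate_lock);
	}

An atomic count would keep the readers' bookkeeping intact, but it gives the
writer nothing to hold a new start() off while the swap is published, and that
is the piece that guarantees start()+end() operate on a single memslots
generation.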