Re: some likely bugs in IOMMUv2 (in tlb_finish_mmu() nested flush and mremap())

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Sean Christopherson <seanjc@google.com>
To: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jann Horn <jannh@google.com>, Will Deacon <will@kernel.org>,
	Joerg Roedel <jroedel@suse.de>,
	jean-philippe.brucker@arm.com, Linux-MM <linux-mm@kvack.org>,
	kernel list <linux-kernel@vger.kernel.org>,
	iommu@lists.linux.dev
Subject: Re: some likely bugs in IOMMUv2 (in tlb_finish_mmu() nested flush and mremap())
Date: Mon, 26 Sep 2022 20:13:00 +0000	[thread overview]
Message-ID: <YzIHzIxknGNba6CC@google.com> (raw)
In-Reply-To: <Yy3skVk/DvwVnPXD@nvidia.com>

On Fri, Sep 23, 2022, Jason Gunthorpe wrote:
> On Fri, Sep 23, 2022 at 05:38:12PM +0200, Jann Horn wrote:
> > Hi!
> > 
> > I looked through some of the code related to IOMMUv2 (the thing where
> > the IOMMU walks the normal userspace page tables and TLB shootdowns
> > are replicated to the IOMMU through
> > mmu_notifier_ops::invalidate_range).
> > 
> > I think there's a bug in the interaction between tlb_finish_mmu() and
> > mmu_notifier_ops::invalidate_range: In the mm_tlb_flush_nested() case,
> > __tlb_reset_range() sets tlb->start and tlb->end *both* to ~0.
> > Afterwards, tlb_finish_mmu() calls
> > tlb_flush_mmu()->tlb_flush_mmu_tlbonly()->mmu_notifier_invalidate_range(),
> > which will pass those tlb->start and tlb->end values to
> > mmu_notifier_ops::invalidate_range callbacks. But those callbacks
> > don't know about this special case and then basically only flush
> > virtual address ~0, making the flush useless. 
> 
> Yeah, that looks wrong to me, and it extends more than just the iommu
> drivers kvm_arch_mmu_notifier_invalidate_range() also does not handle
> this coding.

FWIW, the bug is likely benign for KVM.  KVM does almost all of its TLB flushing
via invalidate_range_{start,end}(), the invalidate_range() hook is used only by
x86/VMX to react to a specific KVM-allocated page being migrated (the page is only
ever unmapped when the VM is dying).

> Most likely tlb_flush_mmu_tlbonly() need to change it back to 0 to ~0?
> I wonder why it uses such an odd coding in the first place?
> 
> Actually, maybe having mm_tlb_flush_nested() call __tlb_reset_range()
> to generate a 'flush all' request is just a bad idea, as we already
> had another bug in 7a30df49f63ad92 related to reset_range doing the
> wrong thing for a flush all action.
> 
> > (However, pretty much every place that calls tlb_finish_mmu() first
> > calls mmu_notifier_invalidate_range_end() even though the
> > appropriate thing would probably be
> > mmu_notifier_invalidate_range_only_end(); and I think those two
> > things probably cancel each other out?)
> 
> That does sound like double flushing to me, though as you note below,
> the invalidate_range() triggered by range_end() after the TLB
> flush/page freeing is functionally incorrect, so we cannot rely on it.
> 
> > Also, from what I can tell, the mremap() code, in move_page_tables(),
> > only invokes mmu_notifier_ops::invalidate_range via the
> > mmu_notifier_invalidate_range_end() at the very end, long after TLB
> > flushes must have happened - sort of like the bug we had years ago
> > where mremap() was flushing the normal TLBs too late
> > (https://bugs.chromium.org/p/project-zero/issues/detail?id=1695).
> 
> Based on the description of eb66ae03082960 I would say that yes the
> invalidate_range op is missing here for the same reasons the CPU flush
> was missing.
> 
> AFAIK if we are flushing the CPU tlb then we really must also flush
> the CPU tlb that KVM controls, and that is primarily what
> invalidate_range() is used for.

As above, for its actual secondary MMU, KVM invalidates and flushes at
invalidate_range_start(), and then prevents vCPUs from creating new entries for
the range until invalidate_range_start_end().

The VMX use case is for a physical address that is consumed by hardware without
going through the secondary page tables; using the start/end hooks would be slightly
annoying due to the need to stall the vCPU until end, and so KVM uses invalidate_range()
for that one specific case.

> Which makes me wonder if the invalidate_range() hidden inside
> invalidate_end() is a bad idea in general - when is this need and
> would be correct? Isn't it better to put the invalidates near the TLB
> invalidates and leave start/end as purely a bracketing API, which by
> definition, cannot have an end that is 'too late'?

Documentation/mm/mmu_notifier.rst explains this, although even that is quite subtle.
The argument is that if the change is purely to downgrade protections, then
deferring invalidate_range() is ok because the only requirement is that secondary
MMUs invalidate before the "end" of the sequence.

  When changing a pte to write protect or to point to a new write protected page  
  with same content (KSM) it is fine to delay the mmu_notifier_invalidate_range   
  call to mmu_notifier_invalidate_range_end() outside the page table lock. This   
  is true even if the thread doing the page table update is preempted right after 
  releasing page table lock but before call mmu_notifier_invalidate_range_end().

That said, I also dislike hiding invalidate_range() inside end(), I constantly
forget about that behavior.  To address that, what about renaming
mmu_notifier_invalidate_range_end() to make it more explicit, e.g.
mmu_notifier_invalidate_range_and_end().

next prev parent reply	other threads:[~2022-09-26 20:13 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-09-23 15:38 some likely bugs in IOMMUv2 (in tlb_finish_mmu() nested flush and mremap()) Jann Horn
2022-09-23 17:27 ` Jason Gunthorpe
2022-09-26 20:13   ` Sean Christopherson [this message]
2022-09-26 23:32     ` Jason Gunthorpe
2022-09-27  0:24       ` Sean Christopherson
2022-09-28 18:12         ` Jason Gunthorpe
2022-09-30  2:31           ` John Hubbard
2022-09-30 12:01             ` Jason Gunthorpe

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YzIHzIxknGNba6CC@google.com \
    --to=seanjc@google.com \
    --cc=iommu@lists.linux.dev \
    --cc=jannh@google.com \
    --cc=jean-philippe.brucker@arm.com \
    --cc=jgg@nvidia.com \
    --cc=jroedel@suse.de \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=will@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.