Re: [RFC PATCH v2 1/6] KVM: gmem: Truncate pages on punch hole

From: Sean Christopherson <seanjc@google.com>
To: Michael Roth <michael.roth@amd.com>
Cc: isaku.yamahata@intel.com, kvm@vger.kernel.org,
	 linux-kernel@vger.kernel.org, isaku.yamahata@gmail.com,
	 Paolo Bonzini <pbonzini@redhat.com>,
	erdemaktas@google.com, Sagi Shahar <sagis@google.com>,
	 David Matlack <dmatlack@google.com>,
	Kai Huang <kai.huang@intel.com>,
	 Zhi Wang <zhi.wang.linux@gmail.com>,
	chen.bo@intel.com, linux-coco@lists.linux.dev,
	 Chao Peng <chao.p.peng@linux.intel.com>,
	Ackerley Tng <ackerleytng@google.com>,
	 Vishal Annapurve <vannapurve@google.com>,
	Yuan Yao <yuan.yao@linux.intel.com>,
	 Jarkko Sakkinen <jarkko@kernel.org>,
	Xu Yilun <yilun.xu@intel.com>,
	 Quentin Perret <qperret@google.com>,
	wei.w.wang@intel.com, Fuad Tabba <tabba@google.com>
Subject: Re: [RFC PATCH v2 1/6] KVM: gmem: Truncate pages on punch hole
Date: Thu, 5 Oct 2023 16:48:50 -0700
Message-ID: <ZR9LYhpxTaTk6PJX@google.com>
In-Reply-To: <20231005175238.7bb2zut4fb7ebdqc@amd.com>

On Thu, Oct 05, 2023, Michael Roth wrote:
> On Thu, Sep 21, 2023 at 02:34:46PM -0700, Sean Christopherson wrote:
> > On Thu, Sep 21, 2023, Sean Christopherson wrote:
> > > > diff --git a/virt/kvm/guest_mem.c b/virt/kvm/guest_mem.c
> > > > index a819367434e9..01fb4ca861d0 100644
> > > > --- a/virt/kvm/guest_mem.c
> > > > +++ b/virt/kvm/guest_mem.c
> > > > @@ -130,22 +130,32 @@ static void kvm_gmem_invalidate_end(struct kvm_gmem *gmem, pgoff_t start,
> > > >  static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
> > > >  {
> > > >  	struct list_head *gmem_list = &inode->i_mapping->private_list;
> > > > +	struct address_space *mapping  = inode->i_mapping;
> > > >  	pgoff_t start = offset >> PAGE_SHIFT;
> > > >  	pgoff_t end = (offset + len) >> PAGE_SHIFT;
> > > >  	struct kvm_gmem *gmem;
> > > >  
> > > > +	/*
> > > > +	 * punch hole may result in zeroing partial area.  As pages can be
> > > > +	 * encrypted, prohibit zeroing partial area.
> > > > +	 */
> > > > +	if (offset & ~PAGE_MASK || len & ~PAGE_MASK)
> > > > +		return -EINVAL;
> > > 
> > > This should be unnecessary, kvm_gmem_fallocate() does
> > > 
> > > 	if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> > > 		return -EINVAL;
> > > 
> > > before invoking kvm_gmem_punch_hole().  If that's not working, i.e. your test
> > > fails, then that code needs to be fixed.  I'll run your test to double-check,
> > > but AFAICT this is unnecessary.
> > 
> > I confirmed that the testcase passes without the extra checks.  Just to close the
> > loop, what prompted adding more checks to kvm_gmem_punch_hole()?
> 
> I don't know if it's the same issue that Isaku ran into, but for SNP we
> hit a similar issue with the truncate_inode_pages_range(lstart, lend) call.
> 
> The issue in that case was a bit more subtle:
> 
>   - userspace does a hole-punch on a 4K range of its gmem FD, which happens
>     to be backed by a 2MB folio.
>   - truncate_inode_pages_range() gets called for that 4K range
>   - truncate_inode_pages_range() does special handling on the folios at the
>     start/end of the range in case they are partial and passes these to
>     truncate_inode_partial_folio(folio, lstart, lend). In this case, there's
>     just the 1 backing folio. But it *still* gets the special treatment, and
>     so gets passed to truncate_inode_partial_folio().
>   - truncate_inode_partial_folio() will then zero that 4K range, even though
>     it is page-aligned, based on the following rationale in the comments:
> 
>         /*
>          * We may be zeroing pages we're about to discard, but it avoids
>          * doing a complex calculation here, and then doing the zeroing
>          * anyway if the page split fails.
>          */
>         folio_zero_range(folio, offset, length);
> 
>   - after that, the .invalidate_folio callback is issued, then the folio is
>     split, and the caller (truncate_inode_pages_range()) does another pass
>     through the whole range and can free the now-split folio, at which point
>     the .free_folio callbacks are issued.
> 
> Because of that, we can't rely on .invalidate_folio/.free_folio to handle
> putting the page back into a normal host-accessible state, because the
> zero'ing will happen beforehand.

Argh, and that causes an RMP violation #PF.

FWIW, I don't *think* zeroing would be problematic for TDX.  The page would get
poisoned, but KVM would re-zero the memory with MOVDIR64B and flush the cache.

> That's why we ended up needing to do this for SNP patches to make sure
> arch-specific invalidation callbacks are issued before the truncation occurs:
> 
>   https://github.com/mdroth/linux/commit/4ebcc04b84dd691fc6daccb9b7438402520b0704#diff-77306411fdaeb7f322a1ca756dead9feb75363aa6117b703ac118576153ddb37R233
> 
> I'd planned to post those as a separate RFC to discuss, but when I came across
> this it seemed like it might be relevant to what the TDX folks might have run
> into here.
> 
> If not for the zero'ing logic mentioned above, for SNP at least, the
> .free_folio() ends up working pretty nicely for both truncation and fput(),
> and even plays nicely with live update use-case where the destination gmem
> instance shares the inode->i_mapping, since iput() won't trigger the
> truncate_inode_pages_final() until the last reference goes away so we don't
> have to do anything special in kvm_gmem_release() to determine when we
> should/shouldn't issue the arch-invalidations to clean up things like the
> RMP table.
> 
> It seems like the above zero'ing logic could be reworked to only zero non-page
> aligned ranges (as the comments above truncate_inode_pages_range() claim
> should be the case), which would avoid the issue for the gmem use-case. But I
> wonder if some explicit "dont-zero-these-pages" flag might be more robust.
> 
> Or maybe there's some other way we should be going about this?

Skipping the write seems like the obvious solution.  An address_space flag,
e.g. AS_INACCESSIBLE, would be the easiest thing to implement.  Or maybe even
make it AS_DONT_ZERO_ON_TRUNCATE_DAMMIT (mostly joking).

Or a hook in address_space_operations to zero the folio, which conceptually is
better in many ways, but feels like overkill.
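
For illustration only, here is a minimal user-space sketch of the flag-based
idea, assuming the zeroing step consults a per-mapping flag before writing.
It is not kernel code: every model_* name and MODEL_AS_INACCESSIBLE are
hypothetical stand-ins, and a plain memset() stands in for the zeroing that
truncate_inode_partial_folio() does via folio_zero_range().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096UL

/* Hypothetical stand-in for an address_space flag such as AS_INACCESSIBLE. */
#define MODEL_AS_INACCESSIBLE (1UL << 0)

/* Toy model of a mapping: only the flags word matters here. */
struct model_mapping {
	unsigned long flags;
};

/* Toy model of a folio: a byte buffer plus a pointer to its mapping. */
struct model_folio {
	struct model_mapping *mapping;
	unsigned char *data;
	size_t size;
};

/*
 * Models the zeroing step of a partial-folio truncate.  Returns true if
 * the range was zeroed, false if the write was skipped because the
 * mapping is flagged as inaccessible to the host.
 */
bool model_truncate_partial_folio(struct model_folio *folio,
				  size_t offset, size_t len)
{
	/* The range must lie entirely within the folio. */
	assert(offset <= folio->size && len <= folio->size - offset);

	/* Skip the write entirely for host-inaccessible memory. */
	if (folio->mapping->flags & MODEL_AS_INACCESSIBLE)
		return false;

	memset(folio->data + offset, 0, len);
	return true;
}
```

With the flag set, a page-aligned punch of one MODEL_PAGE_SIZE range inside a
larger folio leaves the buffer untouched; with the flag clear, only that range
is zeroed.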
