From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail-yw1-f201.google.com (mail-yw1-f201.google.com [209.85.128.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 9D49741E57 for ; Thu, 5 Oct 2023 23:48:52 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--seanjc.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="nMY0XAkO" Received: by mail-yw1-f201.google.com with SMTP id 00721157ae682-59f7d109926so22275457b3.2 for ; Thu, 05 Oct 2023 16:48:52 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1696549731; x=1697154531; darn=lists.linux.dev; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=+lEPeKg/GP3Y80L6NB8ajXYC+n4V3di2jIUF82A5N1I=; b=nMY0XAkO1vrYBlz/3FakXY8TOf7Rt1Xnib12pzAnqHiVxMFmahsPxvCYn8GidYM/dR /pWVm08iFYYb1OSZsjfGm9qrXYTHeTV9TYP8T8jIc4wsnceInq46Cr96DnhPGeloxG5P 7xZypaC43gGqenUrh7wrqIxOcP7G28Qd2ZeSD/bsd2qUOOWkoRrSsGQfG3K+0ms2kp/H YtGRPgRCo8MtN+92IL4weh2/+b4pz/QVcF2PN7VThyHhe7pZUyBfBw3dNrIStywpsMWB tfoLSBjkmk2yAWrKtNHfhS6Z8S+fUmH7/J7pLLlMhDbT1UG+bvt4UqEQY7+a0OIV9zSU 2j1Q== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1696549731; x=1697154531; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=+lEPeKg/GP3Y80L6NB8ajXYC+n4V3di2jIUF82A5N1I=; b=bKeaD/AjPpNYibDxJAnEq3BfPkwDQmI/P2P+JJQ9Uriq+zJ6RrQP+5kDv/qODgw7XS QGZmtx9g1NNX/WE9pnAL28KknbJjFSiyOG+AMjQQhmszVOsOvGL6UL2C5JhpcHZUpGhz GR/lhMVw41PifzhYwLZQAtQLdPXkVMQeK/dlBsbcUL6yXr/8Z7ZEOU8ddreImIGJo8jE ZY+EMuvYi9QwAXoEDtZLNX8lYy2SxO2J78D+sUmBSmgaZsd9bgeyxviOF8LEX2nLiHfu L+q53z0WvzhpKWRD83xHBVfDpAmdF67X/5/5sRHGASz8B07DwkPoRqubju7VlgOhFjxC V5Lw== X-Gm-Message-State: AOJu0Yzcbs6RrndvDkZs1W/qXukmT/nZkSsb7XKoGA3k5w/ix3oYSqE5 pIVEhjgViMuIAx8M931whXt2fqtFr1k= X-Google-Smtp-Source: AGHT+IHRi3WHSS0eRi4G7jOWTC1zbVq5PcSR3DpHe2Rifw8FnLaa5w+GwmW/eLasxxswlFxA+y9CeGzvWuY= X-Received: from zagreus.c.googlers.com ([fda3:e722:ac3:cc00:7f:e700:c0a8:5c37]) (user=seanjc job=sendgmr) by 2002:a25:1f56:0:b0:d7e:7a8a:2159 with SMTP id f83-20020a251f56000000b00d7e7a8a2159mr100071ybf.5.1696549731485; Thu, 05 Oct 2023 16:48:51 -0700 (PDT) Date: Thu, 5 Oct 2023 16:48:50 -0700 In-Reply-To: <20231005175238.7bb2zut4fb7ebdqc@amd.com> Precedence: bulk X-Mailing-List: linux-coco@lists.linux.dev List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20231005175238.7bb2zut4fb7ebdqc@amd.com> Message-ID: Subject: Re: [RFC PATCH v2 1/6] KVM: gmem: Truncate pages on punch hole From: Sean Christopherson To: Michael Roth Cc: isaku.yamahata@intel.com, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, isaku.yamahata@gmail.com, Paolo Bonzini , erdemaktas@google.com, Sagi Shahar , David Matlack , Kai Huang , Zhi Wang , chen.bo@intel.com, linux-coco@lists.linux.dev, Chao Peng , Ackerley Tng , Vishal Annapurve , Yuan Yao , Jarkko Sakkinen , Xu Yilun , Quentin Perret , wei.w.wang@intel.com, Fuad Tabba Content-Type: text/plain; charset="us-ascii" On Thu, Oct 05, 2023, Michael Roth wrote: > On Thu, Sep 21, 2023 at 02:34:46PM -0700, Sean Christopherson wrote: > > On Thu, Sep 21, 2023, Sean Christopherson wrote: > > > > diff --git a/virt/kvm/guest_mem.c b/virt/kvm/guest_mem.c > > > > index a819367434e9..01fb4ca861d0 100644 > > > > --- a/virt/kvm/guest_mem.c > > > > +++ b/virt/kvm/guest_mem.c > > > > @@ -130,22 +130,32 @@ static void kvm_gmem_invalidate_end(struct kvm_gmem *gmem, pgoff_t start, > > > > static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len) > > > > { > > > > struct list_head *gmem_list = &inode->i_mapping->private_list; > > > > + struct address_space *mapping = inode->i_mapping; > > > > pgoff_t start = offset >> PAGE_SHIFT; > > > > pgoff_t end = (offset + len) >> PAGE_SHIFT; > > > > struct kvm_gmem *gmem; > > > > > > > > + /* > > > > + * punch hole may result in zeroing partial area. As pages can be > > > > + * encrypted, prohibit zeroing partial area. > > > > + */ > > > > + if (offset & ~PAGE_MASK || len & ~PAGE_MASK) > > > > + return -EINVAL; > > > > > > This should be unnecessary, kvm_gmem_fallocate() does > > > > > > if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len)) > > > return -EINVAL; > > > > > > before invoking kvm_gmem_punch_hole(). If that's not working, i.e. your test > > > fails, then that code needs to be fixed. I'll run your test to double-check, > > > but AFAICT this is unnecesary. > > > > I confirmed that the testcase passes without the extra checks. Just to close the > > loop, what prompted adding more checks to kvm_gmem_punch_hole()? > > I don't know if it's the same issue that Isaku ran into, but for SNP we > hit a similar issue with the truncate_inode_pages_range(lstart, lend) call. > > The issue in that case was a bit more subtle: > > - userspace does a hole-punch on a 4K range of its gmem FD, which happens > to be backed by a 2MB folio. > - truncate_inode_pages_range() gets called for that 4K range > - truncate_inode_pages_range() does special handling on the folios at the > start/end of the range in case they are partial and passes these to > truncate_inode_partial_folio(folio, lstart, lend). In this case, there's > just the 1 backing folio. But it *still* gets the special treatment, and > so gets passed to truncate_inode_partial_folio(). > - truncate_inode_partial_folio() will then zero that 4K range, even though > it is page-aligned, based on the following rationale in the comments: > > /* > * We may be zeroing pages we're about to discard, but it avoids > * doing a complex calculation here, and then doing the zeroing > * anyway if the page split fails. > */ > folio_zero_range(folio, offset, length); > > - after that, .invalidate_folio callback is issued, then the folio is split, > and the caller (truncate_inode_pages_range()) does another pass through > the whole range and can free the now-split folio then .free_folio callbacks > are issued. > > Because of that, we can't rely on .invalidate_folio/.free_folio to handle > putting the page back into a normal host-accessible state, because the > zero'ing will happen beforehand. Argh, and that causes an RMP violation #PF. FWIW, I don't *think* zeroing would be problematic for TDX. The page would get poisoned, but KVM would re-zero the memory with MOVDIR64B and flush the cache. > That's why we ended up needing to do this for SNP patches to make sure > arch-specific invalidation callbacks are issued before the truncation occurs: > > https://github.com/mdroth/linux/commit/4ebcc04b84dd691fc6daccb9b7438402520b0704#diff-77306411fdaeb7f322a1ca756dead9feb75363aa6117b703ac118576153ddb37R233 > > I'd planned to post those as a separate RFC to discuss, but when I came across > this it seemed like it might be relevant to what the TDX folks might ran into > here. > > If not for the zero'ing logic mentioned above, for SNP at least, the > .free_folio() ends up working pretty nicely for both truncation and fput(), > and even plays nicely with live update use-case where the destination gmem > instance shares the inode->i_mapping, since iput() won't trigger the > truncate_inode_pages_final() until the last reference goes away so we don't > have to do anything special in kvm_gmem_release() to determine when we > should/shouldn't issue the arch-invalidations to clean up things like the > RMP table. > > It seems like the above zero'ing logic could be reworked to only zero non-page > aligned ranges (as the comments above truncate_inode_pages_range() claim > should be the case), which would avoid the issue for the gmem use-case. But I > wonder if some explicit "dont-zero-these-pages" flag might be more robust. > > Or maybe there's some other way we should be going about this? Skipping the write seems like the obvious solution. An address_space flag, e.g. AS_INACCESSIBLE, would be the easiest thing to implement. Or maybe even make it AS_DONT_ZERO_ON_TRUNCATE_DAMMIT (mostly joking). Or a hook in address_space_operations to zero the folio, which conceptually is better in many ways, but feels like overkill.