From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail-pl1-f201.google.com (mail-pl1-f201.google.com [209.85.214.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 451A926D4CA for ; Wed, 1 Oct 2025 06:21:49 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.214.201 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1759299712; cv=none; b=HQQmCCEBiONjGk22m5dPirTTMWSssPbdsvSEvxI5iP/GggwPVaUM5ZXaLEAdgIppOewE8W71NrKnFX7+R8tq83bLT96j/8TA/UiL9pC3+FwU62zEH2oTBj/jDavgmcnvEiazdxLPXTJSfjuvXrLs6QXEagrTvuCChYQ1gzbOuxg= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1759299712; c=relaxed/simple; bh=CnpYtzn8PUc0ZBYLdHGVpOdTJPVwWMwJAwS94JAPcks=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=cxXfexyavRgmrEJuk4eRErACAlkmaYwP25Mpfoeeh4Dha3fr2VylTUJUoEH2mUT93UrbTlGqO9KYrRUNrosO1VumtKOieHWu2DIPTimpLedXc1ctsmcROzwNI+/ihn8PLZwVNhBPvoaCY1A2HC3wiip0gVv4qDYx06H/2L9Cj+w= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--ackerleytng.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=g83fPaov; arc=none smtp.client-ip=209.85.214.201 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--ackerleytng.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="g83fPaov" Received: by mail-pl1-f201.google.com with SMTP id d9443c01a7336-269880a7bd9so82957585ad.3 for ; Tue, 30 Sep 2025 23:21:49 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1759299708; x=1759904508; darn=vger.kernel.org; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=gKyzk1pWtrNabjmeFySfVdKtKDt4rOVB/cSEsmccNXs=; b=g83fPaov2u0cDHkCnR3/lI62BOO7z45fEvMJ42bbRY2JHWLE2gm6K/XMqfEz3COvTd Af1pl36jnPeYTE3JQXHnRta4hZMGs8CnuUte8NrR4foGZts5w5u/uteEW5V7/weTcL2q LP4X2pyeWlVIIgldfuvZOB7o4Df3B+j/I+hSIDrpS15VElNbbQqwlxc2n4prLJ+vc5O9 jfXwXPikix3fuejz14TbEpEx8z6Nczw2xFJl29pM3BtD61gBAqxRo4kUCCrvaOiAnvlh PE1FXfWtk5boNvTBr4u6egTqUc42WsaPfTcWbca9VGjjHIJAckly5WpQOJ0jpNddGDje ZRAQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1759299708; x=1759904508; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=gKyzk1pWtrNabjmeFySfVdKtKDt4rOVB/cSEsmccNXs=; b=et+mSt0lFdp5tRzaSqggCq9YOwRqW9+XrVy7N2R0bFGzS/o+CiS1WSfU9JohiRq5MP ci9u2SbEIDpaFpHzIC67SkLUXotbeFtx5TvMkCKhypqNoGUUjhX9XWaL9yzGw6urwnb0 yfWLcbP3ixwumuRN12bLs5EziQI0dF1V1ebqGYLs16dGYwW9Jw+Bzuf5+QgMCwxEtFHv FwV6y5tYKld8VWWI1gMfcFP6b6EYoq1Fm3J9syuReYW2lsY907uBhV3JYt+ieyCWBOiI b2vfqkWFxbadZvBGMPczZE96hJdps8XDT+6Lc8Ssrk7t0MDnCmrMu2ELbVRPzgHso/Uh he5g== X-Forwarded-Encrypted: i=1; AJvYcCUeADkKqvB4SnnxdcHoyz6m6l//ZW1I5TBKyikXipGXiywS9H+qRD+dmyOaJPLm5bdvMr0=@vger.kernel.org X-Gm-Message-State: AOJu0Yyb+Jq3DV/LzO/zp4xbZoSkQstODqMfFwzpBQvBihcJ5o5oQAer bEl5FgVofZBorDSXcoVoxYA4Tp/SOfbRfLCmLZYqx2StCjMVfSeBPCI/9MG1XK0c7ZZ4daP7YdM HybmqJFnpIBeEyr1dsE+c0O8Uwg== X-Google-Smtp-Source: AGHT+IHE0M7fuh9/BYVbAm929jP+iKH/pWdowlRc+CxCehWBNW7QMDY1Rz46MkircjbJVl76AmtAEEEYRxQN81OqOg== X-Received: from plbjw19.prod.google.com ([2002:a17:903:2793:b0:273:c5f4:a8ca]) (user=ackerleytng job=prod-delivery.src-stubby-dispatcher) by 2002:a17:903:1251:b0:283:c950:a783 with SMTP id d9443c01a7336-28e7f437ef7mr31963025ad.56.1759299708586; Tue, 30 Sep 2025 23:21:48 -0700 (PDT) Date: Wed, 01 Oct 2025 06:21:47 +0000 In-Reply-To: <20250807094503.4691-1-yan.y.zhao@intel.com> Precedence: bulk X-Mailing-List: kvm@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20250807093950.4395-1-yan.y.zhao@intel.com> <20250807094503.4691-1-yan.y.zhao@intel.com> Message-ID: Subject: Re: [RFC PATCH v2 17/23] KVM: guest_memfd: Split for punch hole and private-to-shared conversion From: Ackerley Tng To: Yan Zhao , pbonzini@redhat.com, seanjc@google.com Cc: linux-kernel@vger.kernel.org, kvm@vger.kernel.org, x86@kernel.org, rick.p.edgecombe@intel.com, dave.hansen@intel.com, kas@kernel.org, tabba@google.com, quic_eberman@quicinc.com, michael.roth@amd.com, david@redhat.com, vannapurve@google.com, vbabka@suse.cz, thomas.lendacky@amd.com, pgonda@google.com, zhiquan1.li@intel.com, fan.du@intel.com, jun.miao@intel.com, ira.weiny@intel.com, isaku.yamahata@intel.com, xiaoyao.li@intel.com, binbin.wu@linux.intel.com, chao.p.peng@intel.com, yan.y.zhao@intel.com Content-Type: text/plain; charset="UTF-8" Yan Zhao writes: Thanks Yan! Just got around to looking at this, sorry about the delay! > In TDX, private page tables require precise zapping because faulting back > the zapped mappings necessitates the guest's re-acceptance. Therefore, > before performing a zap for hole punching and private-to-shared > conversions, huge leafs that cross the boundary of the zapping GFN range in > the mirror page table must be split. > > Splitting may result in an error. If this happens, hole punching and > private-to-shared conversion should bail out early and return an error to > userspace. > > Splitting is not necessary for kvm_gmem_release() since the entire page > table is being zapped, nor for kvm_gmem_error_folio() as an SPTE must not > map more than one physical folio. > > Therefore, in this patch, > - break kvm_gmem_invalidate_begin_and_zap() into > kvm_gmem_invalidate_begin() and kvm_gmem_zap() and have > kvm_gmem_release() and kvm_gmem_error_folio() to invoke them. > I think perhaps separating invalidate and zip could be a separate patch from adding the split step into the flow, that would make this patch smaller and easier to review. No action required from you for now, I have the the above part in a separate patch already (not yet posted). > - have kvm_gmem_punch_hole() to invoke kvm_gmem_invalidate_begin(), > kvm_gmem_split_private(), and kvm_gmem_zap(). > Bail out if kvm_gmem_split_private() returns error. > IIUC the current upstream position is that hole punching will not be permitted for ranges smaller than the page size for the entire guest_memfd. Hence no splitting required during hole punch? + 4K guest_memfd: no splitting required since the EPT entries will not be larger than 4K anyway + 2M and 1G (x86) guest_memfd: no splitting required since the entire EPT entry will have to go away for valid ranges (valid ranges are either 2M or 1G anyway) Does that sound right? > - drop the old kvm_gmem_unmap_private() and have private-to-shared > conversion to invoke kvm_gmem_split_private() and kvm_gmem_zap() instead. > Bail out if kvm_gmem_split_private() returns error. > > Co-developed-by: Ackerley Tng > Signed-off-by: Ackerley Tng > Signed-off-by: Yan Zhao > > [...snip...] > > @@ -514,6 +554,8 @@ static int kvm_gmem_convert_should_proceed(struct inode *inode, > struct conversion_work *work, > bool to_shared, pgoff_t *error_index) > { > + int ret = 0; > + > if (to_shared) { > struct list_head *gmem_list; > struct kvm_gmem *gmem; > @@ -522,19 +564,24 @@ static int kvm_gmem_convert_should_proceed(struct inode *inode, > work_end = work->start + work->nr_pages; > > gmem_list = &inode->i_mapping->i_private_list; > + list_for_each_entry(gmem, gmem_list, entry) { > + ret = kvm_gmem_split_private(gmem, work->start, work_end); > + if (ret) > + return ret; > + } Will be refactoring the conversion steps a little for the next version of this series, hence I'd like to ask about the requirements before doing splitting. The requirement is to split before zapping, right? Other than that we technically don't need to split before checking for a safe refcount, right? > list_for_each_entry(gmem, gmem_list, entry) > - kvm_gmem_unmap_private(gmem, work->start, work_end); > + kvm_gmem_zap(gmem, work->start, work_end, KVM_FILTER_PRIVATE); > } else { > unmap_mapping_pages(inode->i_mapping, work->start, > work->nr_pages, false); > > if (!kvm_gmem_has_safe_refcount(inode->i_mapping, work->start, > work->nr_pages, error_index)) { > - return -EAGAIN; > + ret = -EAGAIN; > } > } > > - return 0; > + return ret; > } > > > [...snip...] > > @@ -1906,8 +1926,14 @@ static int kvm_gmem_error_folio(struct address_space *mapping, struct folio *fol > start = folio->index; > end = start + folio_nr_pages(folio); > > - list_for_each_entry(gmem, gmem_list, entry) > - kvm_gmem_invalidate_begin_and_zap(gmem, start, end); > + /* The size of the SEPT will not exceed the size of the folio */ I think splitting might be required here, but that depends on whether we want to unmap just a part of the huge folio or whether we want to unmap the entire folio. Lots of open questions on memory failure handling, but for now I think this makes sense. > + list_for_each_entry(gmem, gmem_list, entry) { > + enum kvm_gfn_range_filter filter; > + > + kvm_gmem_invalidate_begin(gmem, start, end); > + filter = KVM_FILTER_PRIVATE | KVM_FILTER_SHARED; > + kvm_gmem_zap(gmem, start, end, filter); > + } > > /* > * Do not truncate the range, what action is taken in response to the > -- > 2.43.2