From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Fri, 11 Jul 2025 08:39:59 -0700
In-Reply-To: <20250711151719.goee7eqti4xyhsqr@amd.com>
X-Mailing-List: linux-kernel@vger.kernel.org
Mime-Version: 1.0
References: <20250703062641.3247-1-yan.y.zhao@intel.com>
 <20250709232103.zwmufocd3l7sqk7y@amd.com>
 <20250711151719.goee7eqti4xyhsqr@amd.com>
Subject: Re: [RFC PATCH] KVM: TDX: Decouple TDX init mem region from kvm_gmem_populate()
From: Sean Christopherson
To: Michael Roth
Cc: Yan Zhao, pbonzini@redhat.com, kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
 rick.p.edgecombe@intel.com, kai.huang@intel.com, adrian.hunter@intel.com,
 reinette.chatre@intel.com, xiaoyao.li@intel.com, tony.lindgren@intel.com,
 binbin.wu@linux.intel.com, dmatlack@google.com, isaku.yamahata@intel.com,
 ira.weiny@intel.com, vannapurve@google.com, david@redhat.com,
 ackerleytng@google.com, tabba@google.com, chao.p.peng@intel.com
Content-Type: text/plain; charset="us-ascii"

On Fri, Jul 11, 2025, Michael Roth wrote:
> On Fri, Jul 11, 2025 at 12:36:24PM +0800, Yan Zhao wrote:
> > Besides, it can't address the 2nd AB-BA lock issue as mentioned in the patch
> > log:
> >
> > Problem
> > ===
> > ...
> > (2)
> > Moreover, in step 2, get_user_pages_fast() may acquire mm->mmap_lock,
> > resulting in the following lock sequence in tdx_vcpu_init_mem_region():
> > - filemap invalidation lock --> mm->mmap_lock
> >
> > However, in future code, the shared filemap invalidation lock will be held
> > in kvm_gmem_fault_shared() (see [6]), leading to the lock sequence:
> > - mm->mmap_lock --> filemap invalidation lock
>
> I wouldn't expect kvm_gmem_fault_shared() to trigger for the
> KVM_MEMSLOT_SUPPORTS_GMEM_SHARED case (or whatever we end up naming it).

Irrespective of shared faults, I think the API could do with a bit of cleanup
now that TDX has landed, i.e. now that we can see a bit more of the picture.

As is, I'm pretty sure TDX is broken with respect to hugepage support, because
kvm_gmem_populate() marks an entire folio as prepared, but TDX only ever deals
with one page at a time.
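To make the granularity mismatch concrete, here's a toy userspace model (plain
C, not kernel code; the names loosely mirror the kernel ones but the logic is
deliberately simplified): if the "prepared" state is tracked per folio while
population happens per 4KiB page, populating the first page of a folio makes
every later page in that folio look already-populated.

	#include <assert.h>
	#include <errno.h>
	#include <stdbool.h>

	/*
	 * One "prepared" bit for the whole folio, as kvm_gmem_populate()
	 * effectively does today via kvm_gmem_mark_prepared(folio).
	 */
	static bool folio_prepared;

	/* Toy stand-in for populating a single 4KiB page of the folio. */
	static int populate_one_page(int page)
	{
		(void)page;
		if (folio_prepared)
			return -EEXIST;
		/* ... post_populate() would copy/encrypt the page here ... */
		folio_prepared = true;	/* marks the whole folio, not one page */
		return 0;
	}

	int main(void)
	{
		assert(populate_one_page(0) == 0);
		/* TDX works page-by-page, so page 1 already looks prepared: */
		assert(populate_one_page(1) == -EEXIST);
		return 0;
	}
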
So that needs to be changed.  I assume it's already addressed in one of the many
upcoming series, but it still shows a flaw in the API.

Hoisting the retrieval of the source page outside of filemap_invalidate_lock()
seems pretty straightforward, and would provide a consistent ABI for all vendor
flavors.  E.g. as is, non-struct-page memory will work for SNP, but not TDX.
The obvious downside is that struct-page becomes a requirement for SNP, but that
seems like a reasonable tradeoff.

The below could be tweaked to batch get_user_pages() into an array of pointers,
but given that both SNP and TDX can only operate on one 4KiB page at a time, and
that hugepage support doesn't yet exist, trying to super optimize the hugepage
case straightaway doesn't seem like a pressing concern.

static long __kvm_gmem_populate(struct kvm *kvm, struct kvm_memory_slot *slot,
				struct file *file, gfn_t gfn, void __user *src,
				kvm_gmem_populate_cb post_populate, void *opaque)
{
	pgoff_t index = kvm_gmem_get_index(slot, gfn);
	struct page *src_page = NULL;
	bool is_prepared = false;
	struct folio *folio;
	int ret, max_order;
	kvm_pfn_t pfn;

	if (src) {
		ret = get_user_pages((unsigned long)src, 1, 0, &src_page);
		if (ret < 0)
			return ret;
		if (ret != 1)
			return -ENOMEM;
	}

	filemap_invalidate_lock(file->f_mapping);

	if (!kvm_range_has_memory_attributes(kvm, gfn, gfn + 1,
					     KVM_MEMORY_ATTRIBUTE_PRIVATE,
					     KVM_MEMORY_ATTRIBUTE_PRIVATE)) {
		ret = -EINVAL;
		goto out_unlock;
	}

	folio = __kvm_gmem_get_pfn(file, slot, index, &pfn, &is_prepared,
				   &max_order);
	if (IS_ERR(folio)) {
		ret = PTR_ERR(folio);
		goto out_unlock;
	}

	folio_unlock(folio);

	if (is_prepared) {
		ret = -EEXIST;
		goto out_put_folio;
	}

	ret = post_populate(kvm, gfn, pfn, src_page, opaque);
	if (!ret)
		kvm_gmem_mark_prepared(folio);

out_put_folio:
	folio_put(folio);
out_unlock:
	filemap_invalidate_unlock(file->f_mapping);

	if (src_page)
		put_page(src_page);
	return ret;
}

long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src,
		       long npages, kvm_gmem_populate_cb post_populate,
		       void *opaque)
{
	struct file *file;
	struct kvm_memory_slot *slot;
	void __user *p;
	int ret = 0;
	long i;

	lockdep_assert_held(&kvm->slots_lock);

	if (npages < 0)
		return -EINVAL;

	slot = gfn_to_memslot(kvm, start_gfn);
	if (!kvm_slot_can_be_private(slot))
		return -EINVAL;

	file = kvm_gmem_get_file(slot);
	if (!file)
		return -EFAULT;

	npages = min_t(ulong, slot->npages - (start_gfn - slot->base_gfn), npages);

	for (i = 0; i < npages; i++) {
		if (signal_pending(current)) {
			ret = -EINTR;
			break;
		}

		p = src ? src + i * PAGE_SIZE : NULL;
		ret = __kvm_gmem_populate(kvm, slot, file, start_gfn + i, p,
					  post_populate, opaque);
		if (ret)
			break;
	}

	fput(file);

	return ret && !i ? ret : i;
}
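For reference, the return convention at the tail (ret && !i ? ret : i) means:
report the error only when no pages were populated at all, otherwise report how
many pages completed before the loop stopped.  A quick standalone check of just
that expression (plain C; populate_retval() is a hypothetical name for
illustration, not a kernel function):

	#include <assert.h>
	#include <errno.h>

	/*
	 * Models the tail of kvm_gmem_populate(): ret is the last per-page
	 * result, i is how many pages completed before the loop stopped.
	 */
	static long populate_retval(int ret, long i)
	{
		return ret && !i ? ret : i;
	}

	int main(void)
	{
		assert(populate_retval(0, 5) == 5);            /* full success: count */
		assert(populate_retval(-EINTR, 3) == 3);       /* partial: count, not error */
		assert(populate_retval(-EINVAL, 0) == -EINVAL);/* nothing done: error */
		return 0;
	}
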