Date: Mon, 5 Feb 2024 18:54:13 -0800
In-Reply-To: <91b97ed81a70c778352b2f569001820ea8b1c48b.camel@redhat.com>
References: <20230911021637.1941096-1-stevensd@google.com> <20230911021637.1941096-4-stevensd@google.com> <91b97ed81a70c778352b2f569001820ea8b1c48b.camel@redhat.com>
Subject: Re: [PATCH v9 3/6] KVM: mmu: Improve handling of non-refcounted pfns
From: Sean Christopherson
To: Maxim Levitsky
Cc: David Stevens, Yu Zhang, Isaku Yamahata, Zhi Wang, kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org, kvm@vger.kernel.org

On Tue, Oct 03, 2023, Maxim Levitsky wrote:
> On Mon, 2023-09-11 at 11:16 +0900, David Stevens wrote:
> > From: David Stevens
> >
> > The fact that non-refcounted pfns can no longer be accessed without mmu
> > notifier protection is a breaking change. Since there is no timeline for
> > updating everything in KVM to use mmu notifiers, this change adds an
> > opt-in module parameter called allow_unsafe_mappings to allow such
> > mappings. Systems which trust userspace not to tear down such unsafe
> > mappings while KVM is using them can set this parameter to re-enable the
> > legacy behavior.
>
> Do you have a practical example of a VM that can break with this change?
> E.g. will a normal VM break?  Will a VM with VFIO devices break?  Will a VM
> with hugepages mapped into it break?
>
> Will the trick of limiting the kernel memory with 'mem=X', and then using
> the extra 'upper memory' for VMs, still work?

That trick is exactly what will require an opt-in from the admin, as will
anything else where KVM is effectively relying on userspace to pinky swear
that the memory won't be migrated, freed, etc.

It's unlikely, but theoretically possible, that it might break existing setups
for "normal" VMs, e.g. if a VM is using VM_PFNMAP'd memory for a nested VMCS.
But such setups are already wildly broken, their users just don't know it.  The
proposal here is to require admins for such setups to opt in to the "unsafe"
behavior, i.e. to preserve backwards compatibility, but make the admin
explicitly acknowledge that what they are doing may have unwanted consequences.

> > +	/*
> > +	 * True if the returned pfn is for a page with a valid refcount.  False
> > +	 * if the returned pfn has no struct page or if the struct page is not
> > +	 * being refcounted (e.g. tail pages of non-compound higher order
> > +	 * allocations from IO/PFNMAP mappings).
> >
> Aren't all tail pages non-refcounted (e.g. the tail pages of a hugepage)?
> I haven't researched this topic yet.

Nope.  As Christoph stated[*], they are most definitely "weird" allocations,
though.

In this case, IIRC, it's DRM's Translation Table Manager (TTM) code that
kmalloc's a large chunk of memory, and then stuffs the pfn into the page
tables courtesy of the vmf_insert_pfn_prot() in ttm_bo_vm_fault_reserved().

The head page has a non-zero refcount, but it's not really refcounted.  And
the tail pages have nothing, which, IIRC, results in KVM inadvertently causing
pages to be freed by putting the last references.

[*] https://lore.kernel.org/all/ZRZeaP7W5SuereMX@infradead.org
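To make the pattern concrete, here's a rough sketch of the kind of fault
handler that produces such pfns.  This is illustrative only, not the actual
TTM code; the function name, allocation order, and (elided) error handling
are all made up:

static vm_fault_t example_vm_fault(struct vm_fault *vmf)
{
	/*
	 * A non-compound higher-order allocation (note: no __GFP_COMP):
	 * only the head page has a refcount, the tail pages have none.
	 * Allocation lifetime and error handling are elided for brevity.
	 */
	struct page *head = alloc_pages(GFP_KERNEL, 4);
	unsigned long pfn = page_to_pfn(head) + vmf->pgoff;

	/*
	 * Stuff the raw pfn into the userspace page tables.  No struct page
	 * reference is taken, so a consumer like KVM that blindly does
	 * get/put on the corresponding page can underflow the refcount of
	 * a tail page.
	 */
	return vmf_insert_pfn_prot(vmf->vma, vmf->address, pfn,
				   vmf->vma->vm_page_prot);
}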
> > +	 *
> > +	 * When this output flag is false, callers should not try to convert
> > +	 * the pfn to a struct page.

This should explain what the flag tracks, not what callers should or shouldn't
do with the flag.  E.g. strictly speaking, there's no danger in grabbing the
corresponding "struct page" if the caller verifies it's a valid PFN.  Trying
to use the page outside of mmu_notifier protection is where things would get
dicey.

> > +	 */
> > +	bool is_refcounted_page;
> >  };
> >
> >  kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll);
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 9b33a59c6d65..235c5cb3fdac 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -96,6 +96,10 @@ unsigned int halt_poll_ns_shrink;
> >  module_param(halt_poll_ns_shrink, uint, 0644);
> >  EXPORT_SYMBOL_GPL(halt_poll_ns_shrink);
> >
> > +/* Allow non-struct page memory to be mapped without MMU notifier protection. */
> > +static bool allow_unsafe_mappings;
> > +module_param(allow_unsafe_mappings, bool, 0444);
> > +
> >  /*
> >   * Ordering of locks:
> >   *
> > @@ -2507,6 +2511,15 @@ static inline int check_user_page_hwpoison(unsigned long addr)
> >  	return rc == -EHWPOISON;
> >  }
> >
> > +static kvm_pfn_t kvm_follow_refcounted_pfn(struct kvm_follow_pfn *foll,
> > +					   struct page *page)
> > +{
> > +	kvm_pfn_t pfn = page_to_pfn(page);
> > +
> > +	foll->is_refcounted_page = true;
> > +	return pfn;
> > +}
>
> Just a matter of taste, but to me this function looks confusing.
> IMHO, just duplicating these two lines of code is better.
> However, if you prefer, I won't argue over this.

Hrm, when I suggested this, there was also a put_page() and a comment about
hacky code in there:

static kvm_pfn_t kvm_follow_refcounted_pfn(struct kvm_follow_pfn *foll,
					   struct page *page)
{
	kvm_pfn_t pfn = page_to_pfn(page);

	foll->is_refcounted_page = true;

	/*
	 * FIXME: Ideally, KVM wouldn't pass FOLL_GET to gup() when the caller
	 * doesn't want to grab a reference, but gup() doesn't support getting
	 * just the pfn, i.e. FOLL_GET is effectively mandatory.  If that ever
	 * changes, drop this and simply don't pass FOLL_GET to gup().
	 */
	if (!(foll->flags & FOLL_GET))
		put_page(page);

	return pfn;
}

More below.

> > -	/*
> > -	 * Get a reference here because callers of *hva_to_pfn* and
> > -	 * *gfn_to_pfn* ultimately call kvm_release_pfn_clean on the
> > -	 * returned pfn.  This is only needed if the VMA has VM_MIXEDMAP
> > -	 * set, but the kvm_try_get_pfn/kvm_release_pfn_clean pair will
> > -	 * simply do nothing for reserved pfns.
> > -	 *
> > -	 * Whoever called remap_pfn_range is also going to call e.g.
> > -	 * unmap_mapping_range before the underlying pages are freed,
> > -	 * causing a call to our MMU notifier.
> > -	 *
> > -	 * Certain IO or PFNMAP mappings can be backed with valid
> > -	 * struct pages, but be allocated without refcounting e.g.,
> > -	 * tail pages of non-compound higher order allocations, which
> > -	 * would then underflow the refcount when the caller does the
> > -	 * required put_page.  Don't allow those pages here.
> > -	 */
>
> Why is the comment removed?  As far as I can see, the code still grabs a
> reference to the page.

Because what the above comment is saying is effectively obsoleted by
is_refcounted_page.

The old comment is essentially saying, "grab a reference because KVM expects
to put a reference", whereas the new code grabs a reference because it's
necessary to honor allow_unsafe_mappings.

So I agree with David that the existing comment should go away, but I agree
with you that kvm_follow_refcounted_pfn() really needs a comment, e.g. to
explain how KVM manages struct page refcounts.
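For example, something along these lines (purely a strawman for what such a
comment could say, based on the semantics discussed in this thread, not
wording from the actual patch):

static kvm_pfn_t kvm_follow_refcounted_pfn(struct kvm_follow_pfn *foll,
					   struct page *page)
{
	/*
	 * Record that @page has a valid, non-zero refcount, i.e. that a
	 * reference was acquired on the caller's behalf and must eventually
	 * be put.  When is_refcounted_page is false, no reference is held,
	 * and the pfn is safe to use only under mmu_notifier protection
	 * (or via the allow_unsafe_mappings override).
	 */
	foll->is_refcounted_page = true;
	return page_to_pfn(page);
}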
> > -	if (!kvm_try_get_pfn(pfn))
> > -		r = -EFAULT;
> > +	if (get_page_unless_zero(page))
> > +		WARN_ON_ONCE(kvm_follow_refcounted_pfn(foll, page) != pfn);
>
> Once again, the kvm_follow_refcounted_pfn usage is confusing IMHO.  It sets
> 'foll->is_refcounted_page', and yet someone might think that it's only there
> for the WARN_ON_ONCE.
>
> IMHO this would read better:
>
>	if (get_page_unless_zero(page))
>		foll->is_refcounted_page = true;
>
>	WARN_ON_ONCE(page_to_pfn(page) != pfn);
>
> Note that I moved the WARN out of the 'get_page_unless_zero' condition
> because I think that this condition should be true for non-refcounted pages
> as well.

Yeah, let me see if I can piece together what happened to the put_page() call.

> Also, I don't understand why the 'get_page_unless_zero(page)' result is
> ignored.  As I understand it, it will increase the refcount of a page unless
> it is zero.
>
> If the refcount of a refcounted page is 0, isn't that a bug?

Yes.  The problem is that KVM doesn't know if a page is refcounted or not
without actually trying to acquire a reference.  See the TTM mess mentioned
above.

Note, Christoph is suggesting that KVM instead refuse to play nice and force
the TTM code to properly refcount things.  I very much like that idea in
theory, but I have no idea how feasible it is (that code is all kinds of
twisty).

> The page was returned from kvm_pfn_to_refcounted_page(), which is supposed
> to return only pages that are refcounted.
>
> I might not understand something in regard to 'get_page_unless_zero(page)'
> usage in both the old and the new code.
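For anyone trying to follow along, the control flow being debated looks
roughly like this.  It's pieced together from the quoted hunks and the
discussion above; the function name and the guarded_by_mmu_notifier flag are
approximations, not the literal patch:

static int hva_to_pfn_remapped_sketch(struct kvm_follow_pfn *foll,
				      kvm_pfn_t pfn)
{
	struct page *page = kvm_pfn_to_refcounted_page(pfn);

	/*
	 * KVM can't tell whether a pfn's page is truly refcounted without
	 * trying to take a reference: tail pages of non-compound
	 * higher-order allocations have a zero refcount, so a blind
	 * get/put would underflow.
	 */
	if (page && get_page_unless_zero(page)) {
		WARN_ON_ONCE(kvm_follow_refcounted_pfn(foll, page) != pfn);
		return 0;
	}

	/*
	 * A non-refcounted pfn is usable only under mmu_notifier
	 * protection, unless the admin explicitly opted in to the legacy,
	 * unsafe behavior via the allow_unsafe_mappings module param.
	 */
	if (!foll->guarded_by_mmu_notifier && !allow_unsafe_mappings)
		return -EFAULT;

	return 0;
}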