From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <kvm-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 2FC18C4332F
	for <kvm@archiver.kernel.org>; Wed, 23 Nov 2022 17:18:23 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S237539AbiKWRSV (ORCPT <rfc822;kvm@archiver.kernel.org>);
        Wed, 23 Nov 2022 12:18:21 -0500
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:39042 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S239221AbiKWRSF (ORCPT <rfc822;kvm@vger.kernel.org>);
        Wed, 23 Nov 2022 12:18:05 -0500
Received: from mail-pg1-x52b.google.com (mail-pg1-x52b.google.com [IPv6:2607:f8b0:4864:20::52b])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id E7B16490BD
        for <kvm@vger.kernel.org>; Wed, 23 Nov 2022 09:18:02 -0800 (PST)
Received: by mail-pg1-x52b.google.com with SMTP id n17so17293371pgh.9
        for <kvm@vger.kernel.org>; Wed, 23 Nov 2022 09:18:02 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=google.com; s=20210112;
        h=in-reply-to:content-transfer-encoding:content-disposition
         :mime-version:references:message-id:subject:cc:to:from:date:from:to
         :cc:subject:date:message-id:reply-to;
        bh=wrS07cTC5K5Tj39aOQ2pHWaQp/nnwsNDHjx9FaHsOZw=;
        b=hKlCMfoNbebyaz0GGzRhHybmF6QIFeHL8ypqHOA7H6Qu3U3+67CjMYRiCpvMZQj2iU
         UldfcZ77FGWYA0tbPU17LtXHlHC/16+lOUJtKYU0MxXlE0+uaJlph2jtIch1d0yCCiFk
         bJ6FTNxHvLyxmxWeMlOeI2OPXO5QepiHfZGGTzGMErdK6B7zrbaKhertNIjZ4TGfNfrl
         623Kf+4oiQs12tLhvUhwXFygPjCOBLcGkPBESqAR4c3A9jJ+N90VmbY/YnSsnoQfO9em
         dcWD5MsbKEQKIT1Kgy6K+CX2dGLTHmRfXdw2/Hlp2TvF1u0ujiYkyd1ky9jl+SrzAOGY
         LAZw==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20210112;
        h=in-reply-to:content-transfer-encoding:content-disposition
         :mime-version:references:message-id:subject:cc:to:from:date
         :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to;
        bh=wrS07cTC5K5Tj39aOQ2pHWaQp/nnwsNDHjx9FaHsOZw=;
        b=qbxYWR5yzbOkL+dst/lBEVXnAZO/vPrYFUdCqEvOxoq3Afx+DvWFdoQ2OJCv2Oo0JJ
         ssIMHEAfy9PH7GcrejwqVHkqd1Bsybf/Hrax9G9msxhH6BUClXxSbSCYVeMaZcsIcvq/
         aeNhbL/EAXnFJQ25+0k0bIne9YkvZldC2a1779L6WU7E0QbekY4MIIAs96mPIHQ6dulV
         AkViz0yXqDhVApCARx4mONKTCeqKvpD8QQcT6tLWv1DWCYWxHUc7vJ0pTOcj9wFS7HBr
         M8A+ii2DBxV0tUcT5qOIspVmXk2ocCjkfc/bzf2I4KjXxk1zXueclOqtHAATt6rvRXfE
         Qk+g==
X-Gm-Message-State: ANoB5pmrLdNtBTSrTOnzHscoApIBSnm3bh5CaQdzaG78ZfCY3ZwZd8RN
        AHH0RTpu4hYnNbJJ5/LQdU4S7A==
X-Google-Smtp-Source: AA0mqf5NHQuFjEK1f1dMpwRtz5Fe72kx+Dmi96MPegsu5xq9WODjpbMEGJeg5dmaj/0E09k5vBEadA==
X-Received: by 2002:a65:4948:0:b0:46e:be03:d9b5 with SMTP id q8-20020a654948000000b0046ebe03d9b5mr8014187pgs.495.1669223882291;
        Wed, 23 Nov 2022 09:18:02 -0800 (PST)
Received: from google.com (7.104.168.34.bc.googleusercontent.com. [34.168.104.7])
        by smtp.gmail.com with ESMTPSA id i29-20020a63131d000000b00460ea630c1bsm10936999pgl.46.2022.11.23.09.18.01
        (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
        Wed, 23 Nov 2022 09:18:01 -0800 (PST)
Date:   Wed, 23 Nov 2022 17:17:57 +0000
From:   Sean Christopherson <seanjc@google.com>
To:     David Woodhouse <dwmw2@infradead.org>
Cc:     Paolo Bonzini <pbonzini@redhat.com>, kvm@vger.kernel.org,
        mhal@rbox.co
Subject: Re: [PATCH 2/4] KVM: x86/xen: Compatibility fixes for shared
 runstate area
Message-ID: <Y35VxflJBVjzloaj@google.com>
References: <20221119094659.11868-1-dwmw2@infradead.org>
 <20221119094659.11868-2-dwmw2@infradead.org>
 <Y30XVDXmkAIlRX4N@google.com>
 <a12c8e6123cf702bc882988f9da3be7bd096a2e3.camel@infradead.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <a12c8e6123cf702bc882988f9da3be7bd096a2e3.camel@infradead.org>
Precedence: bulk
List-ID: <kvm.vger.kernel.org>
X-Mailing-List: kvm@vger.kernel.org

On Tue, Nov 22, 2022, David Woodhouse wrote:
> On Tue, 2022-11-22 at 18:39 +0000, Sean Christopherson wrote:
> > On Sat, Nov 19, 2022, David Woodhouse wrote:
> > > From: David Woodhouse <
> > > dwmw@amazon.co.uk
> > > >
> > > 
> > > The guest runstate area can be arbitrarily byte-aligned. In fact, even
> > > when a sane 32-bit guest aligns the overall structure nicely, the 64-bit
> > > fields in the structure end up being unaligned due to the fact that the
> > > 32-bit ABI only aligns them to 32 bits.
> > > 
> > > So setting the ->state_entry_time field to something|XEN_RUNSTATE_UPDATE
> > > is buggy, because if it's unaligned then we can't update the whole field
> > > atomically; the low bytes might be observable before the _UPDATE bit is.
> > > Xen actually updates the *byte* containing that top bit, on its own. KVM
> > > should do the same.
> > 
> > I think we're using the wrong APIs to update the runstate.  The VMCS/VMCB pages
> > _need_ the host pfn, i.e. need to use a gpc (eventually).  The Xen PV stuff on the
> > other hand most definitely doesn't need to know the pfn.
> > 
> > The event channel code would be difficult to convert due to lack of uaccess
> > primitives, but I don't see anything in the runstate code that prevents KVM from
> > using a gfn_to_hva_cache.  That will naturally handle page splits by sending them
> > down a slow path and would yield far simpler code.
> > 
> > If taking the slow path is an issue, then the guest really should be fixed to not
> > split pages.  And if that's not an acceptable answer, the gfn_to_hva_cache code
> > could be updated to use the fast path if the region is contiguous in the host
> > virtual address space.
> > 
> 
> Yeah, that's tempting. Going back to gfn_to_hva_cache was the first
> thing I tried. There are a handful of complexifying factors, none of
> which are insurmountable if we try hard enough.
> 
>  • Even if we fix the gfn_to_hva_cache to still use the fast path for
>    more than the first page, it's still possible for the runstate to
>    cross from one memslot to an adjacent one. We probably still need
>    two of them, which is a large part of the ugliness of this patch.

Hrm.  What if KVM requires that the runstate be contiguous in host virtual address
space?  That wouldn't violate Xen's ABI since it's a KVM restriction, and it can't
break backwards compatibility since KVM doesn't even handle page splits, let alone
memslot splits.  AFAIK, most (all?) userspace VMMs make guest RAM virtually
contiguous anyways.

Probably a moot point since I don't see a way around the "page got swapped out"
issue.

>  • Accessing it via the userspace HVA requires coping with the case
>    where the vCPU is being scheduled out, and it can't sleep. The
>    previous code before commit 34d41d19e dealt with that by using 
>    pagefault_disable() and depending on the GHC fast path. But even
>    so...
> 
>  • We also could end up having to touch page B, page A then page B
>    again, potentially having to abort and leave the runstate with
>    the XEN_RUNSTATE_UPDATE bit still set. I do kind of prefer the
>    version which checks that both pages are present before it starts
>    writing anything to the guest.

Ugh, and because of the in-atomic case, there's nothing KVM can do to remedy page B
getting swapped out.  Actually, it doesn't even require a B=>A=>B pattern.  Even
without a page split, the backing page could get swapped out between setting and
clearing.

> As I said, they're not insurmountable but that's why I ended up with
> the patch you see, after first attempting to use a gfn_to_hva_cache
> again. Happy to entertain suggestions.

What if we use a gfn_to_pfn_cache for the page containing XEN_RUNSTATE_UPDATE,
and then use kvm_vcpu_write_guest_atomic() if there's a page split?  That would
avoid needing to acquire mutliple gpc locks, and the update could use a single
code flow by always constructing an on-stack representation.  E.g. very roughly:

	*update_bit = (vx->runstate_entry_time | XEN_RUNSTATE_UPDATE) >> 56;
	smp_wmb();

	if (khva)
		memcpy(khva, rs_state, user_len);
	else
		kvm_vcpu_write_guest_in_atomic(vcpu, gpa, rs_state, user_len);
	smp_wmb();

	*update_bit = vx->runstate_entry_time >> 56;
	smp_wmb();

where "khva" is NULL if there's a page split.

> I note we have a kvm_read_guest_atomic() and briefly pondered adding a
> 'write' version of same... but that doesn't seem to cope with crossing
> memslots either; it also assumes its caller only uses it to access
> within a single page.

We should fix the lack of page split handling.  There are no bugs because KVM
only uses the helper for cases where the accesses are naturally aligned, but that's
bound to bite someone eventually.  That would be an oppurtunity to dedup the code
that handles "segments" (ugh), e.g. instead of

	gfn_t gfn = gpa >> PAGE_SHIFT;
	int seg;
	int offset = offset_in_page(gpa);
	int ret;

	while ((seg = next_segment(len, offset)) != 0) {
		ret = kvm_vcpu_read_guest_page(vcpu, gfn, data, offset, seg);
		if (ret < 0)
			return ret;
		offset = 0;
		len -= seg;
		data += seg;
		++gfn;
	}
	return 0;

provide a macro that allows

	kvm_for_each_chunk(...)
		ret = kvm_vcpu_read_guest_page(vcpu, gfn, data, offset, seg);
		if (ret)
			return ret;
	}
	
I also think we should rename the "atomic" helpers to be "in_atomic" (or inatomic
if we want to follow the kernel's terrible nomenclature, which I always read as
"not atomic").  I always think read_guest_atomic() means an "atomic read".