From: Nitesh Narayan Lal
Subject: Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting
Date: Mon, 18 Feb 2019 10:50:24 -0500
To: David Hildenbrand, kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
 pbonzini@redhat.com, lcapitulino@redhat.com, pagupta@redhat.com,
 wei.w.wang@intel.com, yang.zhang.wz@gmail.com, riel@surriel.com,
 mst@redhat.com, dodgen@google.com, konrad.wilk@oracle.com,
 dhildenb@redhat.com, aarcange@redhat.com, Alexander Duyck
References: <20190204201854.2328-1-nitesh@redhat.com>

On 2/16/19 4:40 AM, David Hildenbrand wrote:
> On 04.02.19 21:18, Nitesh Narayan Lal wrote:
>
> Hi Nitesh,
>
> I thought again about how s390x handles free page hinting. As that seems
> to work just fine, I guess sticking to a similar model makes sense.
>
> I already explained in this thread how it works on s390x; a short summary:
>
> 1. Each VCPU has a buffer of PFNs to be reported to the hypervisor. If I
> am not wrong, it contains 512 entries, so it is exactly one page big. This
> buffer is stored in the hypervisor and works at page granularity.
>
> 2. This page buffer is managed via the ESSA instruction. In addition, to
> synchronize with the guest ("page reused when freeing in the
> hypervisor"), special bits in the host->guest page table can be
> set/locked via the ESSA instruction by the guest and similarly accessed
> by the hypervisor.
>
> 3. Once the buffer is full, the guest does a synchronous hypercall,
> going over all 512 entries and zapping them (== similar to MADV_DONTNEED).
>
> To mimic that, we:
>
> 1. Have a static buffer per VCPU in the guest with 512 entries. You
> basically have that already.
>
> 2. On every free, add the page _or_ the page after merging by the buddy
> (e.g. MAX_ORDER - 1) to the buffer (this is where we could be better
> than s390x). You basically have that already.
>
> 3. If the buffer is full, try to isolate all pages and do a synchronous
> report to the hypervisor. You have the first part already. The second
> part would require a change (don't use a separate/global thread to do
> the hinting, just do it synchronously).
>
> 4. Once hinting is done, put back all isolated pages to the buddy. You
> basically have that already.
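To make sure we are talking about the same scheme, here is a minimal
sketch of steps 1-3 as I read them. All names here are placeholders;
in particular hint_report_to_host() is not an existing function, it
stands for the isolate + report + putback step:

#include <linux/mm.h>
#include <linux/percpu.h>

#define HINT_BUF_ENTRIES 512

struct hint_buf {
	unsigned long pfns[HINT_BUF_ENTRIES];
	unsigned int nr;
};

static DEFINE_PER_CPU(struct hint_buf, hint_buf);

/* Placeholder: isolate from buddy, report, put back still-free pages. */
static void hint_report_to_host(unsigned long *pfns, unsigned int nr);

/* Called from the freeing path, e.g. from arch_free_page(). */
static void hint_free_page(struct page *page)
{
	struct hint_buf *buf = get_cpu_ptr(&hint_buf);

	buf->pfns[buf->nr++] = page_to_pfn(page);
	if (buf->nr == HINT_BUF_ENTRIES) {
		/*
		 * Synchronous report every 512 frees: isolate the pages,
		 * tell the hypervisor (virtio or a bare hypercall), then
		 * put them back. No separate kernel thread involved.
		 */
		hint_report_to_host(buf->pfns, buf->nr);
		buf->nr = 0;
	}
	put_cpu_ptr(&hint_buf);
}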
> For 3. we can try what you have right now, using virtio. If we detect
> that's a problem, we can do it similar to what Alexander proposes and
> just do a bare hypercall. It's just a different way of carrying out the
> same task.
>
> This approach:
>
> 1. Mimics what s390x does, besides supporting different granularities.
> To synchronize guest->host we simply take the pages off the buddy.
>
> 2. Is basically what Alexander does; however, his design limitation is
> that doing any hinting on smaller granularities will not work, because
> there will be too many synchronous hints. Bad on fragmented guests.
>
> 3. Does not require any dynamic data structures in the guest.
>
> 4. Does not block allocation paths.
>
> 5. Blocks on e.g. every 512th free. It seems to work on s390x, so why
> shouldn't it for us? We have to measure.
>
> 6. We are free to decide which granularity we report.
>
> 7. Potentially works even if the guest memory is fragmented (few
> MAX_ORDER - 1 pages).
>
> It would be worth a try. My feeling is that a synchronous report after
> e.g. 512 frees should be acceptable, as it seems to be acceptable on
> s390x (basically always enabled, nobody complains).

The reason I like the current approach of reporting via a separate kernel
thread is that it doesn't block any regular allocation/freeing code path
in any way.

> We would have to play with how to enable/disable reporting and when not
> to report because it's not worth it in the guest (e.g. low on memory).
>
> Do you think something like this would be easy to change/implement and
> measure?

I can do that once I figure out a real-world guest workload with which
the two approaches can be compared.

> Thanks!
>
>> The following patch-set proposes an efficient mechanism for handing
>> freed memory between the guest and the host. It enables guests with no
>> page cache to rapidly free and reclaim memory to and from the host,
>> respectively.
>>
>> Benefit:
>> With this patch-series, in our test case, executed on a single system
>> and single NUMA node with 15 GB memory, we were able to successfully
>> launch at least 5 guests when page hinting was enabled and 3 without
>> it. (A detailed explanation of the test procedure is provided at the
>> bottom.)
>>
>> Changelog in V8:
>> In this patch-series, the earlier approach [1], which was used to
>> capture and scan the pages freed by the guest, has been changed. The
>> new approach is briefly described below:
>>
>> The patch-set still leverages the existing arch_free_page() to add
>> this functionality. It maintains a per-CPU array which is used to
>> store the pages freed by the guest. The maximum number of entries it
>> can hold is defined by MAX_FGPT_ENTRIES (1000). When the array is
>> completely filled, it is scanned and only the pages which are
>> available in the buddy are kept. This process continues until the
>> array is filled with pages which are part of the buddy free list,
>> after which it wakes up a per-CPU kernel thread.
>> This per-CPU kernel thread rescans the per-CPU array for any
>> re-allocation, and if a page has not been reallocated and is present
>> in the buddy, the kernel thread attempts to isolate it from the buddy.
>> If it is successfully isolated, the page is added to another per-CPU
>> array. Once the entire scanning process is complete, all the isolated
>> pages are reported to the host through the existing virtio-balloon
>> driver.
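To illustrate the capture path described above in one place, a rough
sketch (helper names are illustrative only, not the actual patch code):

#include <linux/mm.h>
#include <linux/percpu.h>
#include <linux/sched.h>

#define MAX_FGPT_ENTRIES 1000

struct free_pages {
	unsigned long pfns[MAX_FGPT_ENTRIES];
	unsigned int nr;
};

static DEFINE_PER_CPU(struct free_pages, freed_pages);
static DEFINE_PER_CPU(struct task_struct *, hinting_task);

/*
 * Placeholder: drops entries that are no longer free in the buddy and
 * returns the new entry count.
 */
static unsigned int retain_buddy_pages(unsigned long *pfns, unsigned int nr);

/* Hooked into the freeing path via arch_free_page(). */
static void capture_freed_page(struct page *page)
{
	struct free_pages *fp = get_cpu_ptr(&freed_pages);

	fp->pfns[fp->nr++] = page_to_pfn(page);
	if (fp->nr == MAX_FGPT_ENTRIES) {
		/* Keep only the entries that are still in the buddy. */
		fp->nr = retain_buddy_pages(fp->pfns, fp->nr);
		/*
		 * Only once the array is full of buddy pages does the
		 * per-CPU hinting thread run; it re-checks each page,
		 * isolates it from the buddy, and reports the isolated
		 * pages to the host via virtio-balloon.
		 */
		if (fp->nr == MAX_FGPT_ENTRIES)
			wake_up_process(this_cpu_read(hinting_task));
	}
	put_cpu_ptr(&freed_pages);
}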
>> Known Issues:
>> * Fixed array size: The problem with having a fixed/hardcoded array
>> size arises when the size of the guest varies. For example, when the
>> guest size increases and it starts making large allocations, the fixed
>> size limits this solution's ability to capture all the freed pages.
>> This will result in less guest free memory getting reported to the
>> host.
>>
>> Known code re-work:
>> * Plan to re-use Wei's work, which communicates the poison value to
>> the host.
>> * The nomenclature used in virtio-balloon needs to be changed so that
>> the code can easily be distinguished from Wei's Free Page Hint code.
>> * Sorting based on zonenum, to avoid taking repetitive zone locks for
>> the same zone.
>>
>> Other required work:
>> * Run other benchmarks to evaluate the performance/impact of this
>> approach.
>>
>> Test case:
>> Setup:
>> Memory: 15837 MB
>> Guest memory size: 5 GB
>> Swap: disabled
>> Test program: a simple program which allocates 4 GB of memory via
>> malloc, touches it via memset, and exits (a minimal sketch follows
>> below the quoted text).
>> Use case: the number of guests that can be launched completely,
>> including the successful execution of the test program.
>> Procedure:
>> The first guest is launched, and once its console is up, the test
>> allocation program is executed with a 4 GB memory request (due to
>> this, the guest occupies almost 4-5 GB of memory in the host on a
>> system without page hinting). Once this program exits, another guest
>> is launched in the host and the same process is followed. We continue
>> launching guests until a guest gets killed due to a low-memory
>> condition in the host.
>>
>> Result:
>> Without hinting: 3 guests
>> With hinting: 5 to 7 guests (based on the amount of memory
>> freed/captured).
>>
>> [1] https://www.spinics.net/lists/kvm/msg170113.html
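For reference, the test allocation program is essentially the following
(a minimal sketch; the exact source is not part of this series):

#include <stdlib.h>
#include <string.h>

#define ALLOC_SIZE (4UL << 30) /* 4 GB */

int main(void)
{
	char *buf = malloc(ALLOC_SIZE);

	if (!buf)
		return 1;
	/* Touch every byte so the memory is actually backed by the host. */
	memset(buf, 1, ALLOC_SIZE);
	free(buf);
	return 0;
}

--
Regards
Nitesh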