From: Nitesh Narayan Lal
Subject: Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting
Date: Mon, 18 Feb 2019 10:50:24 -0500
To: David Hildenbrand, kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
 pbonzini@redhat.com, lcapitulino@redhat.com, pagupta@redhat.com,
 wei.w.wang@intel.com, yang.zhang.wz@gmail.com, riel@surriel.com,
 mst@redhat.com, dodgen@google.com, konrad.wilk@oracle.com,
 dhildenb@redhat.com, aarcange@redhat.com, Alexander Duyck
References: <20190204201854.2328-1-nitesh@redhat.com>

On 2/16/19 4:40 AM, David Hildenbrand wrote:
> On 04.02.19 21:18, Nitesh Narayan Lal wrote:
>
> Hi Nitesh,
>
> I thought again about how s390x handles free page hinting. As that seems
> to work just fine, I guess sticking to a similar model makes sense.
>
> I already explained in this thread how it works on s390x; a short summary:
>
> 1. Each VCPU has a buffer of PFNs to be reported to the hypervisor. If I
> am not wrong, it contains 512 entries, so it is exactly one page big. This
> buffer is stored in the hypervisor and works at page granularity.
>
> 2. This page buffer is managed via the ESSA instruction. In addition, to
> synchronize with the guest ("page reused when freeing in the
> hypervisor"), special bits in the host->guest page table can be
> set/locked via the ESSA instruction by the guest and similarly accessed
> by the hypervisor.
>
> 3. Once the buffer is full, the guest does a synchronous hypercall,
> going over all 512 entries and zapping them (== similar to MADV_DONTNEED).
>
> To mimic that, we:
>
> 1. Have a static buffer per VCPU in the guest with 512 entries. You
> basically have that already.
>
> 2. On every free, add the page _or_ the page after merging by the buddy
> (e.g. MAX_ORDER - 1) to the buffer (this is where we could be better
> than s390x). You basically have that already.
>
> 3. If the buffer is full, try to isolate all pages and do a synchronous
> report to the hypervisor. You have the first part already. The second
> part would require a change (don't use a separate/global thread to do
> the hinting, just do it synchronously).
>
> 4. Once hinting is done, put back all isolated pages to the buddy. You
> basically have that already.
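To make sure we are talking about the same scheme, here is a minimal
sketch of steps 1-3 as I read them. All names here are placeholders;
in particular hint_report_to_host() is not an existing function, it
stands for the isolate + report + putback step:

#include <linux/mm.h>
#include <linux/percpu.h>

#define HINT_BUF_ENTRIES 512

struct hint_buf {
	unsigned long pfns[HINT_BUF_ENTRIES];
	unsigned int nr;
};

static DEFINE_PER_CPU(struct hint_buf, hint_buf);

/* Placeholder: isolate from buddy, report, put back still-free pages. */
static void hint_report_to_host(unsigned long *pfns, unsigned int nr);

/* Called from the freeing path, e.g. from arch_free_page(). */
static void hint_free_page(struct page *page)
{
	struct hint_buf *buf = get_cpu_ptr(&hint_buf);

	buf->pfns[buf->nr++] = page_to_pfn(page);
	if (buf->nr == HINT_BUF_ENTRIES) {
		/*
		 * Synchronous report every 512 frees: isolate the pages,
		 * tell the hypervisor (virtio or a bare hypercall), then
		 * put them back. No separate kernel thread involved.
		 */
		hint_report_to_host(buf->pfns, buf->nr);
		buf->nr = 0;
	}
	put_cpu_ptr(&hint_buf);
}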
> For 3. we can try what you have right now, using virtio. If we detect
> that's a problem, we can do it similar to what Alexander proposes and
> just do a bare hypercall. It's just a different way of carrying out the
> same task.
>
> This approach:
>
> 1. Mimics what s390x does, besides supporting different granularities.
> To synchronize guest->host we simply take the pages off the buddy.
>
> 2. Is basically what Alexander does; however, his design limitation is
> that doing any hinting on smaller granularities will not work, because
> there will be too many synchronous hints. Bad on fragmented guests.
>
> 3. Does not require any dynamic data structures in the guest.
>
> 4. Does not block allocation paths.
>
> 5. Blocks on e.g. every 512th free. It seems to work on s390x, so why
> shouldn't it for us? We have to measure.
>
> 6. We are free to decide which granularity we report.
>
> 7. Potentially works even if the guest memory is fragmented (few
> MAX_ORDER - 1 pages).
>
> It would be worth a try. My feeling is that a synchronous report after
> e.g. 512 frees should be acceptable, as it seems to be acceptable on
> s390x (basically always enabled, nobody complains).

The reason I like the current approach of reporting via a separate kernel
thread is that it doesn't block any regular allocation/freeing code path
in any way.

> We would have to play with how to enable/disable reporting and when not
> to report because it's not worth it in the guest (e.g. low on memory).
>
> Do you think something like this would be easy to change/implement and
> measure?

I can do that once I figure out a real-world guest workload with which
the two approaches can be compared.

> Thanks!
>
>> The following patch-set proposes an efficient mechanism for handing
>> freed memory between the guest and the host. It enables guests with no
>> page cache to rapidly free and reclaim memory to and from the host,
>> respectively.
>>
>> Benefit:
>> With this patch-series, in our test case, executed on a single system
>> and single NUMA node with 15 GB memory, we were able to successfully
>> launch at least 5 guests when page hinting was enabled and 3 without
>> it. (A detailed explanation of the test procedure is provided at the
>> bottom.)
>>
>> Changelog in V8:
>> In this patch-series, the earlier approach [1], which was used to
>> capture and scan the pages freed by the guest, has been changed. The
>> new approach is briefly described below:
>>
>> The patch-set still leverages the existing arch_free_page() to add
>> this functionality. It maintains a per-CPU array which is used to
>> store the pages freed by the guest. The maximum number of entries it
>> can hold is defined by MAX_FGPT_ENTRIES (1000). When the array is
>> completely filled, it is scanned and only the pages which are
>> available in the buddy are kept. This process continues until the
>> array is filled with pages which are part of the buddy free list,
>> after which it wakes up a per-CPU kernel thread.
>> This per-CPU kernel thread rescans the per-CPU array for any
>> re-allocation, and if a page has not been reallocated and is present
>> in the buddy, the kernel thread attempts to isolate it from the buddy.
>> If it is successfully isolated, the page is added to another per-CPU
>> array. Once the entire scanning process is complete, all the isolated
>> pages are reported to the host through the existing virtio-balloon
>> driver.
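To illustrate the capture path described above in one place, a rough
sketch (helper names are illustrative only, not the actual patch code):

#include <linux/mm.h>
#include <linux/percpu.h>
#include <linux/sched.h>

#define MAX_FGPT_ENTRIES 1000

struct free_pages {
	unsigned long pfns[MAX_FGPT_ENTRIES];
	unsigned int nr;
};

static DEFINE_PER_CPU(struct free_pages, freed_pages);
static DEFINE_PER_CPU(struct task_struct *, hinting_task);

/*
 * Placeholder: drops entries that are no longer free in the buddy and
 * returns the new entry count.
 */
static unsigned int retain_buddy_pages(unsigned long *pfns, unsigned int nr);

/* Hooked into the freeing path via arch_free_page(). */
static void capture_freed_page(struct page *page)
{
	struct free_pages *fp = get_cpu_ptr(&freed_pages);

	fp->pfns[fp->nr++] = page_to_pfn(page);
	if (fp->nr == MAX_FGPT_ENTRIES) {
		/* Keep only the entries that are still in the buddy. */
		fp->nr = retain_buddy_pages(fp->pfns, fp->nr);
		/*
		 * Only once the array is full of buddy pages does the
		 * per-CPU hinting thread run; it re-checks each page,
		 * isolates it from the buddy, and reports the isolated
		 * pages to the host via virtio-balloon.
		 */
		if (fp->nr == MAX_FGPT_ENTRIES)
			wake_up_process(this_cpu_read(hinting_task));
	}
	put_cpu_ptr(&freed_pages);
}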
>> Known Issues:
>> * Fixed array size: The problem with having a fixed/hardcoded array
>> size arises when the size of the guest varies. For example, when the
>> guest size increases and it starts making large allocations, the fixed
>> size limits this solution's ability to capture all the freed pages.
>> This will result in less guest free memory getting reported to the
>> host.
>>
>> Known code re-work:
>> * Plan to re-use Wei's work, which communicates the poison value to
>> the host.
>> * The nomenclature used in virtio-balloon needs to be changed so that
>> the code can easily be distinguished from Wei's Free Page Hint code.
>> * Sorting based on zonenum, to avoid taking repetitive zone locks for
>> the same zone.
>>
>> Other required work:
>> * Run other benchmarks to evaluate the performance/impact of this
>> approach.
>>
>> Test case:
>> Setup:
>> Memory: 15837 MB
>> Guest memory size: 5 GB
>> Swap: disabled
>> Test program: a simple program which allocates 4 GB of memory via
>> malloc, touches it via memset, and exits (a minimal sketch follows
>> below the quoted text).
>> Use case: the number of guests that can be launched completely,
>> including the successful execution of the test program.
>> Procedure:
>> The first guest is launched, and once its console is up, the test
>> allocation program is executed with a 4 GB memory request (due to
>> this, the guest occupies almost 4-5 GB of memory in the host on a
>> system without page hinting). Once this program exits, another guest
>> is launched in the host and the same process is followed. We continue
>> launching guests until a guest gets killed due to a low-memory
>> condition in the host.
>>
>> Result:
>> Without hinting: 3 guests
>> With hinting: 5 to 7 guests (based on the amount of memory
>> freed/captured).
>>
>> [1] https://www.spinics.net/lists/kvm/msg170113.html
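For reference, the test allocation program is essentially the following
(a minimal sketch; the exact source is not part of this series):

#include <stdlib.h>
#include <string.h>

#define ALLOC_SIZE (4UL << 30) /* 4 GB */

int main(void)
{
	char *buf = malloc(ALLOC_SIZE);

	if (!buf)
		return 1;
	/* Touch every byte so the memory is actually backed by the host. */
	memset(buf, 1, ALLOC_SIZE);
	free(buf);
	return 0;
}

--
Regards
Nitesh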