Subject: Re: [RFC PATCH] mm/page_reporting: Adjust threshold according to MAX_ORDER
From: Gavin Shan <gshan@redhat.com>
Reply-To: Gavin Shan
To: David Hildenbrand, Alexander Duyck
Cc: linux-mm, LKML, Andrew Morton, shan.gavin@gmail.com, Anshuman Khandual
Date: Wed, 16 Jun 2021 23:58:28 +1000
Message-ID: <3adbcad8-1016-cf48-4574-799de0bba6e4@redhat.com>
In-Reply-To: <9e553b30-ce18-df65-bd3c-c68eaa4d0d91@redhat.com>
References: <20210601033319.100737-1-gshan@redhat.com> <76516781-6a70-f2b0-f3e3-da999c84350f@redhat.com> <0c0eb8c8-463d-d6f1-3cec-bbc0af0a229c@redhat.com> <63c06446-3b10-762c-3a29-464854b74e08@redhat.com> <0cb302f1-7fb6-e47c-e138-b7a03f2b02e2@redhat.com> <33b441b2-f10d-a7fb-8163-df2afbf6527d@redhat.com> <9e553b30-ce18-df65-bd3c-c68eaa4d0d91@redhat.com>

On 6/16/21 9:20 PM, David Hildenbrand wrote:
> On 16.06.21 15:16, Gavin Shan wrote:
>> On 6/16/21 6:03 PM, David Hildenbrand wrote:
>>> On 16.06.21 11:10, Gavin Shan wrote:
>>>> On 6/15/21 12:26 PM, Alexander Duyck wrote:
>>>>> On Mon, Jun 14, 2021 at 4:03 AM David Hildenbrand wrote:
>>>>>> On 11.06.21 09:44, Gavin Shan wrote:
>>>>>>> On 6/1/21 6:01 PM, David Hildenbrand wrote:
>>>>>>>> On 01.06.21 05:33, Gavin Shan wrote:
>>>>>>>>> The PAGE_REPORTING_MIN_ORDER is equal to @pageblock_order, taken as
>>>>>>>>> the minimal order (threshold) to trigger page reporting. Page reporting
>>>>>>>>> is never triggered with the following configurations and settings on
>>>>>>>>> aarch64. In this particular scenario, page reporting won't be triggered
>>>>>>>>> until the largest (2 ^ (MAX_ORDER - 1)) free area is achieved from
>>>>>>>>> page freeing. The condition is very hard, or even impossible, to meet.
>>>>>>>>>
>>>>>>>>>        CONFIG_ARM64_PAGE_SHIFT:              16
>>>>>>>>>        CONFIG_HUGETLB_PAGE:                  Y
>>>>>>>>>        CONFIG_HUGETLB_PAGE_SIZE_VARIABLE:    N
>>>>>>>>>        pageblock_order:                      13
>>>>>>>>>        CONFIG_FORCE_MAX_ZONEORDER:           14
>>>>>>>>>        MAX_ORDER:                            14
>>>>>>>>>
>>>>>>>>> The issue can be reproduced in a VM, running a kernel with the above
>>>>>>>>> configurations and settings. 'memhog' is used inside the VM to access a
>>>>>>>>> 512MB anonymous area. QEMU's RSS doesn't drop accordingly after 'memhog'
>>>>>>>>> exits.
>>>>>>>>>
>>>>>>>>>        /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64          \
>>>>>>>>>        -accel kvm -machine virt,gic-version=host                        \
>>>>>>>>>        -cpu host -smp 8,sockets=2,cores=4,threads=1 -m 4096M,maxmem=64G \
>>>>>>>>>        -object memory-backend-ram,id=mem0,size=2048M                    \
>>>>>>>>>        -object memory-backend-ram,id=mem1,size=2048M                    \
>>>>>>>>>        -numa node,nodeid=0,cpus=0-3,memdev=mem0                         \
>>>>>>>>>        -numa node,nodeid=1,cpus=4-7,memdev=mem1                         \
>>>>>>>>>          :                                                              \
>>>>>>>>>        -device virtio-balloon-pci,id=balloon0,free-page-reporting=yes
>>>>>>>>>
>>>>>>>>> This tries to fix the issue by adjusting the threshold to the smaller
>>>>>>>>> value of @pageblock_order and (MAX_ORDER/2). With this applied, QEMU's
>>>>>>>>> RSS drops after 'memhog' exits.
>>>>>>>>
>>>>>>>> IIRC, we use pageblock_order to
>>>>>>>>
>>>>>>>> a) Reduce the free page reporting overhead. Reporting on small chunks
>>>>>>>> can make us report constantly with little system activity.
>>>>>>>>
>>>>>>>> b) Avoid splitting THP in the hypervisor, avoiding degraded VM
>>>>>>>> performance.
>>>>>>>>
>>>>>>>> c) Avoid affecting creation of pageblock_order pages while hinting is
>>>>>>>> active. I think there are cases where "temporarily pulling sub-pageblock
>>>>>>>> pages" can negatively affect creation of pageblock_order pages.
>>>>>>>> Concurrent compaction would be one of these cases.
>>>>>>>>
>>>>>>>> The monstrosity called aarch64 64k is really special in that sense,
>>>>>>>> because a) does not apply because pageblocks are just very big, b) does
>>>>>>>> sometimes not apply because either our VM isn't backed by (rare) 512MB
>>>>>>>> THP or uses 4k with 2MB THP, and c) similarly doesn't apply in smallish
>>>>>>>> VMs because we don't really happen to create 512MB THP either way.
>>>>>>>>
>>>>>>>> For example, going on x86-64 from reporting 2MB to something like 32KB
>>>>>>>> is absolutely undesired.
>>>>>>>>
>>>>>>>> I think if we want to go down that path (and I am not 100% sure yet if
>>>>>>>> we want to), we really want to treat only the special case in a special
>>>>>>>> way. Note that even when doing it only for aarch64 with 64k, you will
>>>>>>>> still end up splitting THP in the hypervisor if it uses 64k base pages
>>>>>>>> (b)), and it can affect creation of THP, for example, when compacting
>>>>>>>> (c), so there is a negative side to that.
>>>>>>>>
>>>>>>>
>>>>>>> [Remove Alexander from the cc list as his mail isn't reachable]
>>>>>>>
>>>>>>
>>>>>> [adding his gmail address which should be the right one]
>>>>>>
>>>>>>> David, thanks for your time to review, and sorry for the delay and late
>>>>>>> response. I spent some time getting familiar with the code, but I still
>>>>>>> have some questions, explained below.
>>>>>>>
>>>>>>> Yes, @pageblock_order is currently taken as the page reporting
>>>>>>> threshold. It will incur more overhead if the threshold is decreased,
>>>>>>> as you said in (a).
>>>>>>
>>>>>> Right. Alex did quite some performance/overhead evaluation when
>>>>>> introducing this feature. Changing the reporting granularity on most
>>>>>> setups (esp., x86-64) is not desired IMHO.
>>>>>
>>>>> Yes, generally reporting pages comes at a fairly high cost, so it is
>>>>> important to find the right trade-off between the size of the page and
>>>>> the size of the batch of pages being reported. If the size of the
>>>>> pages is reduced, it may be important to increase the batch size in
>>>>> order to avoid paying too much in the way of overhead.
>>>>>
>>>>> The other main reason for holding to pageblock_order on x86 is to
>>>>> avoid THP splitting. Anything smaller than pageblock_order will
>>>>> trigger THP splitting, which will significantly hurt the performance
>>>>> of the VM in general as it forces it down to order 0 pages.
>>>>>
>>>>
>>>> Alex, thanks for your reply, and sorry for taking your time on this
>>>> discussion.
>>>>
>>>> Could you please confirm whether it's PAGE_REPORTING_CAPACITY or the
>>>> budget used in page_reporting_cycle() that you mean by "batch"?
>>>>
>>>> I don't understand how the THP splitting is triggered. As I understood,
>>>> free page reporting works like this: the free pages are reported back
>>>> to QEMU through virtio-balloon, and madvise(DONTNEED) is called on the
>>>> free pages.
>>>> For a THP-backed VMA, its memory is guaranteed to be physically 2MB
>>>> contiguous. The 2MB memory is freed at once, as I understand it. I
>>>> don't think the 2MB memory can be freed partially, apart from THP
>>>> splitting caused for some reason. One such reason is memory reclaim
>>>> and swapping. However, I think THP splitting caused by memory reclaim
>>>> is irrelevant to what you were saying.
>>>
>>> Anonymous memory: Assume you have a 2 MB THP in the hypervisor. If you
>>> madvise(DONTNEED, 4K), you'll split the THP and free the single 4k page
>>> back to the buddy. That virtual memory region is no longer backed by a
>>> physically contiguous 2 MB page, and there is no huge mapping in the
>>> page tables. Instead, there is now an "ordinary" 4k mapping with a
>>> hole. Bad for performance.
>>>
>>
>> Ok, now I see, and thanks for your explanation. In this regard, we need
>> to guarantee that the page reporting threshold is larger than or equal
>> to the THP size. The THP size is 2MB or 512MB for a base page size of
>> 4KB or 64KB, respectively.
>>
>> That makes the issue hard to fix, as we have a 512MB THP size with 64KB
>> base page size on arm64. The following configurations are used in this
>> case.
>>
>>      CONFIG_FORCE_MAX_ZONEORDER          14
>>      MAX_ORDER                           14
>>      pageblock_order                     13
>>
>> Free page reporting won't be started until page freeing comes up with a
>> 512MB free area. On a system which has limited memory (e.g. 4GB), a
>> 512MB free area is hard to have due to memory fragmentation.
>>
>
> FWIW, in an ideal world the hypervisor would tell us (the guest) which
> granularity it prefers. Could be that the hypervisor is using a different
> page size / THP size ... but that's a different story :)
>
> From a guest POV, it usually makes sense to report only whole pageblocks.
> But as identified, this is an issue with abnormally large pageblocks.
>
> 512MB pageblocks / THP is simply far from ideal for VMs, kindly phrased. :)
>

Indeed. 512MB free pageblocks are rare, especially on systems which have
been up and running for a long time.

Free page reporting starts from the guest. Taking an extreme case: the
guest has 512MB memory and it's backed by one THP on the host. Free page
reporting won't work at all.

Besides, it seems free page reporting isn't guaranteed to work all the
time. For example, on a system with 4KB base page size, freeing individual
4KB pages may never come up with a free 2MB pageblock due to fragmentation.
In that case, the freed pages won't be reported immediately, but might be
reported after swapping or compaction under memory pressure. At the least,
the freed pages aren't reported immediately.

David, how about taking your suggestion to have a different threshold only
for arm64 with 64KB base page size? The threshold will certainly be smaller
than pageblock_order. There are two ways to do so; please let me know which
is the preferred way to go if you (and Alex) agree to do it.

(a) Introduce CONFIG_PAGE_REPORTING_ORDER for individual arches to choose
    the value. The threshold falls back to pageblock_order if it isn't
    configured.

(b) Rename PAGE_REPORTING_MIN_ORDER to PAGE_REPORTING_ORDER. Arches can
    decide its value. If it's not provided by the arch, it falls back to
    pageblock_order.

By the way, I recently did some performance testing on different page
sizes. We get much more performance gain from a 64KB (vs 4KB) page size in
the guest than from 512MB (vs 2MB) THP on the host.
This means performance won't be affected too much even if the 512MB THP is
split on the arm64 host.

Thanks,
Gavin