From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: Re: [RFC PATCH] mm/page_reporting: Adjust threshold according to MAX_ORDER
To: Gavin Shan, Alexander Duyck
Cc: linux-mm, LKML, Andrew Morton, shan.gavin@gmail.com, Anshuman Khandual
References: <20210601033319.100737-1-gshan@redhat.com>
 <76516781-6a70-f2b0-f3e3-da999c84350f@redhat.com>
 <0c0eb8c8-463d-d6f1-3cec-bbc0af0a229c@redhat.com>
 <63c06446-3b10-762c-3a29-464854b74e08@redhat.com>
 <0cb302f1-7fb6-e47c-e138-b7a03f2b02e2@redhat.com>
 <33b441b2-f10d-a7fb-8163-df2afbf6527d@redhat.com>
From: David Hildenbrand <david@redhat.com>
Organization: Red Hat
Message-ID: <9e553b30-ce18-df65-bd3c-c68eaa4d0d91@redhat.com>
Date: Wed, 16 Jun 2021 13:20:05 +0200
In-Reply-To: <33b441b2-f10d-a7fb-8163-df2afbf6527d@redhat.com>

On 16.06.21 15:16, Gavin Shan wrote:
> On 6/16/21 6:03 PM, David Hildenbrand wrote:
>> On 16.06.21 11:10, Gavin Shan wrote:
>>> On 6/15/21 12:26 PM, Alexander Duyck wrote:
>>>> On Mon, Jun 14, 2021 at 4:03 AM David Hildenbrand wrote:
>>>>> On 11.06.21 09:44, Gavin Shan wrote:
>>>>>> On 6/1/21 6:01 PM, David Hildenbrand wrote:
>>>>>>> On 01.06.21 05:33, Gavin Shan wrote:
>>>>>>>> The PAGE_REPORTING_MIN_ORDER is equal to @pageblock_order, taken as
>>>>>>>> the minimal order (threshold) to trigger page reporting. Page reporting
>>>>>>>> is never triggered with the following configurations and settings on
>>>>>>>> aarch64. In this particular scenario, page reporting won't be triggered
>>>>>>>> until the largest (2 ^ (MAX_ORDER-1)) free area is produced by page
>>>>>>>> freeing. That condition is very hard, or even impossible, to meet.
>>>>>>>>
>>>>>>>>      CONFIG_ARM64_PAGE_SHIFT:              16
>>>>>>>>      CONFIG_HUGETLB_PAGE:                  Y
>>>>>>>>      CONFIG_HUGETLB_PAGE_SIZE_VARIABLE:    N
>>>>>>>>      pageblock_order:                      13
>>>>>>>>      CONFIG_FORCE_MAX_ZONEORDER:           14
>>>>>>>>      MAX_ORDER:                            14
>>>>>>>>
>>>>>>>> The issue can be reproduced in a VM running a kernel with the above
>>>>>>>> configuration and settings. 'memhog' is used inside the VM to access
>>>>>>>> a 512MB anonymous area.
>>>>>>>> The QEMU's RSS doesn't drop accordingly after 'memhog' exits.
>>>>>>>>
>>>>>>>>      /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64          \
>>>>>>>>      -accel kvm -machine virt,gic-version=host                        \
>>>>>>>>      -cpu host -smp 8,sockets=2,cores=4,threads=1 -m 4096M,maxmem=64G \
>>>>>>>>      -object memory-backend-ram,id=mem0,size=2048M                    \
>>>>>>>>      -object memory-backend-ram,id=mem1,size=2048M                    \
>>>>>>>>      -numa node,nodeid=0,cpus=0-3,memdev=mem0                         \
>>>>>>>>      -numa node,nodeid=1,cpus=4-7,memdev=mem1                         \
>>>>>>>>        :                                                              \
>>>>>>>>      -device virtio-balloon-pci,id=balloon0,free-page-reporting=yes
>>>>>>>>
>>>>>>>> This tries to fix the issue by adjusting the threshold to the smaller
>>>>>>>> value of @pageblock_order and (MAX_ORDER/2). With this applied, the
>>>>>>>> QEMU's RSS drops after 'memhog' exits.
>>>>>>>
>>>>>>> IIRC, we use pageblock_order to
>>>>>>>
>>>>>>> a) Reduce the free page reporting overhead. Reporting on small chunks
>>>>>>> can make us report constantly with little system activity.
>>>>>>>
>>>>>>> b) Avoid splitting THP in the hypervisor, avoiding downgraded VM
>>>>>>> performance.
>>>>>>>
>>>>>>> c) Avoid affecting creation of pageblock_order pages while hinting is
>>>>>>> active. I think there are cases where "temporarily pulling
>>>>>>> sub-pageblock pages" can negatively affect creation of pageblock_order
>>>>>>> pages. Concurrent compaction would be one of these cases.
>>>>>>>
>>>>>>> The monstrosity called aarch64 64k is really special in that sense,
>>>>>>> because a) does not apply because pageblocks are just very big, b) does
>>>>>>> sometimes not apply because either our VM isn't backed by (rare) 512MB
>>>>>>> THP or uses 4k with 2MB THP, and c) similarly doesn't apply in smallish
>>>>>>> VMs because we don't really happen to create 512MB THP either way.
>>>>>>>
>>>>>>> For example, going on x86-64 from reporting 2MB to something like 32KB
>>>>>>> is absolutely undesired.
>>>>>>>
>>>>>>> I think if we want to go down that path (and I am not 100% sure yet if
>>>>>>> we want to), we really want to treat only the special case in a special
>>>>>>> way. Note that even when doing it only for aarch64 with 64k, you will
>>>>>>> still end up splitting THP in a hypervisor if it uses 64k base pages
>>>>>>> (b)) and can affect creation of THP, for example, when compacting (c),
>>>>>>> so there is a negative side to that.
>>>>>>>
>>>>>>
>>>>>> [Remove Alexander from the cc list as his mail isn't reachable]
>>>>>>
>>>>>
>>>>> [adding his gmail address which should be the right one]
>>>>>
>>>>>> David, thanks for your time to review and sorry for the delay and late
>>>>>> response.
>>>>>> I spent some time to get myself familiar with the code, but there are
>>>>>> still some questions to me, explained as below.
>>>>>>
>>>>>> Yes, @pageblock_order is currently taken as the page reporting
>>>>>> threshold. It will incur more overhead if the threshold is decreased,
>>>>>> as you said in (a).
>>>>>
>>>>> Right. Alex did quite some performance/overhead evaluation when
>>>>> introducing this feature. Changing the reporting granularity on most
>>>>> setups (esp., x86-64) is not desired IMHO.
>>>>
>>>> Yes, generally reporting pages comes at a fairly high cost, so it is
>>>> important to find the right trade-off between the size of the page and
>>>> the size of the batch of pages being reported. If the size of the
>>>> pages is reduced, it may be important to increase the batch size in
>>>> order to avoid paying too much in the way of overhead.
>>>>
>>>> The other main reason for holding to pageblock_order on x86 is to
>>>> avoid THP splitting. Anything smaller than pageblock_order will
>>>> trigger THP splitting, which will significantly hurt the performance of
>>>> the VM in general as it forces it down to order-0 pages.
>>>>
>>>
>>> Alex, thanks for your reply and sorry for taking your time for this
>>> discussion.
>>>
>>> Could you please confirm it's PAGE_REPORTING_CAPACITY or the budget
>>> used in page_reporting_cycle() when you're talking about "batch"?
>>>
>>> I don't understand how the THP splitting is triggered. As I understood,
>>> free page reporting works like this: the free pages are reported
>>> back to QEMU through virtio-balloon, and madvise(DONTNEED) is called on
>>> the free pages. For a THP-related VMA, its memory is guaranteed to be
>>> physically 2MB contiguous, and the 2MB memory is freed at once as I
>>> understand it. I don't think the 2MB memory can be freed partially,
>>> apart from THP splitting caused by some reasons, one of which is memory
>>> reclaim and swapping.
>>> However, I think the THP splitting caused by memory reclaim is
>>> irrelevant to what you were saying.
>>
>> Anonymous memory: Assume you have a 2 MB THP in the hypervisor. If you
>> madvise(DONTNEED, 4K), you'll split the THP and free the single 4k page
>> back to the buddy. That virtual memory region is no longer backed by a
>> physically contiguous 2 MB page, and there is no huge mapping in the
>> page tables. Instead, there is now an "ordinary" 4k mapping with a
>> hole. Bad for performance.
>>
>
> Ok, now I see, and thanks for your explanation. In this regard, we need
> to guarantee that the page reporting threshold is larger than or equal
> to the THP size. The THP size is 2MB or 512MB if the base page size is
> 4KB or 64KB, respectively.
>
> It makes the issue hard to fix, as we have a 512MB THP size with 64KB
> base page size on arm64. The following configurations are used in this
> case.
>
>      CONFIG_FORCE_MAX_ZONEORDER   14
>      MAX_ORDER                    14
>      pageblock_order              13
>
> The free page reporting won't be started until page freeing comes up
> with a 512MB free area. On a system with limited memory (e.g. 4GB), a
> 512MB free area is hard to come by due to memory fragmentation.
>

FWIW, in an ideal world the hypervisor would tell us (the guest) which
granularity it prefers. It could be that the hypervisor is using a
different page size / THP size ... but that's a different story :)

From a guest POV, it usually makes sense to report only whole
pageblocks. But as identified, this is an issue with abnormally large
pageblocks. 512MB pageblocks / THP is simply far from ideal for VMs,
kindly phrased. :)

-- 
Thanks,

David / dhildenb