From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: Re: [RFC PATCH] mm/page_reporting: Adjust threshold according to MAX_ORDER
To: Gavin Shan, Alexander Duyck
Cc: linux-mm, LKML, Andrew Morton, shan.gavin@gmail.com, Anshuman Khandual
References: <20210601033319.100737-1-gshan@redhat.com>
 <76516781-6a70-f2b0-f3e3-da999c84350f@redhat.com>
 <0c0eb8c8-463d-d6f1-3cec-bbc0af0a229c@redhat.com>
 <63c06446-3b10-762c-3a29-464854b74e08@redhat.com>
 <0cb302f1-7fb6-e47c-e138-b7a03f2b02e2@redhat.com>
 <33b441b2-f10d-a7fb-8163-df2afbf6527d@redhat.com>
From: David Hildenbrand <david@redhat.com>
Organization: Red Hat
Message-ID: <9e553b30-ce18-df65-bd3c-c68eaa4d0d91@redhat.com>
Date: Wed, 16 Jun 2021 13:20:05 +0200
In-Reply-To: <33b441b2-f10d-a7fb-8163-df2afbf6527d@redhat.com>

On 16.06.21 15:16, Gavin Shan wrote:
> On 6/16/21 6:03 PM, David Hildenbrand wrote:
>> On 16.06.21 11:10, Gavin Shan wrote:
>>> On 6/15/21 12:26 PM, Alexander Duyck wrote:
>>>> On Mon, Jun 14, 2021 at 4:03 AM David Hildenbrand wrote:
>>>>> On 11.06.21 09:44, Gavin Shan wrote:
>>>>>> On 6/1/21 6:01 PM, David Hildenbrand wrote:
>>>>>>> On 01.06.21 05:33, Gavin Shan wrote:
>>>>>>>> The PAGE_REPORTING_MIN_ORDER is equal to @pageblock_order, taken as
>>>>>>>> the minimal order (threshold) to trigger page reporting. Page reporting
>>>>>>>> is never triggered with the following configurations and settings on
>>>>>>>> aarch64. In this particular scenario, page reporting won't be triggered
>>>>>>>> until the largest (2 ^ (MAX_ORDER-1)) free area is produced by page
>>>>>>>> freeing. That condition is very hard, or even impossible, to meet.
>>>>>>>>
>>>>>>>>      CONFIG_ARM64_PAGE_SHIFT:              16
>>>>>>>>      CONFIG_HUGETLB_PAGE:                  Y
>>>>>>>>      CONFIG_HUGETLB_PAGE_SIZE_VARIABLE:    N
>>>>>>>>      pageblock_order:                      13
>>>>>>>>      CONFIG_FORCE_MAX_ZONEORDER:           14
>>>>>>>>      MAX_ORDER:                            14
>>>>>>>>
>>>>>>>> The issue can be reproduced in a VM running a kernel with the above
>>>>>>>> configuration and settings. 'memhog' is used inside the VM to access
>>>>>>>> a 512MB anonymous area.
>>>>>>>> The QEMU's RSS doesn't drop accordingly after 'memhog' exits.
>>>>>>>>
>>>>>>>>      /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64          \
>>>>>>>>      -accel kvm -machine virt,gic-version=host                        \
>>>>>>>>      -cpu host -smp 8,sockets=2,cores=4,threads=1 -m 4096M,maxmem=64G \
>>>>>>>>      -object memory-backend-ram,id=mem0,size=2048M                    \
>>>>>>>>      -object memory-backend-ram,id=mem1,size=2048M                    \
>>>>>>>>      -numa node,nodeid=0,cpus=0-3,memdev=mem0                         \
>>>>>>>>      -numa node,nodeid=1,cpus=4-7,memdev=mem1                         \
>>>>>>>>        :                                                              \
>>>>>>>>      -device virtio-balloon-pci,id=balloon0,free-page-reporting=yes
>>>>>>>>
>>>>>>>> This tries to fix the issue by adjusting the threshold to the smaller
>>>>>>>> value of @pageblock_order and (MAX_ORDER/2). With this applied, the
>>>>>>>> QEMU's RSS drops after 'memhog' exits.
>>>>>>>
>>>>>>> IIRC, we use pageblock_order to
>>>>>>>
>>>>>>> a) Reduce the free page reporting overhead. Reporting on small chunks
>>>>>>> can make us report constantly with little system activity.
>>>>>>>
>>>>>>> b) Avoid splitting THP in the hypervisor, avoiding downgraded VM
>>>>>>> performance.
>>>>>>>
>>>>>>> c) Avoid affecting creation of pageblock_order pages while hinting is
>>>>>>> active. I think there are cases where "temporarily pulling
>>>>>>> sub-pageblock pages" can negatively affect creation of pageblock_order
>>>>>>> pages. Concurrent compaction would be one of these cases.
>>>>>>>
>>>>>>> The monstrosity called aarch64 64k is really special in that sense,
>>>>>>> because a) does not apply because pageblocks are just very big, b) does
>>>>>>> sometimes not apply because either our VM isn't backed by (rare) 512MB
>>>>>>> THP or uses 4k with 2MB THP, and c) similarly doesn't apply in smallish
>>>>>>> VMs because we don't really happen to create 512MB THP either way.
>>>>>>>
>>>>>>> For example, going on x86-64 from reporting 2MB to something like 32KB
>>>>>>> is absolutely undesired.
>>>>>>>
>>>>>>> I think if we want to go down that path (and I am not 100% sure yet if
>>>>>>> we want to), we really want to treat only the special case in a special
>>>>>>> way. Note that even when doing it only for aarch64 with 64k, you will
>>>>>>> still end up splitting THP in a hypervisor if it uses 64k base pages
>>>>>>> (b)) and can affect creation of THP, for example, when compacting (c),
>>>>>>> so there is a negative side to that.
>>>>>>>
>>>>>>
>>>>>> [Remove Alexander from the cc list as his mail isn't reachable]
>>>>>>
>>>>>
>>>>> [adding his gmail address which should be the right one]
>>>>>
>>>>>> David, thanks for your time to review and sorry for the delay and late
>>>>>> response.
>>>>>> I spent some time to get myself familiar with the code, but there are
>>>>>> still some questions to me, explained as below.
>>>>>>
>>>>>> Yes, @pageblock_order is currently taken as the page reporting
>>>>>> threshold. It will incur more overhead if the threshold is decreased,
>>>>>> as you said in (a).
>>>>>
>>>>> Right. Alex did quite some performance/overhead evaluation when
>>>>> introducing this feature. Changing the reporting granularity on most
>>>>> setups (esp., x86-64) is not desired IMHO.
>>>>
>>>> Yes, generally reporting pages comes at a fairly high cost, so it is
>>>> important to find the right trade-off between the size of the page and
>>>> the size of the batch of pages being reported. If the size of the
>>>> pages is reduced, it may be important to increase the batch size in
>>>> order to avoid paying too much in the way of overhead.
>>>>
>>>> The other main reason for holding to pageblock_order on x86 is to
>>>> avoid THP splitting. Anything smaller than pageblock_order will
>>>> trigger THP splitting, which will significantly hurt the performance of
>>>> the VM in general as it forces it down to order-0 pages.
>>>>
>>>
>>> Alex, thanks for your reply and sorry for taking your time for this
>>> discussion.
>>>
>>> Could you please confirm it's PAGE_REPORTING_CAPACITY or the budget
>>> used in page_reporting_cycle() when you're talking about "batch"?
>>>
>>> I don't understand how the THP splitting is triggered. As I understood,
>>> free page reporting works like this: the free pages are reported
>>> back to QEMU through virtio-balloon, and madvise(DONTNEED) is called on
>>> the free pages. For a THP-related VMA, its memory is guaranteed to be
>>> physically 2MB contiguous, and the 2MB memory is freed at once as I
>>> understand it. I don't think the 2MB memory can be freed partially,
>>> apart from THP splitting caused by some reasons, one of which is memory
>>> reclaim and swapping.
>>> However, I think the THP splitting caused by memory reclaim is
>>> irrelevant to what you were saying.
>>
>> Anonymous memory: Assume you have a 2 MB THP in the hypervisor. If you
>> madvise(DONTNEED, 4K), you'll split the THP and free the single 4k page
>> back to the buddy. That virtual memory region is no longer backed by a
>> physically contiguous 2 MB page, and there is no huge mapping in the
>> page tables. Instead, there is now an "ordinary" 4k mapping with a
>> hole. Bad for performance.
>>
>
> Ok, now I see, and thanks for your explanation. In this regard, we need
> to guarantee that the page reporting threshold is larger than or equal
> to the THP size. The THP size is 2MB or 512MB if the base page size is
> 4KB or 64KB, respectively.
>
> It makes the issue hard to fix, as we have a 512MB THP size with 64KB
> base page size on arm64. The following configurations are used in this
> case.
>
>      CONFIG_FORCE_MAX_ZONEORDER   14
>      MAX_ORDER                    14
>      pageblock_order              13
>
> The free page reporting won't be started until page freeing comes up
> with a 512MB free area. On a system with limited memory (e.g. 4GB), a
> 512MB free area is hard to come by due to memory fragmentation.
>

FWIW, in an ideal world the hypervisor would tell us (the guest) which
granularity it prefers. It could be that the hypervisor is using a
different page size / THP size ... but that's a different story :)

From a guest POV, it usually makes sense to report only whole
pageblocks. But as identified, this is an issue with abnormally large
pageblocks. 512MB pageblocks / THP is simply far from ideal for VMs,
kindly phrased. :)

-- 
Thanks,

David / dhildenb