Subject: Re: [RFC PATCH] mm/page_reporting: Adjust threshold according to MAX_ORDER
To: David Hildenbrand, linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, akpm@linux-foundation.org, shan.gavin@gmail.com, Anshuman Khandual, Alexander Duyck
From: Gavin Shan
Date: Wed, 16 Jun 2021 22:59:33 +1000

On 6/16/21 5:59 PM, David Hildenbrand wrote:
> On 16.06.21 03:53, Gavin Shan wrote:
>> On 6/14/21 9:03 PM, David Hildenbrand wrote:
>>> On 11.06.21 09:44, Gavin Shan wrote:
>>>> On 6/1/21 6:01 PM, David Hildenbrand wrote:
>>>>> On 01.06.21 05:33, Gavin Shan wrote:
>>>>>> PAGE_REPORTING_MIN_ORDER is equal to @pageblock_order, which is taken
>>>>>> as the minimal order (threshold) to trigger page reporting. With the
>>>>>> following configuration and settings on aarch64, page reporting is
>>>>>> never triggered: it won't fire until a free area of the largest order
>>>>>> (2 ^ (MAX_ORDER - 1)) is formed by page freeing, a condition that is
>>>>>> very hard, or even impossible, to meet.
>>>>>>
>>>>>>      CONFIG_ARM64_PAGE_SHIFT:              16
>>>>>>      CONFIG_HUGETLB_PAGE:                  Y
>>>>>>      CONFIG_HUGETLB_PAGE_SIZE_VARIABLE:    N
>>>>>>      pageblock_order:                      13
>>>>>>      CONFIG_FORCE_MAX_ZONEORDER:           14
>>>>>>      MAX_ORDER:                            14
>>>>>>
>>>>>> The issue can be reproduced in a VM running a kernel with the above
>>>>>> configuration and settings. 'memhog' is used inside the VM to access
>>>>>> a 512MB anonymous area, and QEMU's RSS doesn't drop accordingly after
>>>>>> 'memhog' exits.
>>>>>>      /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64           \
>>>>>>      -accel kvm -machine virt,gic-version=host                         \
>>>>>>      -cpu host -smp 8,sockets=2,cores=4,threads=1 -m 4096M,maxmem=64G  \
>>>>>>      -object memory-backend-ram,id=mem0,size=2048M                     \
>>>>>>      -object memory-backend-ram,id=mem1,size=2048M                     \
>>>>>>      -numa node,nodeid=0,cpus=0-3,memdev=mem0                          \
>>>>>>      -numa node,nodeid=1,cpus=4-7,memdev=mem1                          \
>>>>>>        :                                                               \
>>>>>>      -device virtio-balloon-pci,id=balloon0,free-page-reporting=yes
>>>>>>
>>>>>> This patch fixes the issue by adjusting the threshold to the smaller
>>>>>> value of @pageblock_order and (MAX_ORDER / 2). With this applied,
>>>>>> QEMU's RSS drops after 'memhog' exits.
>>>>>
>>>>> IIRC, we use pageblock_order to
>>>>>
>>>>> a) Reduce the free page reporting overhead. Reporting on small chunks
>>>>> can make us report constantly with little system activity.
>>>>>
>>>>> b) Avoid splitting THP in the hypervisor, avoiding degraded VM
>>>>> performance.
>>>>>
>>>>> c) Avoid affecting creation of pageblock_order pages while hinting is
>>>>> active. I think there are cases where "temporarily pulling
>>>>> sub-pageblock pages" can negatively affect creation of pageblock_order
>>>>> pages. Concurrent compaction would be one of these cases.
>>>>>
>>>>> The monstrosity called aarch64 64k is really special in that sense,
>>>>> because a) does not apply since pageblocks are just very big, b)
>>>>> sometimes does not apply because either our VM isn't backed by (rare)
>>>>> 512MB THP or uses 4k with 2MB THP, and c) similarly doesn't apply in
>>>>> smallish VMs because we don't really happen to create 512MB THP either
>>>>> way.
>>>>>
>>>>> For example, going on x86-64 from reporting 2MB to something like 32KB
>>>>> is absolutely undesired.
>>>>>
>>>>> I think if we want to go down that path (and I am not 100% sure yet if
>>>>> we want to), we really want to treat only the special case in a
>>>>> special way. Note that even when doing it only for aarch64 with 64k,
>>>>> you will still end up splitting THP in the hypervisor if it uses 64k
>>>>> base pages (b)), and it can affect creation of THP, for example, when
>>>>> compacting (c), so there is a negative side to that.
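For clarity, the adjustment described in the quoted patch amounts to
something like the sketch below. It is only an illustration of the idea,
not the actual hunk, and page_reporting_order() is a hypothetical helper
name:

    /*
     * Illustrative sketch: take the smaller of pageblock_order and
     * MAX_ORDER / 2 as the reporting threshold. With the 64K arm64
     * settings quoted above (pageblock_order = 13, MAX_ORDER = 14),
     * reporting then fires at order 7 (8MB chunks) instead of waiting
     * for a full 512MB buddy block.
     */
    static unsigned int page_reporting_order(void)
    {
            return min_t(unsigned int, pageblock_order, MAX_ORDER / 2);
    }
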
>>>>>
>>>>
>>>> [Remove Alexander from the cc list as his mail isn't reachable]
>>>>
>>>
>>> [adding his gmail address which should be the right one]
>>>
>>>> David, thanks for your time to review, and sorry for the delayed
>>>> response. I spent some time getting familiar with the code, but I
>>>> still have some questions, explained below.
>>>>
>>>> Yes, @pageblock_order is currently taken as the page reporting
>>>> threshold. It will incur more overhead if the threshold is decreased,
>>>> as you said in (a).
>>>
>>> Right. Alex did quite some performance/overhead evaluation when
>>> introducing this feature. Changing the reporting granularity on most
>>> setups (esp. x86-64) is not desired IMHO.
>>>
>>
>> Thanks for adding Alex's correct mail address, David.
>>
>>>>
>>>> This patch tries to decrease the free page reporting threshold;
>>>> @pageblock_order itself isn't touched. I don't understand how the code
>>>> changes affect THP splitting and the creation of pageblocks mentioned
>>>> in (b) and (c). David, could you please provide more details?
>>>
>>> Think of it like this: while reporting to the hypervisor, we
>>> temporarily turn free/"movable" pieces of a pageblock "unmovable" --
>>> see __isolate_free_page()->del_page_from_free_list(). While being
>>> reported to the hypervisor, these pages are not available and not even
>>> marked as PageBuddy() anymore.
>>>
>>> There are at least two scenarios I can see where this could affect
>>> creation of free pageblocks:
>>>
>>> a. Compaction. While compacting, we might identify completely
>>> movable/free pageblocks; however, actual compaction on that pageblock
>>> can fail because some part is temporarily unmovable.
>>>
>>> b. Free/alloc sequences. Assume a pageblock is mostly free, except for
>>> two pages (x and y). Assume the following sequence:
>>>
>>> 1. free(x)
>>> 2. free(y)
>>> 3. alloc
>>>
>>> Before your change, after 1. and 2. we'll have a free pageblock, and 3.
>>> won't allocate from that pageblock.
>>>
>>> With your change, free page reporting might run after 1. After 2. we
>>> won't have a free pageblock (until free page reporting finishes), and
>>> 3. might just reallocate what we freed in 2. and prevent having a free
>>> pageblock.
>>>
>>> No idea how relevant both points are in practice; however, the
>>> fundamental difference to the current handling is that we would turn
>>> parts of pageblocks temporarily unmovable, instead of complete
>>> pageblocks.
>>>
>>
>> Thank you for the details. Without my changes, when the page reporting
>> threshold is @pageblock_order, the whole pageblock can become 'movable'
>> from 'unmovable'. I don't think that's what we want, but I need Alex's
>> confirmation.
>
> __isolate_free_page() will set the pageblock MIGRATE_MOVABLE in that
> case. It's only temporarily unmovable, while we're hinting.
>
> Note that MOVABLE vs. UNMOVABLE is just grouping for free pages, and
> even setting it to the wrong migratetype isn't "wrong" as in
> "correctness". It doesn't make a difference if there are no free pages
> because the whole block is isolated.
>

Yes, it doesn't matter since these pages have been isolated. The
migratetype is changed to MIGRATE_MOVABLE in __isolate_free_page().

My questions are actually:

(1) Is it possible for the migratetype to be changed from
    MIGRATE_UNMOVABLE to MIGRATE_MOVABLE in __isolate_free_page()?

(2) After free page reporting completes, is the migratetype restored to
    MIGRATE_UNMOVABLE?

Thanks,
Gavin
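
P.S. For reference, and only as context for the two questions above, here
is my paraphrase of the relevant mainline code as I read it; it may not be
exact:

    /*
     * __isolate_free_page(), paraphrased: when the isolated range covers
     * at least half a pageblock, the pageblock's migratetype is switched
     * to MIGRATE_MOVABLE (unless it is isolate/CMA/highatomic), regardless
     * of whether it was MIGRATE_UNMOVABLE before.
     */
    if (order >= pageblock_order - 1) {
            struct page *endpage = page + (1 << order) - 1;

            for (; page < endpage; page += pageblock_nr_pages) {
                    int mt = get_pageblock_migratetype(page);

                    if (!is_migrate_isolate(mt) && !is_migrate_cma(mt) &&
                        !is_migrate_highatomic(mt))
                            set_pageblock_migratetype(page, MIGRATE_MOVABLE);
            }
    }

    /*
     * page_reporting_drain(), paraphrased: a reported page is returned via
     * __putback_isolated_page() with the pageblock's *current*
     * migratetype, so I don't see the previous migratetype being restored
     * afterwards.
     */
    __putback_isolated_page(page, get_order(sg->length),
                            get_pageblock_migratetype(page));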