From mboxrd@z Thu Jan 1 00:00:00 1970
Reply-To: Gavin Shan
Subject: Re: [RFC PATCH] mm/page_reporting: Adjust threshold according to MAX_ORDER
To: David Hildenbrand, linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, akpm@linux-foundation.org, shan.gavin@gmail.com,
 Anshuman Khandual, Alexander Duyck
References: <20210601033319.100737-1-gshan@redhat.com>
 <76516781-6a70-f2b0-f3e3-da999c84350f@redhat.com>
 <0c0eb8c8-463d-d6f1-3cec-bbc0af0a229c@redhat.com>
From: Gavin Shan
Message-ID: <74b0d35f-707d-aa11-19e7-fedb74d77159@redhat.com>
Date: Wed, 16 Jun 2021 11:53:22 +1000

On 6/14/21 9:03 PM, David Hildenbrand wrote:
> On 11.06.21 09:44, Gavin Shan wrote:
>> On 6/1/21 6:01 PM, David Hildenbrand wrote:
>>> On 01.06.21 05:33, Gavin Shan wrote:
>>>> PAGE_REPORTING_MIN_ORDER is equal to @pageblock_order and is taken as
>>>> the minimal order (threshold) to trigger page reporting. Page reporting
>>>> is never triggered with the following configuration and settings on
>>>> aarch64: in this scenario, page reporting won't be triggered until the
>>>> largest (2 ^ (MAX_ORDER-1)) free area is formed by page freeing, a
>>>> condition that is very hard, or even impossible, to meet.
>>>>
>>>>     CONFIG_ARM64_PAGE_SHIFT:              16
>>>>     CONFIG_HUGETLB_PAGE:                  Y
>>>>     CONFIG_HUGETLB_PAGE_SIZE_VARIABLE:    N
>>>>     pageblock_order:                      13
>>>>     CONFIG_FORCE_MAX_ZONEORDER:           14
>>>>     MAX_ORDER:                            14
>>>>
>>>> The issue can be reproduced in a VM running a kernel with the above
>>>> configuration and settings. 'memhog' is used inside the VM to access a
>>>> 512MB anonymous area, and QEMU's RSS doesn't drop accordingly after
>>>> 'memhog' exits.
>>>>
>>>>     /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64           \
>>>>     -accel kvm -machine virt,gic-version=host                         \
>>>>     -cpu host -smp 8,sockets=2,cores=4,threads=1 -m 4096M,maxmem=64G  \
>>>>     -object memory-backend-ram,id=mem0,size=2048M                     \
>>>>     -object memory-backend-ram,id=mem1,size=2048M                     \
>>>>     -numa node,nodeid=0,cpus=0-3,memdev=mem0                          \
>>>>     -numa node,nodeid=1,cpus=4-7,memdev=mem1                          \
>>>>       :                                                               \
>>>>     -device virtio-balloon-pci,id=balloon0,free-page-reporting=yes
>>>>
>>>> This tries to fix the issue by adjusting the threshold to the smaller
>>>> value of @pageblock_order and (MAX_ORDER/2). With this applied, QEMU's
>>>> RSS drops after 'memhog' exits.
>>>
>>> IIRC, we use pageblock_order to
>>>
>>> a) Reduce the free page reporting overhead. Reporting on small chunks can
>>> make us report constantly with little system activity.
>>>
>>> b) Avoid splitting THP in the hypervisor, avoiding downgraded VM
>>> performance.
>>>
>>> c) Avoid affecting creation of pageblock_order pages while hinting is
>>> active. I think there are cases where "temporarily pulling sub-pageblock
>>> pages" can negatively affect creation of pageblock_order pages. Concurrent
>>> compaction would be one of these cases.
>>>
>>> The monstrosity called aarch64 64k is really special in that sense, because
>>> a) does not apply because pageblocks are just very big, b) sometimes does
>>> not apply because our VM either isn't backed by (rare) 512MB THP or uses 4k
>>> with 2MB THP, and c) similarly doesn't apply in smallish VMs because we
>>> don't really happen to create 512MB THP either way.
>>>
>>> For example, going on x86-64 from reporting 2MB to something like 32KB is
>>> absolutely undesired.
>>>
>>> I think if we want to go down that path (and I am not 100% sure yet if we
>>> want to), we really want to treat only the special case in a special way.
>>> Note that even when doing it only for aarch64 with 64k, you will still end
>>> up splitting THP in a hypervisor if it uses 64k base pages (b)) and can
>>> affect creation of THP, for example, when compacting (c), so there is a
>>> negative side to that.
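
(As a side note, to put concrete numbers on the orders involved: the
standalone sketch below -- not the actual patch -- computes the current
and adjusted reporting thresholds for the 64KB-page configuration from
the changelog above, where pageblock_order is 13 and MAX_ORDER is 14.)

    #include <stdio.h>

    /* Values from the aarch64 64KB-page configuration quoted above. */
    #define PAGE_SHIFT      16      /* 64KB base pages */
    #define PAGEBLOCK_ORDER 13
    #define MAX_ORDER       14

    /* Size in bytes of a free area of the given order. */
    static unsigned long order_to_bytes(unsigned int order)
    {
            return (1UL << order) << PAGE_SHIFT;
    }

    int main(void)
    {
            /* Current threshold: pageblock_order, i.e. order 13, which is
             * also the largest buddy order (MAX_ORDER - 1), so reporting
             * only fires once a full 512MB free area exists. */
            unsigned int cur = PAGEBLOCK_ORDER;

            /* Adjusted threshold: min(pageblock_order, MAX_ORDER / 2). */
            unsigned int adj = PAGEBLOCK_ORDER < MAX_ORDER / 2 ?
                               PAGEBLOCK_ORDER : MAX_ORDER / 2;

            printf("current threshold:  order %2u (%4lu MB)\n",
                   cur, order_to_bytes(cur) >> 20);
            printf("adjusted threshold: order %2u (%4lu MB)\n",
                   adj, order_to_bytes(adj) >> 20);
            return 0;
    }

Compiled and run, it prints order 13 (512 MB) for the current threshold
and order 7 (8 MB) for the adjusted one, which is why reporting never
fires today unless the guest happens to assemble a full 512MB free area.
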
>>>
>>
>> [Remove Alexander from the cc list as his mail isn't reachable]
>>
>
> [adding his gmail address which should be the right one]
>
>> David, thanks for your time to review, and sorry for the delayed
>> response. I spent some time getting familiar with the code, but I still
>> have some questions, explained below.
>>
>> Yes, @pageblock_order is currently taken as the page reporting threshold.
>> It will incur more overhead if the threshold is decreased, as you said
>> in (a).
>
> Right. Alex did quite some performance/overhead evaluation when
> introducing this feature. Changing the reporting granularity on most
> setups (esp. x86-64) is not desired IMHO.
>

Thanks for adding Alex's correct mail address, David.

>>
>> This patch tries to decrease the free page reporting threshold. The
>> @pageblock_order isn't touched. I don't understand how the code changes
>> affect THP splitting and the creation of page blocks mentioned in (b)
>> and (c). David, could you please provide more details?
>
> Think of it like this: while reporting to the hypervisor, we temporarily
> turn free/"movable" pieces of a pageblock "unmovable" -- see
> __isolate_free_page()->del_page_from_free_list(). While being reported to
> the hypervisor, these pages are not available and not even marked as
> PageBuddy() anymore.
>
> There are at least two scenarios I can see where this could affect
> creation of free pageblocks:
>
> a. Compaction. While compacting, we might identify completely movable/free
> pageblocks; however, actual compaction on such a pageblock can fail
> because some part is temporarily unmovable.
>
> b. Free/alloc sequences. Assume a pageblock is mostly free, except for two
> pages (x and y). Assume the following sequence:
>
> 1. free(x)
> 2. free(y)
> 3. alloc
>
> Before your change, after 1. and 2. we'll have a free pageblock, and 3.
> won't allocate from that pageblock.
>
> With your change, free page reporting might run after 1. After 2., we'll
> not have a free pageblock (until free page reporting finishes), and 3.
> might just reallocate what we freed in 2. and prevent having a free
> pageblock.
>
> No idea how relevant both points are in practice; however, the fundamental
> difference to the current handling is that we would turn parts of
> pageblocks temporarily unmovable, instead of complete pageblocks.
>

Thank you for the details. Without my changes, when the page reporting
threshold is @pageblock_order, the whole page block can become 'movable'
from 'unmovable'. I don't think that's what we want, but I need Alex to
confirm.

If we shouldn't change the page block's migration type in page reporting,
I guess I need an additional parameter for __isolate_free_page() so that
the migration type won't be changed by page reporting.

For (a), the 'movable' and 'unmovable' types are maintained at page block
granularity, so it seems the pages in one page block can't have different
migration types, unless I missed something.

For (b), that scenario can indeed happen. It means the changed page
reporting threshold could affect the page allocator's behaviour, which
could introduce more fragmentation. However, it really depends on how the
memory is allocated.

Thanks,
Gavin
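
P.S. To make the __isolate_free_page() idea above a bit more concrete, the
extra parameter could be along the lines of the sketch below. This is only
an illustration: the parameter name is invented here, and the current
in-tree prototype in mm/internal.h takes just the page and the order.

    /*
     * Sketch only, not a real patch: let callers ask __isolate_free_page()
     * to leave the pageblock migratetype untouched. Today, isolating a
     * large enough free page may flip the whole pageblock to
     * MIGRATE_MOVABLE as a side effect.
     */
    int __isolate_free_page(struct page *page, unsigned int order,
                            bool keep_migratetype);

Page reporting would pass keep_migratetype == true when pulling pages off
the free lists, so reporting no longer changes the pageblock's migration
type, while all existing callers pass false and keep the current behaviour.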