From mboxrd@z Thu Jan 1 00:00:00 1970
From: David Hildenbrand <david@redhat.com>
Organization: Red Hat
To: Gavin Shan, linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, akpm@linux-foundation.org,
 shan.gavin@gmail.com, Anshuman Khandual, Alexander Duyck
Subject: Re: [RFC PATCH] mm/page_reporting: Adjust threshold according to MAX_ORDER
Date: Mon, 14 Jun 2021 13:03:38 +0200
In-Reply-To: <0c0eb8c8-463d-d6f1-3cec-bbc0af0a229c@redhat.com>
References: <20210601033319.100737-1-gshan@redhat.com>
 <76516781-6a70-f2b0-f3e3-da999c84350f@redhat.com>
 <0c0eb8c8-463d-d6f1-3cec-bbc0af0a229c@redhat.com>

On 11.06.21 09:44, Gavin Shan wrote:
> On 6/1/21 6:01 PM, David Hildenbrand wrote:
>> On 01.06.21 05:33, Gavin Shan wrote:
>>> PAGE_REPORTING_MIN_ORDER is equal to @pageblock_order and is taken as
>>> the minimum order (threshold) that triggers page reporting. Page
>>> reporting is never triggered with the following configuration and
>>> settings on aarch64. In this particular scenario, page reporting won't
>>> be triggered until the largest (2 ^ (MAX_ORDER-1)) free area is
>>> produced by page freeing. That condition is very hard, or even
>>> impossible, to meet.
>>>
>>>    CONFIG_ARM64_PAGE_SHIFT:              16
>>>    CONFIG_HUGETLB_PAGE:                  Y
>>>    CONFIG_HUGETLB_PAGE_SIZE_VARIABLE:    N
>>>    pageblock_order:                      13
>>>    CONFIG_FORCE_MAX_ZONEORDER:           14
>>>    MAX_ORDER:                            14
>>>
>>> The issue can be reproduced in a VM running a kernel with the above
>>> configuration and settings. 'memhog' is used inside the VM to access a
>>> 512MB anonymous area. QEMU's RSS doesn't drop accordingly after
>>> 'memhog' exits.
>>>
>>>    /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64          \
>>>    -accel kvm -machine virt,gic-version=host                        \
>>>    -cpu host -smp 8,sockets=2,cores=4,threads=1 -m 4096M,maxmem=64G \
>>>    -object memory-backend-ram,id=mem0,size=2048M                    \
>>>    -object memory-backend-ram,id=mem1,size=2048M                    \
>>>    -numa node,nodeid=0,cpus=0-3,memdev=mem0                         \
>>>    -numa node,nodeid=1,cpus=4-7,memdev=mem1                         \
>>>      :                                                              \
>>>    -device virtio-balloon-pci,id=balloon0,free-page-reporting=yes
>>>
>>> This tries to fix the issue by adjusting the threshold to the smaller
>>> value of @pageblock_order and (MAX_ORDER/2). With this applied, QEMU's
>>> RSS drops after 'memhog' exits.
>>
>> IIRC, we use pageblock_order to
>>
>> a) Reduce the free page reporting overhead. Reporting on small chunks
>> can make us report constantly with little system activity.
>>
>> b) Avoid splitting THP in the hypervisor, avoiding degraded VM
>> performance.
>>
>> c) Avoid affecting creation of pageblock_order pages while hinting is
>> active. I think there are cases where "temporarily pulling
>> sub-pageblock pages" can negatively affect creation of pageblock_order
>> pages. Concurrent compaction would be one of these cases.
>>
>> The monstrosity called aarch64 64k is really special in that sense,
>> because a) does not apply because pageblocks are just very big, b) does
>> sometimes not apply because either our VM isn't backed by (rare) 512MB
>> THP or uses 4k with 2MB THP, and c) similarly doesn't apply in smallish
>> VMs because we don't really happen to create 512MB THP either way.
>>
>> For example, going on x86-64 from reporting 2MB to something like 32KB
>> is absolutely undesired.
>>
>> I think if we want to go down that path (and I am not 100% sure yet if
>> we want to), we really want to treat only the special case in a special
>> way. Note that even when doing it only for aarch64 with 64k, you will
>> still end up splitting THP in a hypervisor if it uses 64k base pages
>> (b)) and can affect creation of THP, for example, when compacting (c),
>> so there is a negative side to that.
>>
>
> [Remove Alexander from the cc list as his mail isn't reachable]
>

[adding his gmail address which should be the right one]

> David, thanks for your time to review and sorry for the delay and the
> late response. I spent some time getting familiar with the code, but I
> still have some questions, explained below.
>
> Yes, @pageblock_order is currently taken as the page reporting
> threshold. It will incur more overhead if the threshold is decreased,
> as you said in (a).

Right. Alex did quite some performance/overhead evaluation when
introducing this feature. Changing the reporting granularity on most
setups (esp., x86-64) is not desired IMHO.

>
> This patch tries to decrease the free page reporting threshold. The
> @pageblock_order isn't touched. I don't understand how the code changes
> affect THP splitting and the creation of page blocks mentioned in (b)
> and (c). David, could you please provide more details?

Think of it like this: while reporting to the hypervisor, we temporarily
turn free/"movable" pieces that are part of a pageblock "unmovable" -- see
__isolate_free_page()->del_page_from_free_list(). While reporting them to
the hypervisor, these pages are not available and not even marked as
PageBuddy() anymore.

There are at least two scenarios where this could affect creation of free
pageblocks that I can see:

a. Compaction. While compacting, we might identify completely movable/free
pageblocks; however, actual compaction on such a pageblock can fail
because some part of it is temporarily unmovable.

b. Free/alloc sequences. Assume a pageblock is mostly free, except for two
pages (x and y). Assume the following sequence:

1. free(x)
2. free(y)
3. alloc

Before your change, after 1. and 2. we'll have a free pageblock. 3. won't
allocate from that pageblock.

With your change, free page reporting might run after 1. After 2., we'll
not have a free pageblock (until free page reporting has finished), and
3. might just reallocate what we freed in 2. and prevent having a free
pageblock.

No idea how relevant both points are in practice; however, the fundamental
difference to the current handling is that we would turn parts of
pageblocks temporarily unmovable, instead of complete pageblocks.

-- 
Thanks,

David / dhildenb
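
As a rough illustration of the numbers discussed in this thread (not
kernel code): the short standalone C sketch below uses stand-in constants
mirroring the quoted aarch64 64K configuration (PAGE_SHIFT, MAX_ORDER,
PAGEBLOCK_ORDER) together with the min(pageblock_order, MAX_ORDER/2) rule
from the RFC's description, plus a hypothetical order_to_mib() helper, to
show the reporting granularity before and after the proposed change.

#include <stdio.h>

/*
 * Illustrative stand-in constants only -- they mirror the aarch64 64K
 * configuration quoted in the patch description, not a real kernel build.
 */
#define PAGE_SHIFT       16   /* CONFIG_ARM64_PAGE_SHIFT=16 -> 64K pages */
#define MAX_ORDER        14   /* CONFIG_FORCE_MAX_ZONEORDER=14 */
#define PAGEBLOCK_ORDER  13   /* pageblock_order on this configuration */

/* Size in MiB of a free area of the given buddy order. */
static unsigned long order_to_mib(unsigned int order)
{
	return (1UL << (order + PAGE_SHIFT)) >> 20;
}

int main(void)
{
	/* Current threshold: PAGE_REPORTING_MIN_ORDER == pageblock_order. */
	unsigned int cur_order = PAGEBLOCK_ORDER;
	/* Proposed threshold: min(pageblock_order, MAX_ORDER / 2). */
	unsigned int adj_order = PAGEBLOCK_ORDER < MAX_ORDER / 2 ?
				 PAGEBLOCK_ORDER : MAX_ORDER / 2;

	printf("current threshold:  order %u -> %lu MiB free areas\n",
	       cur_order, order_to_mib(cur_order));
	printf("proposed threshold: order %u -> %lu MiB free areas\n",
	       adj_order, order_to_mib(adj_order));
	return 0;
}

On this configuration the sketch prints 512 MiB for the current order-13
threshold versus 8 MiB for the adjusted order-7 one, which is why the
current reporting condition is so hard to meet; the objection raised in
the thread is that the same rule would also lower the granularity on
configurations such as x86-64, where that is not wanted.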