From: Minchan Kim
To: 胡玮文
Cc: 胡玮文, Andrew Morton, linux-mm, LKML, Michal Hocko, David Hildenbrand, Suren Baghdasaryan, John Dias
Subject: Re: [RFC] mm: introduce page pinner
Date: Mon, 13 Dec 2021 14:58:47 -0800
References: <20211208115250.GA17274@DESKTOP-N4CECTO.huww98.cn>

On Fri, Dec 10, 2021 at 05:54:24PM +0800, 胡玮文 wrote:
> On Wed, Dec 08, 2021 at 10:42:37AM -0800, Minchan Kim wrote:
> > On Wed, Dec 08, 2021 at 07:54:35PM +0800, 胡玮文 wrote:
> > > On Mon, Dec 06, 2021 at 10:47:30AM -0800, Minchan Kim wrote:
> > > > The contiguous memory allocation fails if one of the pages in the
> > > > requested range has an unexpectedly elevated reference count, since
> > > > the VM couldn't migrate the page out. It's a very common pattern for
> > > > CMA allocation failure. The temporarily elevated page refcount can
> > > > come from various places, and it's really hard to chase down who held
> > > > the temporary page refcount at that time, which is the vital
> > > > information for debugging the allocation failure.
> >
> > Hi,
> >
> > Please don't cut down the original Cc list without a special reason.
>
> Sorry, my school SMTP server does not allow that many recipients. I have
> changed to Outlook.
>
> > > Hi Minchan,
> > >
> > > I'm a newbie here. We are debugging a problem where every CPU core is
> > > doing compaction but making no progress because of unexpected page
> > > refcounts. I'm interested in your approach, but this patch seems to
> > > cover only the CMA allocation path. Could it be extended to debugging
> > > migration failures during compaction? I'm not familiar with the kernel
> > > codebase; here is my untested thought:
> >
> > The compaction failure will produce a lot of events I wanted to avoid
> > in my system, but I think your case is reasonable if you don't mind the
> > large number of events.
> >
> > >
> > > diff --git a/mm/migrate.c b/mm/migrate.c
> > > index cf25b00f03c8..85dacbca8fa0 100644
> > > --- a/mm/migrate.c
> > > +++ b/mm/migrate.c
> > > @@ -46,6 +46,7 @@
> > >  #include
> > >  #include
> > >  #include
> > > +#include <linux/page_pinner.h>
> > >  #include
> > >  #include
> > >  #include
> > > @@ -388,8 +389,10 @@ int folio_migrate_mapping(struct address_space *mapping,
> > >
> > >  	if (!mapping) {
> > >  		/* Anonymous page without mapping */
> > > -		if (folio_ref_count(folio) != expected_count)
> > > +		if (folio_ref_count(folio) != expected_count) {
> > > +			page_pinner_failure(&folio->page);
> > >  			return -EAGAIN;
> > > +		}
> > >
> > >  		/* No turning back from here */
> > >  		newfolio->index = folio->index;
> > > @@ -406,6 +409,7 @@ int folio_migrate_mapping(struct address_space *mapping,
> > >  	xas_lock_irq(&xas);
> > >  	if (!folio_ref_freeze(folio, expected_count)) {
> > >  		xas_unlock_irq(&xas);
> > > +		page_pinner_failure(&folio->page);
> > >  		return -EAGAIN;
> > >  	}
> > >
> > > I'm not sure what to do with the new folio; it seems using folio->page in
> > > new code is not correct.
>
> I tested the above proposed patch; it works in my case. But it produces a lot
> of redundant page_pinner_put events. Before the true pinner reveals itself, the
> traced pages are get and put multiple times. Besides, when passed to
> page_pinner_failure(), the "count" is 3 in my case, and any of the 3 holders
> could be the pinner of interest. I think this is hard to avoid, and we can just
> let users work out which one is the pinner of interest. Maybe we need some docs
> about this.

If you run the experiment a bit more, you can find which are the most
interesting callsites, because usually the other two would be common paths like
LRU draining or rmap for page migration. Maybe you don't want to pay attention
to them. If they are too noisy, we may introduce some annotation to optionally
suppress events on such callsites. However, let's not make things too
complicated from the beginning.

>
> > If you want to cover compaction only, maybe this one:
> >
> > diff --git a/mm/compaction.c b/mm/compaction.c
> > index bfc93da1c2c7..7bfbf7205fb8 100644
> > --- a/mm/compaction.c
> > +++ b/mm/compaction.c
> > @@ -2400,6 +2400,11 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
> >  		/* All pages were either migrated or will be released */
> >  		cc->nr_migratepages = 0;
> >  		if (err) {
> > +			struct page *failed_page;
> > +
> > +			list_for_each_entry(failed_page, &cc->migratepages, lru)
> > +				page_pinner_failure(failed_page);
> > +
> >  			putback_movable_pages(&cc->migratepages);
> >  			/*
> >  			 * migrate_pages() may return -ENOMEM when scanners meet
>
> Maybe we should put the page_pinner_failure() calls as close to the real
> refcount check as possible, to avoid potential races and losing some
> page_pinner_put events? Besides, migrate_pages() will retry up to 10 times, and
> I imagine that someone may want to find out what is causing the retries. And
> migration may fail for a number of reasons, not only an unexpected refcount.

Yeah, I encountered a couple of different reasons when I chased down CMA issues
(e.g., LRU isolation failure), but I wanted to focus on the page refcount issue
with this page_pinner. I thought each subsystem (CMA, compaction, memory
hotplug, or other users of migration) would have different requirements and
strategies, so they would provide their own trace events for other failure
purposes unless it's related to page pinning.

And the page_pinner is already racy, since it doesn't take any lock to mark the
failure. So I thought it's not a problem that the trace event isn't triggered
right at the real refcount check. Since the work is based on deduction, the
user should catch that. Yeah, instead, we could warn about the race issue in
the documentation for users.

>
> I imagine that enabling page pinner for migration scenarios other than
> compaction could be helpful for others.

It's not too hard to add a new tracepoint in the general migrate_pages, so I
want to leave it for only the alloc_contig user and compaction until we hear
use cases from real users.

>
> > However, for that case, I want to introduce some filter options like a
> > failure reason(?)
> >
> > page_pinner_failure(pfn, reason)
> >
> > So, I could keep getting only CMA allocation failure events, not
> > compaction failure events.
>
> This sounds like a good idea to me. But how can we implement the filter? Can
> we reuse the trace event filter? I.e., if the page_pinner_failure event is
> filtered out, then we don't set the PAGE_EXT_PINNER flag, and effectively also
> filter out the corresponding page_pinner_put event. I can't see whether that is
> possible now: trace_page_pinner_failure() returns void, so it seems we cannot
> know whether the event got through.

We can introduce multiple failure reports, something like debug_page_ref.c:

void page_pinner_alloc_contig_failure(struct page *page)
{
	if (page_pinner_tracepoint_active(pp_fail_alloc_contig))
		__page_pinner_alloc_contig_failure(page);
}

void __page_pinner_alloc_contig_failure(struct page *page)
{
	trace_pp_alloc_contig_failure(page);
}

void page_pinner_compaction_failure(struct page *page)
{
	if (page_pinner_tracepoint_active(pp_fail_compaction))
		__page_pinner_compaction_failure(page);
}

void __page_pinner_compaction_failure(struct page *page)
{
	trace_pp_compaction_failure(page);
}

So the admin can enable only the events they are interested in.
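
As a rough illustration of the "did the event get through" part (a sketch only,
not code from the RFC): every TRACE_EVENT() also generates a
trace_<event>_enabled() helper, so each per-reason wrapper could skip both the
report and the PAGE_EXT_PINNER marking whenever its specific event is disabled.
The pp_compaction_failure event name and the __set_page_pinner() helper below
are assumed names for illustration, not existing symbols:

void page_pinner_compaction_failure(struct page *page)
{
	/* No consumer enabled this particular failure event: do nothing. */
	if (!trace_pp_compaction_failure_enabled())
		return;

	/* Mark the page so the later put path emits page_pinner_put events. */
	__set_page_pinner(page);
	trace_pp_compaction_failure(page);
}

That way the PAGE_EXT_PINNER flag, and therefore the follow-up page_pinner_put
events, would only be produced for failure reasons the admin actually enabled,
without having to store a per-page reason.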
>
> If this is not possible, we may need to allocate additional space to store the
> reason for each traced page, and also pass the reason to
> trace_page_pinner_put().
>
<snip>
> > > > diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
> > > > index 1e73717802f8..0ad4a3b8f4eb 100644
> > > > --- a/mm/Kconfig.debug
> > > > +++ b/mm/Kconfig.debug
> > > > @@ -62,6 +62,19 @@ config PAGE_OWNER
> > > >
> > > >  	  If unsure, say N.
> > > >
> > > > +config PAGE_PINNER
> > > > +	bool "Track page pinner"
> > > > +	select PAGE_EXTENSION
> > > > +	depends on DEBUG_KERNEL && TRACEPOINTS
> > > > +	help
> > > > +	  This keeps track of what call chain is the pinner of a page, and may
> > > > +	  help to find contiguous page allocation failures. Even if you include
> > > > +	  this feature in your build, it is disabled by default. You should
> > > > +	  pass "page_pinner=on" as a boot parameter in order to enable it. Eats
> > > > +	  a fair amount of memory if enabled.
> > >
> > > I'm a bit confused. It seems page pinner does not allocate any additional
> > > memory, even when you enable it via the boot parameter. So the description
> > > seems inaccurate.
> >
> > It will allocate page_ext descriptors, so it consumes memory.
>
> Thanks, I see. So it is 8 bytes for each 4k page. Not much, I think.

Yup, but for some people it's not too small either, considering how often the
problem happens. ;-)
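
(For a rough sense of scale, taking the 8 bytes per 4 KiB page figure above at
face value: a 64 GiB host has about 16 million base pages, so this feature's
share of page_ext would come to roughly 128 MiB.)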

> > > More info about my compaction issue:
> > >
> > > This call stack returns -EAGAIN in 99.9% of cases on the problematic host
> > > (Ubuntu 20.04 with kernel 5.11.0-40):
> > >
> > > migrate_page_move_mapping (now folio_migrate_mapping) <- returns -EAGAIN
> > > migrate_page
> > > fallback_migrate_page
> > > move_to_new_page
> > > migrate_pages
> > > compact_zone
> > > compact_zone_order
> > > try_to_compact_pages
> > > __alloc_pages_direct_compact
> > > __alloc_pages_slowpath.constprop.0
> > > __alloc_pages_nodemask
> > > alloc_pages_vma
> > > do_huge_pmd_anonymous_page
> > > __handle_mm_fault
> > > handle_mm_fault
> > > do_user_addr_fault
> > > exc_page_fault
> > > asm_exc_page_fault
> > >
> > > The offending pages are from shm, allocated by mmap() with MAP_SHARED by a
> > > machine learning program. They may have a relationship with NVIDIA CUDA,
> > > but I want to confirm this and make improvements if possible.
> >
> > So you are suspecting some kernel driver holds an additional refcount
> > using get_user_pages or the page get API?
>
> Yes. By using the trace events in this patch, I have confirmed it is the nvidia
> kernel module that holds the refcount. I got a stack trace like this (from
> "perf script"):
>
>     cuda-EvtHandlr 31023 [000]  3244.976411: page_pinner:page_pinner_put: pfn=0x13e473 flags=0x8001e count=0 mapcount=0 mapping=(nil) mt=1
>         ffffffff82511be4 __page_pinner_put+0x54 (/lib/modules/5.15.6+/build/vmlinux)
>         ffffffff82511be4 __page_pinner_put+0x54 (/lib/modules/5.15.6+/build/vmlinux)
>         ffffffffc0b71e1f os_unlock_user_pages+0xbf ([nvidia])
>         ffffffffc14a4546 _nv032165rm+0x96 ([nvidia])
>
> Still not much information. NVIDIA does not want me to debug its module. Maybe
> the only thing I can do is report this to NVIDIA.

I'm glad that you found the problem using page_pinner, and that the trace could
help push them to dig into it. It looks like that is already happening.

>
> > > When the issue reproduces, a single page fault that triggers a sync
> > > compaction can take tens of seconds. Then all 40 CPU threads are doing
> > > compaction, and the application runs several orders of magnitude slower.
> > >
> > > Disabling sync compaction is a workaround (the default is "madvise"):
> > >
> > > echo never > /sys/kernel/mm/transparent_hugepage/defrag
> > >
> > > Previously I asked for help at
> > > https://lore.kernel.org/linux-mm/20210516085644.13800-1-hdanton@sina.com/
> > > Now I have more information but still cannot pinpoint the root cause.
> > >
> > > Thanks,
> > > Hu Weiwen

Thanks for commenting and sharing your experience, Hu.