From: Peter Xu <peterx@redhat.com>
To: David Hildenbrand <david@redhat.com>
Cc: liubo <liubo254@huawei.com>,
akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org,
linux-kernel@vger.kernel.org, hughd@google.com,
willy@infradead.org
Subject: Re: [PATCH] smaps: Fix the abnormal memory statistics obtained through /proc/pid/smaps
Date: Thu, 27 Jul 2023 11:13:16 -0400 [thread overview]
Message-ID: <ZMKJjDaqZ7FW0jfe@x1n> (raw)
In-Reply-To: <f49c2a51-4dd8-784b-57fa-34fb397db2b7@redhat.com>
On Thu, Jul 27, 2023 at 03:28:49PM +0200, David Hildenbrand wrote:
> > > > Therefore, when obtaining pages through the follow_trans_huge_pmd
> > > > interface, add the FOLL_FORCE flag to count the pages corresponding to
> > > > PROTNONE to solve the above problem.
> > > >
> > >
> > > We really want to avoid the usage of FOLL_FORCE, and ideally limit it
> > > to ptrace only.
> >
> > Fundamentally when removing FOLL_NUMA we did already assumed !FORCE is
> > FOLL_NUMA. It means to me after the removal it's not possible to say in a
> > gup walker that "it's not FORCEd, but I don't want to trigger NUMA but just
> > get the page".
> >
> > Is that what we want? Shall we document that in FOLL_FORCE if we intended
> > to enforce numa balancing as long as !FORCE?
>
> That was the idea, yes. I could have sworn we had that at least in some
> patch description.
>
> Back then, I played with special-casing on gup_can_follow_protnone() on
> FOLL_GET | FOLL_PIN. But it's all just best guesses.
>
> Can always be added if deemed necessary and worth it.
>
> Here, it's simply an abuse of that GUP function that I wasn't aware of --
> otherwise I'd have removed that before hand.
>
> >
> > >
> > > > Signed-off-by: liubo <liubo254@huawei.com>
> > > > Fixes: 474098edac26 ("mm/gup: replace FOLL_NUMA by gup_can_follow_protnone()")
> > > > ---
> > > > fs/proc/task_mmu.c | 6 ++++--
> > > > 1 file changed, 4 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > > > index c1e6531cb02a..ed08f9b869e2 100644
> > > > --- a/fs/proc/task_mmu.c
> > > > +++ b/fs/proc/task_mmu.c
> > > > @@ -571,8 +571,10 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
> > > > bool migration = false;
> > > >
> > > > if (pmd_present(*pmd)) {
> > > > - /* FOLL_DUMP will return -EFAULT on huge zero page */
> > > > - page = follow_trans_huge_pmd(vma, addr, pmd, FOLL_DUMP);
> > > > + /* FOLL_DUMP will return -EFAULT on huge zero page
> > > > + * FOLL_FORCE follow a PROT_NONE mapped page
> > > > + */
> > > > + page = follow_trans_huge_pmd(vma, addr, pmd, FOLL_DUMP | FOLL_FORCE);
> > > > } else if (unlikely(thp_migration_supported() && is_swap_pmd(*pmd))) {
> > > > swp_entry_t entry = pmd_to_swp_entry(*pmd);
> > >
> > > Might do as an easy fix. But we really should get rid of that
> > > absolutely disgusting usage of follow_trans_huge_pmd().
> > >
> > > We don't need 99% of what follow_trans_huge_pmd() does here.
> > >
> > > Would the following also fix your issue?
> > >
> > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > > index 507cd4e59d07..fc744964816e 100644
> > > --- a/fs/proc/task_mmu.c
> > > +++ b/fs/proc/task_mmu.c
> > > @@ -587,8 +587,7 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
> > > bool migration = false;
> > >
> > > if (pmd_present(*pmd)) {
> > > - /* FOLL_DUMP will return -EFAULT on huge zero page */
> > > - page = follow_trans_huge_pmd(vma, addr, pmd, FOLL_DUMP);
> > > + page = vm_normal_page_pmd(vma, addr, *pmd);
> > > } else if (unlikely(thp_migration_supported() && is_swap_pmd(*pmd))) {
> > > swp_entry_t entry = pmd_to_swp_entry(*pmd);
> > >
> > > It also skips the shared zeropage and pmd_devmap(),
> > >
> > > Otherwise, a simple pmd_page(*pmd) + is_huge_zero_pmd(*pmd) check will do, but I
> > > suspect vm_normal_page_pmd() might be what we actually want to have here.
> > >
> > > Because smaps_pte_entry() properly checks for vm_normal_page().
> >
> > There're indeed some very trivial detail in vm_normal_page_pmd() that's
> > different, but maybe not so relevant. E.g.,
> >
> > if (WARN_ON_ONCE(folio_ref_count(folio) <= 0))
> > return -ENOMEM;
>
> Note that we're not even passing FOLL_GET | FOLL_PIN. Because we're not
> actually doing GUP. So the refcount is not that relevant.
>
> >
> > if (unlikely(!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page)))
> > return -EREMOTEIO;
> >
> > I'm not sure whether the p2pdma page would matter in any form here. E.g.,
> > whether it can be mapped privately.
>
> Good point, but I don't think that people messing with GUP even imagined
> that we would call that function from a !GUP place.
>
> This was wrong from the very start. If we're not in GUP, we shouldn't call
> GUP functions.
My understanding is !GET && !PIN is also called gup.. otherwise we don't
need GET and it can just be always implied.
The other proof is try_grab_page() doesn't fail hard on !GET && !PIN. So I
don't know whether that's "wrong" to be used..
Back to the topic: I'd say either of the patches look good to solve the
problem. If p2pdma pages are mapped as PFNMAP/MIXEDMAP (?), I guess
vm_normal_page_pmd() proposed here will also work on it, so nothing I see
wrong on 2nd one yet.
It looks nicer indeed to not have FOLL_FORCE here, but it also makes me
just wonder whether we should document NUMA behavior for FOLL_* somewhere,
because we have an implication right now on !FOLL_FORCE over NUMA, which is
not obvious to me..
And to look more over that aspect, see follow_page(): previously we can
follow a page for protnone (as it never applies FOLL_NUMA) but now it won't
(it never applies FOLL_FORCE, either, so it seems "accidentally" implies
FOLL_NUMA now). Not sure whether it's intended, though..
--
Peter Xu
next prev parent reply other threads:[~2023-07-27 15:14 UTC|newest]
Thread overview: 10+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-07-26 7:34 [PATCH] smaps: Fix the abnormal memory statistics obtained through /proc/pid/smaps liubo
2023-07-27 11:37 ` David Hildenbrand
2023-07-27 13:15 ` Peter Xu
2023-07-27 13:28 ` David Hildenbrand
2023-07-27 15:13 ` Peter Xu [this message]
2023-07-27 17:27 ` David Hildenbrand
2023-07-27 18:59 ` Peter Xu
2023-07-27 19:17 ` David Hildenbrand
2023-07-27 20:30 ` Peter Xu
2023-07-27 20:43 ` David Hildenbrand
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ZMKJjDaqZ7FW0jfe@x1n \
--to=peterx@redhat.com \
--cc=akpm@linux-foundation.org \
--cc=david@redhat.com \
--cc=hughd@google.com \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=liubo254@huawei.com \
--cc=willy@infradead.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.