From: Mike Kravetz <mike.kravetz@oracle.com>
To: Michal Hocko <mhocko@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
David Hildenbrand <david@redhat.com>,
linux-mm@kvack.org, linux-kernel@vger.kernel.org,
Naoya Horiguchi <naoya.horiguchi@linux.dev>,
James Houghton <jthoughton@google.com>,
Peter Xu <peterx@redhat.com>, Yang Shi <shy828301@gmail.com>,
Vishal Moola <vishal.moola@gmail.com>,
Matthew Wilcox <willy@infradead.org>,
Muchun Song <songmuchun@bytedance.com>,
stable@vger.kernel.org
Subject: Re: [PATCH 1/2] mm: hugetlb: proc: check for hugetlb shared PMD in /proc/PID/smaps
Date: Wed, 1 Feb 2023 13:05:35 -0800
Message-ID: <Y9rUHw2kuSwg2ntI@monkey>
In-Reply-To: <Y9oY9850e/8LQ78i@dhcp22.suse.cz>
On 02/01/23 08:47, Michal Hocko wrote:
> On Mon 30-01-23 14:08:47, Mike Kravetz wrote:
> > On 01/30/23 13:36, Michal Hocko wrote:
> > > On Fri 27-01-23 17:12:05, Mike Kravetz wrote:
> > > > On 01/27/23 15:04, Andrew Morton wrote:
> > > > > On Fri, 27 Jan 2023 17:23:39 +0100 David Hildenbrand <david@redhat.com> wrote:
> > > > > > On 26.01.23 23:27, Mike Kravetz wrote:
> > >
> > > Yes, this looks simple enough. My only concern would be that this
> > > special casing might be required in other places, which is hard to
> > > evaluate. I thought PSS reported by smaps would be broken as well, but it
> > > seems PSS is not really accounted for hugetlb mappings at all.
> > >
> > > Have you tried to look into {in,de}creasing the map count of the page when
> > > the pte is {un}shared for it?
> >
> > A quick thought is that it would not be too difficult. It would need
> > to include the following:
> > - At PMD share time in huge_pmd_share(),
> > Go through all entries in the PMD, and increment map and ref count for
> > all referenced pages. huge_pmd_share is just adding another sharing
> > process.
> > - At PMD unshare time in huge_pmd_unshare(),
> > Go through all entries in the PMD, and decrement map and ref count for
> > all referenced pages. huge_pmd_unshare is just removing one sharing
> > process.
> > - At page fault time, check if we are adding a new entry to a shared PMD.
> > If yes, add 'num_of_sharing_processes - 1' to the ref and map count.
> >
> > In each of the above operations, we are holding the PTL lock (which is
> > really the split/PMD lock) so synchronization should not be an issue.
> >
> > Although I mention processes sharing the PMD above, it is really mappings/vmas
> > sharing the PMD. You could have two mappings of the same object in the same
> > process sharing PMDs.
> >
> > I'll code this up and see how it looks.
>
> Thanks!
>
> > However, unless you have an objection I would prefer the simple patches
> > move forward, especially for stable backports.
>
> Yes, the current patch is much simpler and more suitable for stable
> backports. If the explicit map count modifications are not all that
> terrible then this would sound like a more appropriate long term plan
> though.
The approach mentioned above turned out to be simple enough. The patch is below.
I 'tested' with the same method and tests used to measure fault scalability
when developing vma based locking [1]. I figured this would be a good stress
of the share, unshare and fault paths. With the patch, we are doing more work
with the page table lock held, so I expected to see a small difference in
scalability, but not as large a difference as was actually measured:
                              next-20230131
test                  instances     unmodified      patched
--------------------------------------------------------------------------
Combined faults           24           61888.4      58314.8
Combined forks            24             157.3        130.1
These tests could look like a micro-benchmark targeting these code paths.
However, I put them together based on the description of a customer workload
that prompted the vma based locking work, and the performance of these tests
has tracked the performance of that workload.
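For reference, each test instance is roughly of the shape sketched below. This
is a simplification, not the actual test harness; the hugetlbfs path, sizes and
instance count are illustrative, and it assumes a mounted hugetlbfs with enough
huge pages reserved. Each child maps the same hugetlbfs file (large enough that
at least one PUD_SIZE aligned range can share PMDs), touches every huge page to
exercise the fault/share paths, and exits to exercise the unshare path:

/* Rough sketch only - path, sizes and counts are illustrative. */
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAP_SIZE	(2UL << 30)	/* 2GB: covers at least one aligned PUD range */
#define HPAGE_SIZE	(2UL << 20)	/* 2MB huge pages */

int main(void)
{
	int fd = open("/mnt/hugetlbfs/testfile", O_CREAT | O_RDWR, 0644);
	unsigned long i;
	int instance;

	if (fd < 0 || ftruncate(fd, MAP_SIZE))
		return 1;

	for (instance = 0; instance < 24; instance++) {
		if (fork() == 0) {
			char *p = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
				       MAP_SHARED, fd, 0);

			if (p == MAP_FAILED)
				exit(1);
			/* fault in every huge page; later children share PMDs */
			for (i = 0; i < MAP_SIZE; i += HPAGE_SIZE)
				p[i] = 1;
			/* exiting unmaps and exercises huge_pmd_unshare() */
			exit(0);
		}
	}
	while (wait(NULL) > 0)
		;
	return 0;
}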
This extra overhead is the cost of making map counts on pages mapped by a
shared PMD accurate and in line with what is normal and expected. I think it
is worth the cost. Other opinions? Of course, the patch below may have issues,
so please take a look.
[1] https://lore.kernel.org/linux-mm/20220914221810.95771-1-mike.kravetz@oracle.com/
From bff5a717521f96b0e5075ac4b5a1ef84a3589b7e Mon Sep 17 00:00:00 2001
From: Mike Kravetz <mike.kravetz@oracle.com>
Date: Mon, 30 Jan 2023 20:14:14 -0800
Subject: [PATCH] hugetlb: Adjust hugetlb page ref/map counts for PMD sharing
When hugetlb PMDs are shared, the sharing code simply adds the shared
PMD to another process's page table. It does not update the ref/map
counts of the pages referenced by the shared PMD. As a result, the ref/map
counts only reflect the single reference taken when each page was added
to the shared PMD. Even though the shared PMD may be in many process
page tables, the ref/map counts on the pages appear to be that of a
single process.
Update ref/map counts to take PMD sharing into account. This is done in
three distinct places:
1) At PMD share time in huge_pmd_share(),
Go through all entries in the PMD, and increment map and ref count for
all referenced pages. huge_pmd_share is just adding another use and
mapping of each page.
2) At PMD unshare time in huge_pmd_unshare(),
Go through all entries in the PMD, and decrement map and ref count for
all referenced pages. huge_pmd_unshare is just removing one use and
mapping of each page.
3) When faulting in a new hugetlb page,
Check if we are adding a new entry to a shared PMD. If yes, add
'num_of_sharing_processes - 1' to the ref and map count.
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
mm/hugetlb.c | 56 ++++++++++++++++++++++++++++++++++++++++++++++++----
1 file changed, 52 insertions(+), 4 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3a01a9dbf445..c7b1c6307a82 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -96,6 +96,7 @@ static void hugetlb_vma_lock_alloc(struct vm_area_struct *vma);
static void __hugetlb_vma_unlock_write_free(struct vm_area_struct *vma);
static void hugetlb_unshare_pmds(struct vm_area_struct *vma,
unsigned long start, unsigned long end);
+static void adjust_page_counts_for_shared_pmd(pte_t *ptep, struct folio *folio);
static inline bool subpool_is_free(struct hugepage_subpool *spool)
{
@@ -5905,10 +5906,12 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
if (!pte_same(huge_ptep_get(ptep), old_pte))
goto backout;
- if (anon_rmap)
+ if (anon_rmap) {
hugepage_add_new_anon_rmap(folio, vma, haddr);
- else
+ } else {
page_dup_file_rmap(&folio->page, true);
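+ /* ptep may be part of a shared PMD: account for the other sharers */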
+ adjust_page_counts_for_shared_pmd(ptep, folio);
+ }
new_pte = make_huge_pte(vma, &folio->page, ((vma->vm_flags & VM_WRITE)
&& (vma->vm_flags & VM_SHARED)));
/*
@@ -7036,6 +7039,43 @@ void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
*end = ALIGN(*end, PUD_SIZE);
}
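+/*
+ * A newly faulted-in hugetlb page may be mapped via an already shared PMD.
+ * If so, add 'number of sharers - 1' to its ref and map counts so they
+ * reflect all users of the shared PMD.  Called with the page table lock held.
+ */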
+static void adjust_page_counts_for_shared_pmd(pte_t *ptep, struct folio *folio)
+{
+ int shared_count = page_count(virt_to_page(ptep));
+
+ if (shared_count < 2)
+ return;
+
+ folio_ref_add(folio, shared_count - 1);
+ atomic_add(shared_count - 1, &folio->_entire_mapcount);
+}
+
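+/*
+ * Walk all entries in a shared PMD page and add (delta == 1) or drop
+ * (delta == -1) one ref and one map count on each mapped page.  Called
+ * when a mapping starts or stops sharing the PMD page.
+ */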
+static void adjust_shared_pmd_page_counts(pmd_t *pmd_start, int delta)
+{
+ struct folio *folio;
+ struct page *page;
+ pte_t *ptep, pte;
+ int i;
+
+ for (i = 0; i < PTRS_PER_PMD; i++) {
+ ptep = (pte_t *)(pmd_start + i);
+
+ pte = huge_ptep_get(ptep);
+ if (huge_pte_none(pte) || !pte_present(pte))
+ continue;
+
+ page = pte_page(pte);
+ folio = page_folio(page);
+ if (delta > 0) {
+ folio_get(folio);
+ atomic_inc(&folio->_entire_mapcount);
+ } else {
+ folio_put(folio);
+ atomic_dec(&folio->_entire_mapcount);
+ }
+ }
+}
+
/*
* Search for a shareable pmd page for hugetlb. In any case calls pmd_alloc()
* and returns the corresponding pte. While this is not necessary for the
@@ -7078,9 +7118,11 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
ptl = huge_pte_lock(hstate_vma(vma), mm, spte);
if (pud_none(*pud)) {
- pud_populate(mm, pud,
- (pmd_t *)((unsigned long)spte & PAGE_MASK));
+ pmd_t *pmdp = (pmd_t *)((unsigned long)spte & PAGE_MASK);
+
+ pud_populate(mm, pud, pmdp);
mm_inc_nr_pmds(mm);
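+ /* this mm is a new sharer: take ref/map counts on the mapped pages */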
+ adjust_shared_pmd_page_counts(pmdp, 1);
} else {
put_page(virt_to_page(spte));
}
@@ -7118,12 +7160,18 @@ int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
pud_clear(pud);
put_page(virt_to_page(ptep));
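+ /* one sharer is going away: drop a ref and map count on each mapped page */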
+ adjust_shared_pmd_page_counts(
+ (pmd_t *)((unsigned long)ptep & PAGE_MASK), -1);
mm_dec_nr_pmds(mm);
return 1;
}
#else /* !CONFIG_ARCH_WANT_HUGE_PMD_SHARE */
+static void adjust_page_counts_for_shared_pmd(pte_t *ptep, struct folio *folio)
+{
+}
+
pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long addr, pud_t *pud)
{
--
2.39.1