All of lore.kernel.org
 help / color / mirror / Atom feed
From: Usama Arif <usamaarif642@gmail.com>
To: David Hildenbrand <david@redhat.com>,
	akpm@linux-foundation.org, linux-mm@kvack.org
Cc: hannes@cmpxchg.org, riel@surriel.com, shakeel.butt@linux.dev,
	roman.gushchin@linux.dev, yuzhao@google.com, baohua@kernel.org,
	ryan.roberts@arm.com, rppt@kernel.org, willy@infradead.org,
	cerasuolodomenico@gmail.com, corbet@lwn.net,
	linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org,
	kernel-team@meta.com
Subject: Re: [PATCH 0/6] mm: split underutilized THPs
Date: Tue, 30 Jul 2024 18:22:46 +0100	[thread overview]
Message-ID: <01899bc3-1920-4ff2-a470-decd1c282e38@gmail.com> (raw)
In-Reply-To: <95ed1631-ff62-4627-8dc6-332096e673b4@redhat.com>



On 30/07/2024 17:11, David Hildenbrand wrote:
> On 30.07.24 17:19, Usama Arif wrote:
>>
>>
>> On 30/07/2024 16:14, Usama Arif wrote:
>>>
>>>
>>> On 30/07/2024 15:35, David Hildenbrand wrote:
>>>> On 30.07.24 14:45, Usama Arif wrote:
>>>>> The current upstream default policy for THP is always. However, Meta
>>>>> uses madvise in production as the current THP=always policy vastly
>>>>> overprovisions THPs in sparsely accessed memory areas, resulting in
>>>>> excessive memory pressure and premature OOM killing.
>>>>> Using madvise + relying on khugepaged has certain drawbacks over
>>>>> THP=always. Using madvise hints mean THPs aren't "transparent" and
>>>>> require userspace changes. Waiting for khugepaged to scan memory and
>>>>> collapse pages into THP can be slow and unpredictable in terms of performance
>>>>> (i.e. you dont know when the collapse will happen), while production
>>>>> environments require predictable performance. If there is enough memory
>>>>> available, its better for both performance and predictability to have
>>>>> a THP from fault time, i.e. THP=always rather than wait for khugepaged
>>>>> to collapse it, and deal with sparsely populated THPs when the system is
>>>>> running out of memory.
>>>>>
>>>>> This patch-series is an attempt to mitigate the issue of running out of
>>>>> memory when THP is always enabled. During runtime whenever a THP is being
>>>>> faulted in or collapsed by khugepaged, the THP is added to a list.
>>>>> Whenever memory reclaim happens, the kernel runs the deferred_split
>>>>> shrinker which goes through the list and checks if the THP was underutilized,
>>>>> i.e. how many of the base 4K pages of the entire THP were zero-filled.
>>>>> If this number goes above a certain threshold, the shrinker will attempt
>>>>> to split that THP. Then at remap time, the pages that were zero-filled are
>>>>> not remapped, hence saving memory. This method avoids the downside of
>>>>> wasting memory in areas where THP is sparsely filled when THP is always
>>>>> enabled, while still providing the upside THPs like reduced TLB misses without
>>>>> having to use madvise.
>>>>>
>>>>> Meta production workloads that were CPU bound (>99% CPU utilzation) were
>>>>> tested with THP shrinker. The results after 2 hours are as follows:
>>>>>
>>>>>                               | THP=madvise |  THP=always   | THP=always
>>>>>                               |             |               | + shrinker series
>>>>>                               |             |               | + max_ptes_none=409
>>>>> -----------------------------------------------------------------------------
>>>>> Performance improvement     |      -      |    +1.8%      |     +1.7%
>>>>> (over THP=madvise)          |             |               |
>>>>> -----------------------------------------------------------------------------
>>>>> Memory usage                |    54.6G    | 58.8G (+7.7%) |   55.9G (+2.4%)
>>>>> -----------------------------------------------------------------------------
>>>>> max_ptes_none=409 means that any THP that has more than 409 out of 512
>>>>> (80%) zero filled filled pages will be split.
>>>>>
>>>>> To test out the patches, the below commands without the shrinker will
>>>>> invoke OOM killer immediately and kill stress, but will not fail with
>>>>> the shrinker:
>>>>>
>>>>> echo 450 > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
>>>>> mkdir /sys/fs/cgroup/test
>>>>> echo $$ > /sys/fs/cgroup/test/cgroup.procs
>>>>> echo 20M > /sys/fs/cgroup/test/memory.max
>>>>> echo 0 > /sys/fs/cgroup/test/memory.swap.max
>>>>> # allocate twice memory.max for each stress worker and touch 40/512 of
>>>>> # each THP, i.e. vm-stride 50K.
>>>>> # With the shrinker, max_ptes_none of 470 and below won't invoke OOM
>>>>> # killer.
>>>>> # Without the shrinker, OOM killer is invoked immediately irrespective
>>>>> # of max_ptes_none value and kill stress.
>>>>> stress --vm 1 --vm-bytes 40M --vm-stride 50K
>>>>>
>>>>> Patches 1-2 add back helper functions that were previously removed
>>>>> to operate on page lists (needed by patch 3).
>>>>> Patch 3 is an optimization to free zapped tail pages rather than
>>>>> waiting for page reclaim or migration.
>>>>> Patch 4 is a prerequisite for THP shrinker to not remap zero-filled
>>>>> subpages when splitting THP.
>>>>> Patches 6 adds support for THP shrinker.
>>>>>
>>>>> (This patch-series restarts the work on having a THP shrinker in kernel
>>>>> originally done in
>>>>> https://lore.kernel.org/all/cover.1667454613.git.alexlzhu@fb.com/.
>>>>> The THP shrinker in this series is significantly different than the
>>>>> original one, hence its labelled v1 (although the prerequisite to not
>>>>> remap clean subpages is the same).)
>>>>
>>>> As shared previously, there is one issue with uffd (even when currently not active for a VMA!), where we must not zap present page table entries.
>>>>
>>>> Something that is always possible (assuming no GUP pins of course, which) is replacing the zero-filled subpages by shared zeropages.
>>>>
>>>> Is that being done in this patch set already, or are we creating pte_none() entries?
>>>>
>>>
>>> I think thats done in Patch 4/6. In function try_to_unmap_unused, we have below which I think does what you are suggesting? i.e. point to shared zeropage and not clear pte for uffd armed vma.
>>>
>>>     if (userfaultfd_armed(pvmw->vma)) {
>>>         newpte = pte_mkspecial(pfn_pte(page_to_pfn(ZERO_PAGE(pvmw->address)),
>>>                            pvmw->vma->vm_page_prot));
>>>         ptep_clear_flush(pvmw->vma, pvmw->address, pvmw->pte);
>>>         set_pte_at(pvmw->vma->vm_mm, pvmw->address, pvmw->pte, newpte);
>>>     }
>>
>>
>> Ah are you suggesting userfaultfd_armed(pvmw->vma) will evaluate to false even if its uffd? I think something like below would work in that case.
> 
> I remember one ugly case in QEMU with postcopy live-migration where we must not zap zero-filled pages. I am not 100% regarding THP (if it could be enabled at that point), but imagine the following
> 
> 1) mmap(), enable THP
> 2) Migrate a bunch of pages from the source during precopy (writing to
>    the memory). Might end up creating THPs (during fault/khugepaged)
> 3) Register UFFD on the VMA
> 4) Disable new THPs from forming via MADV_NOHUGEPAGE on the VMA
> 5) Discard any pages that have been re-dirtied or not migrated yet
> 6) Migrate-on-demand any holes using uffd
> 
> 
> If we discard zero-filled pages between 2) and 3) we might get wrong uffd notifications in 6 for pages that have already been migrated).
> 
> I'll have to check if that actually happens in that sequence in QEMU: if QEMU would disable THP right before 2) we would be safe. But I recall that it is not the case :/
> 
> 

Thanks for the example!

Just to understand the issue better, as I am not very familiar with live-migration code, the problem is only for zero-filled pages that were migrated, right? If a THP is created and a subpage of it was a zero-page that was migrated and its split before VMA is armed with uffd, userfaultfd_armed(pvmw->vma) will return false when splitting and it will become pte_none. And afterwards when the destination faults on it, uffd will see that its pte_clear and will request the zero-page back from source. Uffd will then have to get the page again from source.

If I understand the example correctly, the below diff over patch 6 should be good? i.e. just point to the empty_zero_page instead of doing pte_clear. This should still use the same amount of memory, although ptep_clear_flush means it might be slighly more expensive.

diff --git a/mm/migrate.c b/mm/migrate.c
index 2731ac20ff33..52aa4770fbed 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -206,14 +206,10 @@ static bool try_to_unmap_unused(struct page_vma_mapped_walk *pvmw,
        if (dirty)
                return false;
 
-       pte_clear_not_present_full(pvmw->vma->vm_mm, pvmw->address, pvmw->pte, false);
-
-       if (userfaultfd_armed(pvmw->vma)) {
-               newpte = pte_mkspecial(pfn_pte(page_to_pfn(ZERO_PAGE(pvmw->address)),
-                                              pvmw->vma->vm_page_prot));
-               ptep_clear_flush(pvmw->vma, pvmw->address, pvmw->pte);
-               set_pte_at(pvmw->vma->vm_mm, pvmw->address, pvmw->pte, newpte);
-       }
+       newpte = pte_mkspecial(pfn_pte(page_to_pfn(ZERO_PAGE(pvmw->address)),
+                                       pvmw->vma->vm_page_prot));
+       ptep_clear_flush(pvmw->vma, pvmw->address, pvmw->pte);
+       set_pte_at(pvmw->vma->vm_mm, pvmw->address, pvmw->pte, newpte);
 
        dec_mm_counter(pvmw->vma->vm_mm, mm_counter(folio));
        return true;





  reply	other threads:[~2024-07-30 17:22 UTC|newest]

Thread overview: 47+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-07-30 12:45 [PATCH 0/6] mm: split underutilized THPs Usama Arif
2024-07-30 12:45 ` [PATCH 1/6] Revert "memcg: remove mem_cgroup_uncharge_list()" Usama Arif
2024-07-30 12:45 ` [PATCH 2/6] Revert "mm: remove free_unref_page_list()" Usama Arif
2024-07-30 12:46 ` [PATCH 3/6] mm: free zapped tail pages when splitting isolated thp Usama Arif
2024-07-30 15:14   ` David Hildenbrand
2024-08-04 19:02     ` Usama Arif
2024-08-05  9:00       ` David Hildenbrand
2024-08-06  9:58         ` Usama Arif
2024-07-30 12:46 ` [PATCH 4/6] mm: don't remap unused subpages " Usama Arif
2024-07-30 18:07   ` Rik van Riel
2024-07-31 17:08     ` Usama Arif
2024-07-30 12:46 ` [PATCH 5/6] mm: add selftests to split_huge_page() to verify unmap/zap of zero pages Usama Arif
2024-07-30 18:10   ` Rik van Riel
2024-08-01  4:45   ` kernel test robot
2024-08-06 22:02     ` Usama Arif
2024-07-30 12:46 ` [PATCH 6/6] mm: split underutilized THPs Usama Arif
2024-07-30 13:59   ` Randy Dunlap
2024-07-30 14:35 ` [PATCH 0/6] " David Hildenbrand
2024-07-30 15:14   ` Usama Arif
2024-07-30 15:19     ` Usama Arif
2024-07-30 16:11       ` David Hildenbrand
2024-07-30 17:22         ` Usama Arif [this message]
2024-07-30 20:25           ` David Hildenbrand
2024-07-31 17:01             ` Usama Arif
2024-07-31 17:51               ` David Hildenbrand
2024-07-31 20:41                 ` Usama Arif
2024-08-01  6:36                   ` David Hildenbrand
2024-08-04 23:04                     ` Usama Arif
2024-08-06 17:17                       ` Usama Arif
2024-08-06 17:30                         ` David Hildenbrand
2024-08-06 17:28                     ` Johannes Weiner
2024-08-06 17:33                       ` David Hildenbrand
2024-08-01  6:09 ` Yu Zhao
2024-08-01 15:47   ` David Hildenbrand
2024-08-04 21:54     ` Yu Zhao
2024-08-05  1:32       ` Rik van Riel
2024-08-05 19:51         ` Yu Zhao
2024-08-01 16:22   ` Usama Arif
2024-08-01 16:27     ` David Hildenbrand
2024-08-04 19:10       ` Usama Arif
2024-08-04 23:32       ` Yu Zhao
2024-08-04 23:23     ` Yu Zhao
2024-08-06 11:18       ` Usama Arif
2024-08-06 17:38   ` Johannes Weiner
2024-08-06 18:06     ` Yu Zhao
2024-08-06 19:54       ` Johannes Weiner
2024-08-06 20:53         ` Yu Zhao

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=01899bc3-1920-4ff2-a470-decd1c282e38@gmail.com \
    --to=usamaarif642@gmail.com \
    --cc=akpm@linux-foundation.org \
    --cc=baohua@kernel.org \
    --cc=cerasuolodomenico@gmail.com \
    --cc=corbet@lwn.net \
    --cc=david@redhat.com \
    --cc=hannes@cmpxchg.org \
    --cc=kernel-team@meta.com \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=riel@surriel.com \
    --cc=roman.gushchin@linux.dev \
    --cc=rppt@kernel.org \
    --cc=ryan.roberts@arm.com \
    --cc=shakeel.butt@linux.dev \
    --cc=willy@infradead.org \
    --cc=yuzhao@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.