From: Ryan Roberts <ryan.roberts@arm.com>
To: Barry Song <21cnbao@gmail.com>
Cc: Lance Yang <ioworker0@gmail.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	David Hildenbrand <david@redhat.com>,
	Matthew Wilcox <willy@infradead.org>,
	Huang Ying <ying.huang@intel.com>, Gao Xiang <xiang@kernel.org>,
	Yu Zhao <yuzhao@google.com>, Yang Shi <shy828301@gmail.com>,
	Michal Hocko <mhocko@suse.com>,
	Kefeng Wang <wangkefeng.wang@huawei.com>,
	Chris Li <chrisl@kernel.org>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v4 6/6] mm: madvise: Avoid split during MADV_PAGEOUT and MADV_COLD
Date: Wed, 13 Mar 2024 09:36:52 +0000
Message-ID: <00a3ba1d-98e1-409b-ae6e-7fbcbdcd74d5@arm.com>
In-Reply-To: <CAGsJ_4wodFkL4YZ1iQveUjK6QL7sNajyayBq4hJ3-GPoWJ6foQ@mail.gmail.com>

On 13/03/2024 09:16, Barry Song wrote:
> On Wed, Mar 13, 2024 at 10:03 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 13/03/2024 07:19, Barry Song wrote:
>>> On Tue, Mar 12, 2024 at 4:01 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
>>>> Rework madvise_cold_or_pageout_pte_range() to avoid splitting any large
>>>> folio that is fully and contiguously mapped in the pageout/cold vm
>>>> range. This change means that large folios will be maintained all the
>>>> way to swap storage. This both improves performance during swap-out, by
>>>> eliding the cost of splitting the folio, and sets us up nicely for
>>>> maintaining the large folio when it is swapped back in (to be covered in
>>>> a separate series).
>>>>
>>>> Folios that are not fully mapped in the target range are still split,
>>>> but note that behavior is changed so that if the split fails for any
>>>> reason (folio locked, shared, etc) we now leave it as is and move to the
>>>> next pte in the range and continue work on the subsequent folios.
>>>> Previously any failure of this sort would cause the entire operation to
>>>> give up and no folios mapped at higher addresses were paged out or made
>>>> cold. Given large folios are becoming more common, this old behavior
>>>> would likely have led to wasted opportunities.
>>>>
>>>> While we are at it, change the code that clears young from the ptes to
>>>> use ptep_test_and_clear_young(), which is more efficient than
>>>> get_and_clear/modify/set, especially for contpte mappings on arm64,
>>>> where the old approach would require unfolding/refolding and the new
>>>> approach can be done in place.
>>>>
>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>
>>> This looks so much better than our initial RFC.
>>> Thank you for your excellent work!
>>
>> Thanks - it's a team effort - I had your PoC and David's previous batching work
>> to use as a template.
>>
>>>
>>>> ---
>>>>  mm/madvise.c | 89 ++++++++++++++++++++++++++++++----------------------
>>>>  1 file changed, 51 insertions(+), 38 deletions(-)
>>>>
>>>> diff --git a/mm/madvise.c b/mm/madvise.c
>>>> index 547dcd1f7a39..56c7ba7bd558 100644
>>>> --- a/mm/madvise.c
>>>> +++ b/mm/madvise.c
>>>> @@ -336,6 +336,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>>>>         LIST_HEAD(folio_list);
>>>>         bool pageout_anon_only_filter;
>>>>         unsigned int batch_count = 0;
>>>> +       int nr;
>>>>
>>>>         if (fatal_signal_pending(current))
>>>>                 return -EINTR;
>>>> @@ -423,7 +424,8 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>>>>                 return 0;
>>>>         flush_tlb_batched_pending(mm);
>>>>         arch_enter_lazy_mmu_mode();
>>>> -       for (; addr < end; pte++, addr += PAGE_SIZE) {
>>>> +       for (; addr < end; pte += nr, addr += nr * PAGE_SIZE) {
>>>> +               nr = 1;
>>>>                 ptent = ptep_get(pte);
>>>>
>>>>                 if (++batch_count == SWAP_CLUSTER_MAX) {
>>>> @@ -447,55 +449,66 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>>>>                         continue;
>>>>
>>>>                 /*
>>>> -                * Creating a THP page is expensive so split it only if we
>>>> -                * are sure it's worth. Split it if we are only owner.
>>>> +                * If we encounter a large folio, only split it if it is not
>>>> +                * fully mapped within the range we are operating on. Otherwise
>>>> +                * leave it as is so that it can be swapped out whole. If we
>>>> +                * fail to split a folio, leave it in place and advance to the
>>>> +                * next pte in the range.
>>>>                  */
>>>>                 if (folio_test_large(folio)) {
>>>> -                       int err;
>>>> -
>>>> -                       if (folio_estimated_sharers(folio) > 1)
>>>> -                               break;
>>>> -                       if (pageout_anon_only_filter && !folio_test_anon(folio))
>>>> -                               break;
>>>> -                       if (!folio_trylock(folio))
>>>> -                               break;
>>>> -                       folio_get(folio);
>>>> -                       arch_leave_lazy_mmu_mode();
>>>> -                       pte_unmap_unlock(start_pte, ptl);
>>>> -                       start_pte = NULL;
>>>> -                       err = split_folio(folio);
>>>> -                       folio_unlock(folio);
>>>> -                       folio_put(folio);
>>>> -                       if (err)
>>>> -                               break;
>>>> -                       start_pte = pte =
>>>> -                               pte_offset_map_lock(mm, pmd, addr, &ptl);
>>>> -                       if (!start_pte)
>>>> -                               break;
>>>> -                       arch_enter_lazy_mmu_mode();
>>>> -                       pte--;
>>>> -                       addr -= PAGE_SIZE;
>>>> -                       continue;
>>>> +                       const fpb_t fpb_flags = FPB_IGNORE_DIRTY |
>>>> +                                               FPB_IGNORE_SOFT_DIRTY;
>>>> +                       int max_nr = (end - addr) / PAGE_SIZE;
>>>> +
>>>> +                       nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
>>>> +                                            fpb_flags, NULL);
>>>
>>> I wonder if we have a quick way to avoid folio_pte_batch() if users
>>> are doing madvise() on a portion of a large folio.
>>
>> Good idea. Something like this?:
>>
>>         if (pte_pfn(ptent) == folio_pfn(folio))
> 
> what about
> 
> "if (pte_pfn(ptent) == folio_pfn(folio) && max_nr >= nr_pages)"
> 
>  just to account for cases where the user's end address falls within
> the middle of a large folio?

yes, even better. I'll add this for the next version.
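
For illustration only, the combined check can be modelled in userspace as
below. This is a rough sketch, not the code that will go into the next
version: struct model_folio and may_be_fully_mapped() are hypothetical
names, and plain integers stand in for pte_pfn(), folio_pfn() and
folio_nr_pages().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two folio properties the check needs. */
struct model_folio {
	unsigned long first_pfn;	/* models folio_pfn(folio) */
	unsigned long nr_pages;		/* models folio_nr_pages(folio) */
};

/*
 * Cheap pre-check before the (more expensive) pte batch scan.
 *
 * A folio can only be fully mapped in the remaining range if:
 *  - the current pte maps the folio's first page, and
 *  - at least nr_pages ptes remain before the end of the range.
 *
 * A true result does not prove the folio is fully mapped; it only means
 * the batch scan is still worth doing. A false result means the folio
 * cannot be fully mapped here, so the scan can be skipped.
 */
bool may_be_fully_mapped(const struct model_folio *folio,
			 unsigned long pte_pfn_val,
			 unsigned long max_nr)
{
	return pte_pfn_val == folio->first_pfn && max_nr >= folio->nr_pages;
}
```

With a 16-page folio whose first pfn is 0x1000, the check passes only when
the pte maps pfn 0x1000 and at least 16 ptes remain in the range; starting
one page in, or with the range ending inside the folio, both fail it.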

> 
> 
> BTW, another minor issue is here:
> 
>                 if (++batch_count == SWAP_CLUSTER_MAX) {
>                         batch_count = 0;
>                         if (need_resched()) {
>                                 arch_leave_lazy_mmu_mode();
>                                 pte_unmap_unlock(start_pte, ptl);
>                                 cond_resched();
>                                 goto restart;
>                         }
>                 }
> 
> We now increment batch_count by 1 per batch of nr ptes, so we hold the PTL
> longer than in the small-folio case? We used to increment it by 1 for each
> pte. Does it matter?

I thought about that, but the vast majority of the work is per-folio, not
per-pte. So I concluded it would be best to continue to increment per-folio.
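
To put a rough number on the trade-off, here is a small userspace model of
the counter. It is only a sketch: max_ptes_between_checks() and
MODEL_SWAP_CLUSTER_MAX are hypothetical names, the constant is assumed to
match the kernel's SWAP_CLUSTER_MAX of 32, and the model ignores
need_resched() by treating every counter wrap as a checkpoint.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: mirrors the kernel's SWAP_CLUSTER_MAX value of 32. */
#define MODEL_SWAP_CLUSTER_MAX 32

/*
 * Model n loop iterations, where iteration i covers nr[i] ptes
 * (1 for a small folio, folio_nr_pages() for a fully mapped large folio).
 * The counter is bumped once per iteration, as in the patch.
 *
 * Returns the largest number of ptes covered between two consecutive
 * checkpoints (or between a checkpoint and the start/end of the walk).
 */
size_t max_ptes_between_checks(const unsigned int *nr, size_t n)
{
	unsigned int batch_count = 0;
	size_t covered = 0, worst = 0;

	for (size_t i = 0; i < n; i++) {
		/* The check sits at the top of the loop body, before the work. */
		if (++batch_count == MODEL_SWAP_CLUSTER_MAX) {
			batch_count = 0;
			if (covered > worst)
				worst = covered;
			covered = 0;
		}
		covered += nr[i];
	}
	if (covered > worst)
		worst = covered;
	return worst;
}
```

In this model the number of folios handled per window is unchanged, while
the number of ptes per window scales with the folio size. Whether that
matters depends on how much of the per-iteration cost is really per-pte.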

> 
>>                 nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
>>                                      fpb_flags, NULL);
>>
>> If we are not mapping the first page of the folio, then it can't be a full
>> mapping, so no need to call folio_pte_batch(). Just split it.
>>
>>>
>>>> +
>>>> +                       if (nr < folio_nr_pages(folio)) {
>>>> +                               int err;
>>>> +
>>>> +                               if (folio_estimated_sharers(folio) > 1)
>>>> +                                       continue;
>>>> +                               if (pageout_anon_only_filter && !folio_test_anon(folio))
>>>> +                                       continue;
>>>> +                               if (!folio_trylock(folio))
>>>> +                                       continue;
>>>> +                               folio_get(folio);
>>>> +                               arch_leave_lazy_mmu_mode();
>>>> +                               pte_unmap_unlock(start_pte, ptl);
>>>> +                               start_pte = NULL;
>>>> +                               err = split_folio(folio);
>>>> +                               folio_unlock(folio);
>>>> +                               folio_put(folio);
>>>> +                               if (err)
>>>> +                                       continue;
>>>> +                               start_pte = pte =
>>>> +                                       pte_offset_map_lock(mm, pmd, addr, &ptl);
>>>> +                               if (!start_pte)
>>>> +                                       break;
>>>> +                               arch_enter_lazy_mmu_mode();
>>>> +                               nr = 0;
>>>> +                               continue;
>>>> +                       }
>>>>                 }
>>>>
>>>>                 /*
>>>>                  * Do not interfere with other mappings of this folio and
>>>> -                * non-LRU folio.
>>>> +                * non-LRU folio. If we have a large folio at this point, we
>>>> +                * know it is fully mapped so if its mapcount is the same as its
>>>> +                * number of pages, it must be exclusive.
>>>>                  */
>>>> -               if (!folio_test_lru(folio) || folio_mapcount(folio) != 1)
>>>> +               if (!folio_test_lru(folio) ||
>>>> +                   folio_mapcount(folio) != folio_nr_pages(folio))
>>>>                         continue;
>>>
>>> This looks so perfect and is exactly what I wanted to achieve.
>>>
>>>>
>>>>                 if (pageout_anon_only_filter && !folio_test_anon(folio))
>>>>                         continue;
>>>>
>>>> -               VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
>>>> -
>>>> -               if (!pageout && pte_young(ptent)) {
>>>> -                       ptent = ptep_get_and_clear_full(mm, addr, pte,
>>>> -                                                       tlb->fullmm);
>>>> -                       ptent = pte_mkold(ptent);
>>>> -                       set_pte_at(mm, addr, pte, ptent);
>>>> -                       tlb_remove_tlb_entry(tlb, pte, addr);
>>>> +               if (!pageout) {
>>>> +                       for (; nr != 0; nr--, pte++, addr += PAGE_SIZE) {
>>>> +                               if (ptep_test_and_clear_young(vma, addr, pte))
>>>> +                                       tlb_remove_tlb_entry(tlb, pte, addr);
>>>> +                       }
>>>
>>> This looks so smart. If it is not pageout, we have already advanced pte
>>> and addr here, so nr is 0 and we don't need to advance them again in
>>> for (; addr < end; pte += nr, addr += nr * PAGE_SIZE)
>>>
>>> Otherwise, nr won't be 0, so we will advance addr and
>>> pte by nr.
>>
>> Indeed. I'm hoping that Lance is able to follow a similar pattern for
>> madvise_free_pte_range().
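
In case it helps anyone following the pattern, here is a userspace sketch
of how the outer and inner loops share nr. It is a simplified model under
stated assumptions, not the kernel code: walk_range() is a hypothetical
name, a bool array stands in for the young bits of the ptes, and
batch_at[i] stands in for the result of folio_pte_batch() at pte i.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Walk n model ptes. batch_at[i] gives the batch size for a folio whose
 * mapping starts at pte i (1 for a small folio).
 *
 * Cold path (!pageout): the inner loop clears young on every pte of the
 * batch and advances i itself, leaving nr == 0, so the outer "i += nr"
 * adds nothing further.
 * Pageout path: the inner loop is skipped, so nr keeps its batch size and
 * the outer loop steps over the whole batch.
 *
 * Returns the number of outer iterations, i.e. folios visited.
 */
size_t walk_range(bool *young, const size_t *batch_at, size_t n, bool pageout)
{
	size_t visits = 0;
	size_t nr;

	for (size_t i = 0; i < n; i += nr) {
		nr = batch_at[i];
		if (nr == 0)
			nr = 1;		/* guard the model against a zero batch */
		if (nr > n - i)
			nr = n - i;	/* never run past the end of the range */
		visits++;

		if (!pageout) {
			for (; nr != 0; nr--, i++)
				young[i] = false;
		}
	}
	return visits;
}
```

Both paths visit the same number of folios and end at the same position;
only the cold path touches the per-pte young state.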
>>
>>
>>>
>>>
>>>>                 }
>>>>
>>>>                 /*
>>>> --
>>>> 2.25.1
>>>>
>>>
>>> Overall, LGTM,
>>>
>>> Reviewed-by: Barry Song <v-songbaohua@oppo.com>
>>
>> Thanks!
>>
>>


