From: SeongJae Park <sj@kernel.org>
To: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: SeongJae Park <sj@kernel.org>,
Andrew Morton <akpm@linux-foundation.org>,
"Liam R. Howlett" <howlett@gmail.com>,
David Hildenbrand <david@redhat.com>,
Shakeel Butt <shakeel.butt@linux.dev>,
Vlastimil Babka <vbabka@suse.cz>,
kernel-team@meta.com, linux-kernel@vger.kernel.org,
linux-mm@kvack.org, Rik van Riel <riel@surriel.com>
Subject: Re: [PATCH 6/9] mm/memory: split non-tlb flushing part from zap_page_range_single()
Date: Tue, 11 Mar 2025 13:58:01 -0700 [thread overview]
Message-ID: <20250311205801.85356-1-sj@kernel.org> (raw)
In-Reply-To: <6250fc68-2ce8-43a8-a064-e24877033ce1@lucifer.local>
On Tue, 11 Mar 2025 12:45:44 +0000 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> On Mon, Mar 10, 2025 at 10:23:15AM -0700, SeongJae Park wrote:
> > Some of zap_page_range_single() callers such as [process_]madvise() with
> > MADV_DONEED[_LOCKED] cannot batch tlb flushes because
> > zap_page_range_single() does tlb flushing for each invocation. Split
> > out the body of zap_page_range_single() except mmu_gather object
> > initialization and gathered tlb entries flushing parts for such batched
> > tlb flushing usage.
> >
> > Signed-off-by: SeongJae Park <sj@kernel.org>
> > ---
> > mm/memory.c | 36 ++++++++++++++++++++++--------------
> > 1 file changed, 22 insertions(+), 14 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 78c7ee62795e..88c478e2ed1a 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -1995,38 +1995,46 @@ void unmap_vmas(struct mmu_gather *tlb, struct ma_state *mas,
> > mmu_notifier_invalidate_range_end(&range);
> > }
> >
> > -/**
> > - * zap_page_range_single - remove user pages in a given range
> > - * @vma: vm_area_struct holding the applicable pages
> > - * @address: starting address of pages to zap
> > - * @size: number of bytes to zap
> > - * @details: details of shared cache invalidation
> > - *
> > - * The range must fit into one VMA.
> > - */
> > -void zap_page_range_single(struct vm_area_struct *vma, unsigned long address,
> > +static void unmap_vma_single(struct mmu_gather *tlb,
> > + struct vm_area_struct *vma, unsigned long address,
> > unsigned long size, struct zap_details *details)
> > {
> > const unsigned long end = address + size;
> > struct mmu_notifier_range range;
> > - struct mmu_gather tlb;
> >
> > mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm,
> > address, end);
> > hugetlb_zap_begin(vma, &range.start, &range.end);
> > - tlb_gather_mmu(&tlb, vma->vm_mm);
> > update_hiwater_rss(vma->vm_mm);
> > mmu_notifier_invalidate_range_start(&range);
> > /*
> > * unmap 'address-end' not 'range.start-range.end' as range
> > * could have been expanded for hugetlb pmd sharing.
> > */
> > - unmap_single_vma(&tlb, vma, address, end, details, false);
> > + unmap_single_vma(tlb, vma, address, end, details, false);
> > mmu_notifier_invalidate_range_end(&range);
> > - tlb_finish_mmu(&tlb);
> > hugetlb_zap_end(vma, details);
>
> Previously hugetlb_zap_end() would happen after tlb_finish_mmu(), now it happens
> before?
>
> This seems like a major problem with this change.
Oh, you're right. This could re-introduce the racy hugetlb allocation failure
problem that fixed by commit 2820b0f09be9 ("hugetlbfs: close race between
MADV_DONTNEED and page fault"). That is, this patch can make hugetlb
allocation failures increase while MADV_DONTNEED is going on.
Maybe a straightforward fix of the problem is doing hugetlb_zap_end() for all
vmas in a batched manner, similar to that for tlb flush. For example, add a
list or an array for the vmas in 'struct madvise_behavior', let
'unmap_vma_single()' adds each vma in there, and call hugetlb_zap_end() for
gathered vmas at vector_madvise() or do_madvise(). Does that make sense?
Also Cc-ing Rik, who is the author of the commit 2820b0f09be9 ("hugetlbfs:
close race between MADV_DONTNEED and page fault") for a case that I'm missing
something important.
> If not you need to explain why
> not in the commit message.
I now think it is a problem. If it turns out I'm wrong, I will of course add
the reason on the commit message.
Thanks,
SJ
[...]
next prev parent reply other threads:[~2025-03-11 20:58 UTC|newest]
Thread overview: 42+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-03-10 17:23 [PATCH 0/9] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE SeongJae Park
2025-03-10 17:23 ` [PATCH 1/9] mm/madvise: use is_memory_failure() from madvise_do_behavior() SeongJae Park
2025-03-11 11:27 ` Lorenzo Stoakes
2025-03-10 17:23 ` [PATCH 2/9] mm/madvise: split out populate behavior check logic SeongJae Park
2025-03-11 11:29 ` Lorenzo Stoakes
2025-03-10 17:23 ` [PATCH 3/9] mm/madvise: deduplicate madvise_do_behavior() skip case handlings SeongJae Park
2025-03-11 12:02 ` Lorenzo Stoakes
2025-03-11 20:54 ` SeongJae Park
2025-03-10 17:23 ` [PATCH 4/9] mm/madvise: remove len parameter of madvise_do_behavior() SeongJae Park
2025-03-11 12:05 ` Lorenzo Stoakes
2025-03-10 17:23 ` [PATCH 5/9] mm/madvise: define and use madvise_behavior struct for madvise_do_behavior() SeongJae Park
2025-03-11 12:17 ` Lorenzo Stoakes
2025-03-11 20:56 ` SeongJae Park
2025-03-12 5:47 ` Lorenzo Stoakes
2025-03-12 17:23 ` SeongJae Park
2025-03-10 17:23 ` [PATCH 6/9] mm/memory: split non-tlb flushing part from zap_page_range_single() SeongJae Park
2025-03-11 12:45 ` Lorenzo Stoakes
2025-03-11 20:58 ` SeongJae Park [this message]
2025-03-31 20:24 ` SeongJae Park
2025-04-01 1:45 ` Liam R. Howlett
2025-04-01 2:48 ` SeongJae Park
2025-04-01 14:03 ` Liam R. Howlett
2025-04-01 21:25 ` SeongJae Park
2025-03-10 17:23 ` [PATCH 7/9] mm/madvise: let madvise_{dontneed,free}_single_vma() caller batches tlb flushes SeongJae Park
2025-03-11 13:07 ` Lorenzo Stoakes
2025-03-11 21:00 ` SeongJae Park
2025-03-10 17:23 ` [PATCH 8/9] mm/madvise: batch tlb flushes for [process_]madvise(MADV_{DONTNEED[_LOCKED],FREE}) SeongJae Park
2025-03-11 13:59 ` Lorenzo Stoakes
2025-03-11 21:01 ` SeongJae Park
2025-04-01 21:17 ` SeongJae Park
2025-03-10 17:23 ` [PATCH 9/9] mm/madvise: remove !tlb support from madvise_{dontneed,free}_single_vma() SeongJae Park
2025-03-11 14:01 ` Lorenzo Stoakes
2025-03-11 21:02 ` SeongJae Park
2025-03-12 13:46 ` Lorenzo Stoakes
2025-04-01 21:22 ` SeongJae Park
2025-03-10 22:39 ` [PATCH 0/9] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE Andrew Morton
2025-03-10 23:15 ` Shakeel Butt
2025-03-10 23:36 ` Roman Gushchin
2025-03-11 11:17 ` Lorenzo Stoakes
2025-03-10 23:27 ` SeongJae Park
2025-03-11 12:49 ` Lorenzo Stoakes
2025-03-11 21:03 ` SeongJae Park
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20250311205801.85356-1-sj@kernel.org \
--to=sj@kernel.org \
--cc=akpm@linux-foundation.org \
--cc=david@redhat.com \
--cc=howlett@gmail.com \
--cc=kernel-team@meta.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=lorenzo.stoakes@oracle.com \
--cc=riel@surriel.com \
--cc=shakeel.butt@linux.dev \
--cc=vbabka@suse.cz \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.