Re: [PATCH v3 00/21] huge page clearing optimizations

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Ankur Arora <ankur.a.arora@oracle.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ankur Arora <ankur.a.arora@oracle.com>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Linux-MM <linux-mm@kvack.org>,
	the arch/x86 maintainers <x86@kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	Ingo Molnar <mingo@kernel.org>,
	Andrew Lutomirski <luto@kernel.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Borislav Petkov <bp@alien8.de>,
	Peter Zijlstra <peterz@infradead.org>,
	Andi Kleen <ak@linux.intel.com>, Arnd Bergmann <arnd@arndb.de>,
	Jason Gunthorpe <jgg@nvidia.com>,
	jon.grimm@amd.com, Boris Ostrovsky <boris.ostrovsky@oracle.com>,
	Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>,
	Joao Martins <joao.m.martins@oracle.com>
Subject: Re: [PATCH v3 00/21] huge page clearing optimizations
Date: Thu, 09 Jun 2022 00:54:20 +0530	[thread overview]
Message-ID: <877d5rt0uz.fsf@oracle.com> (raw)
In-Reply-To: <CAHk-=wj9En-BC4t7J9xFZOws5ShwaR9yor7FxHZr8CTVyEP_+Q@mail.gmail.com>


Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Tue, Jun 7, 2022 at 8:10 AM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>>
>> For highmem and page-at-a-time archs we would need to keep some
>> of the same optimizations (via the common clear/copy_user_highpages().)
>
> Yeah, I guess that we could keep the code for legacy use, just make
> the existing code be marked __weak so that it can be ignored for any
> further work.
>
> IOW, the first patch might be to just add that __weak to
> 'clear_huge_page()' and 'copy_user_huge_page()'.
>
> At that point, any architecture can just say "I will implement my own
> versions of these two".
>
> In fact, you can start with just one or the other, which is probably
> nicer to keep the patch series smaller (ie do the simpler
> "clear_huge_page()" first).

Agreed. Best way to iron out all the kinks too.

> I worry a bit about the insanity of the "gigantic" pages, and the
> mem_map_next() games it plays, but that code is from 2008 and I really
> doubt it makes any sense to keep around at least for x86. The source
> of that abomination is powerpc, and I do not think that whole issue
> with MAX_ORDER_NR_PAGES makes any difference on x86, at least.

Looking at it now, it seems to be caused by the wide range of
MAX_ZONEORDER values on powerpc? It made my head hurt so I didn't try
to figure it out in detail.

But, even on x86, AFAICT gigantic pages could straddle MAX_SECTION_BITS?
An arch specific clear_huge_page() code could, however handle 1GB pages
via some kind of static loop around (30 - MAX_SECTION_BITS).

I'm a little fuzzy on CONFIG_SPARSEMEM_EXTREME, and !SPARSEMEM_VMEMMAP
configs. But, I think we should be able to not look up pfn_to_page(),
page_to_pfn() at all or at least not in the inner loop.

> It most definitely makes no sense when there is no highmem issues, and
> all those 'struct page' games should just be deleted (or at least
> relegated entirely to that "legacy __weak function" case so that sane
> situations don't need to care).

Yeah, I'm hoping to do exactly that.

> For that same HIGHMEM reason it's probably a good idea to limit the
> new case just to x86-64, and leave 32-bit x86 behind.

Ack that.

>> Right. Or doing the whole contiguous area in one or a few chunks
>> chunks, and then touching the faulting cachelines towards the end.
>
> Yeah, just add a prefetch for the 'addr_hint' part at the end.
>
>> > Maybe an architecture could do even more radical things like "let's
>> > just 'rep stos' for the whole area, but set a special thread flag that
>> > causes the interrupt return to break it up on return to kernel space".
>> > IOW, the "latency fix" might not even be about chunking it up, it
>> > might look more like our exception handling thing.
>>
>> When I was thinking about this earlier, I had a vague inkling of
>> setting a thread flag and defer writes to the last few cachelines
>> for just before returning to user-space.
>> Can you elaborate a little about what you are describing above?
>
> So 'process_huge_page()' (and the gigantic page case) does three very
> different things:
>
>  (a) that page chunking for highmem accesses
>
>  (b) the page access _ordering_ for the cache hinting reasons
>
>  (c) the chunking for _latency_ reasons
>
> and I think all of them are basically "bad legacy" reasons, in that
>
>  (a) HIGHMEM doesn't exist on sane architectures that we care about these days
>
>  (b) the cache hinting ordering makes no sense if you do non-temporal
> accesses (and might then be replaced by a possible "prefetch" at the
> end)
>
>  (c) the latency reasons still *do* exist, but only with PREEMPT_NONE
>
> So what I was alluding to with those "more radical approaches" was
> that PREEMPT_NONE case: we would probably still want to chunk things
> up for latency reasons and do that "cond_resched()" in  between
> chunks.

Thanks for the detail. That helps.

> Now, there are alternatives here:
>
>  (a) only override that existing disgusting (but tested) function when
> both CONFIG_HIGHMEM and CONFIG_PREEMPT_NONE are false
>
>  (b) do something like this:
>
>     void clear_huge_page(struct page *page,
>         unsigned long addr_hint,
>         unsigned int pages_per_huge_page)
>     {
>         void *addr = page_address(page);
>     #ifdef CONFIG_PREEMPT_NONE
>         for (int i = 0; i < pages_per_huge_page; i++)
>             clear_page(addr, PAGE_SIZE);
>             cond_preempt();
>         }
>     #else
>         nontemporal_clear_big_area(addr, PAGE_SIZE*pages_per_huge_page);
>         prefetch(addr_hint);
>     #endif
>     }

We'll need a preemption point there for CONFIG_PREEMPT_VOLUNTARY
as well, right? Either way, as you said earlier could chunk
up in bigger units than a single page.
(In the numbers I had posted earlier, chunking in units of upto 1MB
gave ~25% higher clearing BW. Don't think the microcode setup costs
are that high, but don't have a good explanation why.)

>  or (c), do that "more radical approach", where you do something like this:
>
>     void clear_huge_page(struct page *page,
>         unsigned long addr_hint,
>         unsigned int pages_per_huge_page)
>     {
>         set_thread_flag(TIF_PREEMPT_ME);
>         nontemporal_clear_big_area(addr, PAGE_SIZE*pages_per_huge_page);
>         clear_thread_flag(TIF_PREEMPT_ME);
>         prefetch(addr_hint);
>     }
>
> and then you make the "return to kernel mode" check the TIF_PREEMPT_ME
> case and actually force preemption even on a non-preempt kernel.

I like this one. I'll try out (b) and (c) and see how the code shakes
out.

Just one minor point -- seems to me that the choice of nontemporal or
temporal might have to be based on a hint to clear_huge_page().

Basically the nontemporal path is only faster for
(pages_per_huge_page * PAGE_SIZE > LLC-size).

So in the page-fault path it might make sense to use the temporal
path (except for gigantic pages.) In the prefault path, nontemporal
might be better.

Thanks

--
ankur

next prev parent reply	other threads:[~2022-06-08 19:25 UTC|newest]

Thread overview: 35+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-06-06 20:20 [PATCH v3 00/21] huge page clearing optimizations Ankur Arora
2022-06-06 20:20 ` [PATCH v3 01/21] mm, huge-page: reorder arguments to process_huge_page() Ankur Arora
2022-06-06 20:20 ` [PATCH v3 02/21] mm, huge-page: refactor process_subpage() Ankur Arora
2022-06-06 20:20 ` [PATCH v3 03/21] clear_page: add generic clear_user_pages() Ankur Arora
2022-06-06 20:20 ` [PATCH v3 04/21] mm, clear_huge_page: support clear_user_pages() Ankur Arora
2022-06-06 20:37 ` [PATCH v3 05/21] mm/huge_page: generalize process_huge_page() Ankur Arora
2022-06-06 20:37 ` [PATCH v3 06/21] x86/clear_page: add clear_pages() Ankur Arora
2022-06-06 20:37 ` [PATCH v3 07/21] x86/asm: add memset_movnti() Ankur Arora
2022-06-06 20:37 ` [PATCH v3 08/21] perf bench: " Ankur Arora
2022-06-06 20:37 ` [PATCH v3 09/21] x86/asm: add clear_pages_movnt() Ankur Arora
2022-06-10 22:11   ` Noah Goldstein
2022-06-10 22:15     ` Noah Goldstein
2022-06-12 11:18       ` Ankur Arora
2022-06-06 20:37 ` [PATCH v3 10/21] x86/asm: add clear_pages_clzero() Ankur Arora
2022-06-06 20:37 ` [PATCH v3 11/21] x86/cpuid: add X86_FEATURE_MOVNT_SLOW Ankur Arora
2022-06-06 20:37 ` [PATCH v3 12/21] sparse: add address_space __incoherent Ankur Arora
2022-06-06 20:37 ` [PATCH v3 13/21] clear_page: add generic clear_user_pages_incoherent() Ankur Arora
2022-06-08  0:01   ` Luc Van Oostenryck
2022-06-12 11:19     ` Ankur Arora
2022-06-06 20:37 ` [PATCH v3 14/21] x86/clear_page: add clear_pages_incoherent() Ankur Arora
2022-06-06 20:37 ` [PATCH v3 15/21] mm/clear_page: add clear_page_non_caching_threshold() Ankur Arora
2022-06-06 20:37 ` [PATCH v3 16/21] x86/clear_page: add arch_clear_page_non_caching_threshold() Ankur Arora
2022-06-06 20:37 ` [PATCH v3 17/21] clear_huge_page: use non-cached clearing Ankur Arora
2022-06-06 20:37 ` [PATCH v3 18/21] gup: add FOLL_HINT_BULK, FAULT_FLAG_NON_CACHING Ankur Arora
2022-06-06 20:37 ` [PATCH v3 19/21] gup: hint non-caching if clearing large regions Ankur Arora
2022-06-06 20:37 ` [PATCH v3 20/21] vfio_iommu_type1: specify FOLL_HINT_BULK to pin_user_pages() Ankur Arora
2022-06-06 20:37 ` [PATCH v3 21/21] x86/cpu/intel: set X86_FEATURE_MOVNT_SLOW for Skylake Ankur Arora
2022-06-06 21:53 ` [PATCH v3 00/21] huge page clearing optimizations Linus Torvalds
2022-06-07 15:08   ` Ankur Arora
2022-06-07 17:56     ` Linus Torvalds
2022-06-08 19:24       ` Ankur Arora [this message]
2022-06-08 19:39         ` Linus Torvalds
2022-06-08 20:21           ` Ankur Arora
2022-06-08 19:49       ` Matthew Wilcox
2022-06-08 19:51         ` Matthew Wilcox

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=877d5rt0uz.fsf@oracle.com \
    --to=ankur.a.arora@oracle.com \
    --cc=ak@linux.intel.com \
    --cc=akpm@linux-foundation.org \
    --cc=arnd@arndb.de \
    --cc=boris.ostrovsky@oracle.com \
    --cc=bp@alien8.de \
    --cc=jgg@nvidia.com \
    --cc=joao.m.martins@oracle.com \
    --cc=jon.grimm@amd.com \
    --cc=konrad.wilk@oracle.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=luto@kernel.org \
    --cc=mike.kravetz@oracle.com \
    --cc=mingo@kernel.org \
    --cc=peterz@infradead.org \
    --cc=tglx@linutronix.de \
    --cc=torvalds@linux-foundation.org \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.