From: Thomas Gleixner <tglx@linutronix.de>
To: Mark Rutland <mark.rutland@arm.com>
Cc: Nadav Amit <nadav.amit@gmail.com>,
Uladzislau Rezki <urezki@gmail.com>,
"Russell King (Oracle)" <linux@armlinux.org.uk>,
Andrew Morton <akpm@linux-foundation.org>,
linux-mm <linux-mm@kvack.org>, Christoph Hellwig <hch@lst.de>,
Lorenzo Stoakes <lstoakes@gmail.com>,
Peter Zijlstra <peterz@infradead.org>,
Baoquan He <bhe@redhat.com>, John Ogness <jogness@linutronix.de>,
linux-arm-kernel@lists.infradead.org,
Marc Zyngier <maz@kernel.org>,
x86@kernel.org
Subject: Re: Excessive TLB flush ranges
Date: Wed, 17 May 2023 18:41:44 +0200 [thread overview]
Message-ID: <87bkii6hbr.ffs@tglx> (raw)
In-Reply-To: <ZGTngQhcJ19/dMbm@FVFF77S0Q05N.cambridge.arm.com>
On Wed, May 17 2023 at 15:43, Mark Rutland wrote:
> On Wed, May 17, 2023 at 12:31:04PM +0200, Thomas Gleixner wrote:
>> The way how arm/arm64 implement that in software is:
>>
>> magic_barrier1();
>> flush_range_with_magic_opcodes();
>> magic_barrier2();
>
> FWIW, on arm64 that sequence (for leaf entries only) is:
>
> /*
> * Make sure prior writes to the page table entries are visible to all
> * CPUs, so that *subsequent* page table walks will see the latest
> * values.
> *
> * This is roughly __smp_wmb().
> */
> dsb(ishst) // AKA magic_barrier1()
>
> /*
> * The "TLBI *IS, <addr>" instructions send a message to all other
> * CPUs, essentially saying "please start invalidating entries for
> * <addr>"
> *
> * The "TLBI *ALL*IS" instructions send a message to all other CPUs,
> * essentially saying "please start invalidating all entries".
> *
> * In theory, this could be for discontiguous ranges.
> */
> flush_range_with_magic_opcodes()
>
> /*
> * Wait for acknowledgement that all prior TLBIs have completed. This
> * also ensures that all accesses using those translations have also
> * completed.
> *
> * This waits for all relevant CPUs to acknowledge completion of any
> * prior TLBIs sent by this CPU.
> */
> dsb(ish) // AKA magic_barrier2()
> isb()
>
> So you can batch a bunch of "TLBI *IS, <addr>" with a single barrier for
> completion, or you can use a single "TLBI *ALL*IS" to invalidate everything.
>
> It can still be worth using the latter, as arm64 has done since commit:
>
> 05ac65305437e8ef ("arm64: fix soft lockup due to large tlb flush range")
>
> ... as for a large range, issuing a bunch of "TLBI *IS, <addr>" can take a
> while, and can require the recipient CPUs to do more work than they might have
> to do for a single "TLBI *ALL*IS".
And looking at the changelog and backtrace:
PC is at __cpu_flush_kern_tlb_range+0xc/0x40
LR is at __purge_vmap_area_lazy+0x28c/0x3ac
I'm willing to bet that this is exactly the same scenario of a direct
map + module area flush. That's the only one we found so far which
creates insanely large ranges.
The other effects of coalescing can still result in seriously oversized
flushs for just a couple of pages. The worst I've seen aside of that BPF
muck was a 'flush 2 pages' with an resulting range of ~3.8MB.
> The point at which invalidating everything is better depends on a number of
> factors (e.g. the impact of all CPUs needing to make new page table walks), and
> currently we have an arbitrary boundary where we choose to invalidate
> everything (which has been tweaked a bit over time); there isn't really a
> one-size-fits-all best answer.
I'm well aware of that :)
Thanks,
tglx
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
WARNING: multiple messages have this Message-ID (diff)
From: Thomas Gleixner <tglx@linutronix.de>
To: Mark Rutland <mark.rutland@arm.com>
Cc: Nadav Amit <nadav.amit@gmail.com>,
Uladzislau Rezki <urezki@gmail.com>,
"Russell King (Oracle)" <linux@armlinux.org.uk>,
Andrew Morton <akpm@linux-foundation.org>,
linux-mm <linux-mm@kvack.org>, Christoph Hellwig <hch@lst.de>,
Lorenzo Stoakes <lstoakes@gmail.com>,
Peter Zijlstra <peterz@infradead.org>,
Baoquan He <bhe@redhat.com>, John Ogness <jogness@linutronix.de>,
linux-arm-kernel@lists.infradead.org,
Marc Zyngier <maz@kernel.org>,
x86@kernel.org
Subject: Re: Excessive TLB flush ranges
Date: Wed, 17 May 2023 18:41:44 +0200 [thread overview]
Message-ID: <87bkii6hbr.ffs@tglx> (raw)
In-Reply-To: <ZGTngQhcJ19/dMbm@FVFF77S0Q05N.cambridge.arm.com>
On Wed, May 17 2023 at 15:43, Mark Rutland wrote:
> On Wed, May 17, 2023 at 12:31:04PM +0200, Thomas Gleixner wrote:
>> The way how arm/arm64 implement that in software is:
>>
>> magic_barrier1();
>> flush_range_with_magic_opcodes();
>> magic_barrier2();
>
> FWIW, on arm64 that sequence (for leaf entries only) is:
>
> /*
> * Make sure prior writes to the page table entries are visible to all
> * CPUs, so that *subsequent* page table walks will see the latest
> * values.
> *
> * This is roughly __smp_wmb().
> */
> dsb(ishst) // AKA magic_barrier1()
>
> /*
> * The "TLBI *IS, <addr>" instructions send a message to all other
> * CPUs, essentially saying "please start invalidating entries for
> * <addr>"
> *
> * The "TLBI *ALL*IS" instructions send a message to all other CPUs,
> * essentially saying "please start invalidating all entries".
> *
> * In theory, this could be for discontiguous ranges.
> */
> flush_range_with_magic_opcodes()
>
> /*
> * Wait for acknowledgement that all prior TLBIs have completed. This
> * also ensures that all accesses using those translations have also
> * completed.
> *
> * This waits for all relevant CPUs to acknowledge completion of any
> * prior TLBIs sent by this CPU.
> */
> dsb(ish) // AKA magic_barrier2()
> isb()
>
> So you can batch a bunch of "TLBI *IS, <addr>" with a single barrier for
> completion, or you can use a single "TLBI *ALL*IS" to invalidate everything.
>
> It can still be worth using the latter, as arm64 has done since commit:
>
> 05ac65305437e8ef ("arm64: fix soft lockup due to large tlb flush range")
>
> ... as for a large range, issuing a bunch of "TLBI *IS, <addr>" can take a
> while, and can require the recipient CPUs to do more work than they might have
> to do for a single "TLBI *ALL*IS".
And looking at the changelog and backtrace:
PC is at __cpu_flush_kern_tlb_range+0xc/0x40
LR is at __purge_vmap_area_lazy+0x28c/0x3ac
I'm willing to bet that this is exactly the same scenario of a direct
map + module area flush. That's the only one we found so far which
creates insanely large ranges.
The other effects of coalescing can still result in seriously oversized
flushs for just a couple of pages. The worst I've seen aside of that BPF
muck was a 'flush 2 pages' with an resulting range of ~3.8MB.
> The point at which invalidating everything is better depends on a number of
> factors (e.g. the impact of all CPUs needing to make new page table walks), and
> currently we have an arbitrary boundary where we choose to invalidate
> everything (which has been tweaked a bit over time); there isn't really a
> one-size-fits-all best answer.
I'm well aware of that :)
Thanks,
tglx
next prev parent reply other threads:[~2023-05-17 16:42 UTC|newest]
Thread overview: 150+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-05-15 16:43 Excessive TLB flush ranges Thomas Gleixner
2023-05-15 16:43 ` Thomas Gleixner
2023-05-15 16:59 ` Russell King (Oracle)
2023-05-15 16:59 ` Russell King (Oracle)
2023-05-15 19:46 ` Thomas Gleixner
2023-05-15 19:46 ` Thomas Gleixner
2023-05-15 21:11 ` Thomas Gleixner
2023-05-15 21:11 ` Thomas Gleixner
2023-05-15 21:31 ` Russell King (Oracle)
2023-05-15 21:31 ` Russell King (Oracle)
2023-05-16 6:37 ` Thomas Gleixner
2023-05-16 6:37 ` Thomas Gleixner
2023-05-16 6:46 ` Thomas Gleixner
2023-05-16 6:46 ` Thomas Gleixner
2023-05-16 8:18 ` Thomas Gleixner
2023-05-16 8:18 ` Thomas Gleixner
2023-05-16 8:20 ` Thomas Gleixner
2023-05-16 8:20 ` Thomas Gleixner
2023-05-16 8:27 ` Russell King (Oracle)
2023-05-16 8:27 ` Russell King (Oracle)
2023-05-16 9:03 ` Thomas Gleixner
2023-05-16 9:03 ` Thomas Gleixner
2023-05-16 10:05 ` Baoquan He
2023-05-16 10:05 ` Baoquan He
2023-05-16 14:21 ` Thomas Gleixner
2023-05-16 14:21 ` Thomas Gleixner
2023-05-16 19:03 ` Thomas Gleixner
2023-05-16 19:03 ` Thomas Gleixner
2023-05-17 9:38 ` Thomas Gleixner
2023-05-17 9:38 ` Thomas Gleixner
2023-05-17 10:52 ` Baoquan He
2023-05-17 10:52 ` Baoquan He
2023-05-19 11:22 ` Thomas Gleixner
2023-05-19 11:22 ` Thomas Gleixner
2023-05-19 11:49 ` Baoquan He
2023-05-19 11:49 ` Baoquan He
2023-05-19 14:13 ` Thomas Gleixner
2023-05-19 14:13 ` Thomas Gleixner
2023-05-19 12:01 ` [RFC PATCH 1/3] mm/vmalloc.c: try to flush vmap_area one by one Baoquan He
2023-05-19 12:01 ` Baoquan He
2023-05-19 14:16 ` Thomas Gleixner
2023-05-19 14:16 ` Thomas Gleixner
2023-05-19 12:02 ` [RFC PATCH 2/3] mm/vmalloc.c: Only flush VM_FLUSH_RESET_PERMS area immediately Baoquan He
2023-05-19 12:02 ` Baoquan He
2023-05-19 12:03 ` [RFC PATCH 3/3] mm/vmalloc.c: change _vm_unmap_aliases() to do purge firstly Baoquan He
2023-05-19 12:03 ` Baoquan He
2023-05-19 14:17 ` Thomas Gleixner
2023-05-19 14:17 ` Thomas Gleixner
2023-05-19 18:38 ` Thomas Gleixner
2023-05-19 18:38 ` Thomas Gleixner
2023-05-19 23:46 ` Baoquan He
2023-05-19 23:46 ` Baoquan He
2023-05-21 23:10 ` Thomas Gleixner
2023-05-21 23:10 ` Thomas Gleixner
2023-05-22 11:21 ` Baoquan He
2023-05-22 11:21 ` Baoquan He
2023-05-22 12:02 ` Thomas Gleixner
2023-05-22 12:02 ` Thomas Gleixner
2023-05-22 14:34 ` Baoquan He
2023-05-22 14:34 ` Baoquan He
2023-05-22 20:21 ` Thomas Gleixner
2023-05-22 20:21 ` Thomas Gleixner
2023-05-22 20:44 ` Thomas Gleixner
2023-05-22 20:44 ` Thomas Gleixner
2023-05-23 9:35 ` Baoquan He
2023-05-23 9:35 ` Baoquan He
2023-05-19 13:49 ` Excessive TLB flush ranges Thomas Gleixner
2023-05-19 13:49 ` Thomas Gleixner
2023-05-16 8:21 ` Russell King (Oracle)
2023-05-16 8:21 ` Russell King (Oracle)
2023-05-16 8:19 ` Russell King (Oracle)
2023-05-16 8:19 ` Russell King (Oracle)
2023-05-16 8:44 ` Thomas Gleixner
2023-05-16 8:44 ` Thomas Gleixner
2023-05-16 8:48 ` Russell King (Oracle)
2023-05-16 8:48 ` Russell King (Oracle)
2023-05-16 12:09 ` Thomas Gleixner
2023-05-16 12:09 ` Thomas Gleixner
2023-05-16 13:42 ` Uladzislau Rezki
2023-05-16 13:42 ` Uladzislau Rezki
2023-05-16 14:38 ` Thomas Gleixner
2023-05-16 14:38 ` Thomas Gleixner
2023-05-16 15:01 ` Uladzislau Rezki
2023-05-16 15:01 ` Uladzislau Rezki
2023-05-16 17:04 ` Thomas Gleixner
2023-05-16 17:04 ` Thomas Gleixner
2023-05-17 11:26 ` Uladzislau Rezki
2023-05-17 11:26 ` Uladzislau Rezki
2023-05-17 11:58 ` Thomas Gleixner
2023-05-17 11:58 ` Thomas Gleixner
2023-05-17 12:15 ` Uladzislau Rezki
2023-05-17 12:15 ` Uladzislau Rezki
2023-05-17 16:32 ` Thomas Gleixner
2023-05-17 16:32 ` Thomas Gleixner
2023-05-19 10:01 ` Uladzislau Rezki
2023-05-19 10:01 ` Uladzislau Rezki
2023-05-19 14:56 ` Thomas Gleixner
2023-05-19 14:56 ` Thomas Gleixner
2023-05-19 15:14 ` Uladzislau Rezki
2023-05-19 15:14 ` Uladzislau Rezki
2023-05-19 16:32 ` Thomas Gleixner
2023-05-19 16:32 ` Thomas Gleixner
2023-05-19 17:02 ` Uladzislau Rezki
2023-05-19 17:02 ` Uladzislau Rezki
2023-05-16 17:56 ` Nadav Amit
2023-05-16 17:56 ` Nadav Amit
2023-05-16 19:32 ` Thomas Gleixner
2023-05-16 19:32 ` Thomas Gleixner
2023-05-17 0:23 ` Thomas Gleixner
2023-05-17 0:23 ` Thomas Gleixner
2023-05-17 1:23 ` Nadav Amit
2023-05-17 1:23 ` Nadav Amit
2023-05-17 10:31 ` Thomas Gleixner
2023-05-17 10:31 ` Thomas Gleixner
2023-05-17 11:47 ` Thomas Gleixner
2023-05-17 11:47 ` Thomas Gleixner
2023-05-17 22:41 ` Nadav Amit
2023-05-17 22:41 ` Nadav Amit
2023-05-17 14:43 ` Mark Rutland
2023-05-17 14:43 ` Mark Rutland
2023-05-17 16:41 ` Thomas Gleixner [this message]
2023-05-17 16:41 ` Thomas Gleixner
2023-05-17 22:57 ` Nadav Amit
2023-05-17 22:57 ` Nadav Amit
2023-05-19 11:49 ` Thomas Gleixner
2023-05-19 11:49 ` Thomas Gleixner
2023-05-17 12:12 ` Russell King (Oracle)
2023-05-17 12:12 ` Russell King (Oracle)
2023-05-17 23:14 ` Nadav Amit
2023-05-17 23:14 ` Nadav Amit
2023-05-15 18:17 ` Uladzislau Rezki
2023-05-15 18:17 ` Uladzislau Rezki
2023-05-16 2:26 ` Baoquan He
2023-05-16 2:26 ` Baoquan He
2023-05-16 6:40 ` Thomas Gleixner
2023-05-16 6:40 ` Thomas Gleixner
2023-05-16 8:07 ` Baoquan He
2023-05-16 8:07 ` Baoquan He
2023-05-16 8:10 ` Baoquan He
2023-05-16 8:10 ` Baoquan He
2023-05-16 8:45 ` Russell King (Oracle)
2023-05-16 8:45 ` Russell King (Oracle)
2023-05-16 9:13 ` Thomas Gleixner
2023-05-16 9:13 ` Thomas Gleixner
2023-05-16 8:54 ` Thomas Gleixner
2023-05-16 8:54 ` Thomas Gleixner
2023-05-16 9:48 ` Baoquan He
2023-05-16 9:48 ` Baoquan He
2023-05-15 20:02 ` Nadav Amit
2023-05-15 20:02 ` Nadav Amit
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=87bkii6hbr.ffs@tglx \
--to=tglx@linutronix.de \
--cc=akpm@linux-foundation.org \
--cc=bhe@redhat.com \
--cc=hch@lst.de \
--cc=jogness@linutronix.de \
--cc=linux-arm-kernel@lists.infradead.org \
--cc=linux-mm@kvack.org \
--cc=linux@armlinux.org.uk \
--cc=lstoakes@gmail.com \
--cc=mark.rutland@arm.com \
--cc=maz@kernel.org \
--cc=nadav.amit@gmail.com \
--cc=peterz@infradead.org \
--cc=urezki@gmail.com \
--cc=x86@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.