* Excessive TLB flush ranges
@ 2023-05-15 16:43 Thomas Gleixner
2023-05-15 16:59 ` Russell King (Oracle)
` (2 more replies)
0 siblings, 3 replies; 75+ messages in thread
From: Thomas Gleixner @ 2023-05-15 16:43 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, Christoph Hellwig, Uladzislau Rezki, Lorenzo Stoakes,
Peter Zijlstra, Baoquan He, John Ogness, linux-arm-kernel,
Russell King, Mark Rutland, Marc Zyngier
Folks!
We're observing massive latencies and slowdowns on ARM32 machines due to
excessive TLB flush ranges.
Those can be observed when tearing down a process, which has a seccomp
BPF filter installed. ARM32 uses the vmalloc area for module space.
bpf_prog_free_deferred()
vfree()
_vm_unmap_aliases()
collect_per_cpu_vmap_blocks: start:0x95c8d000 end:0x95c8e000 size:0x1000
__purge_vmap_area_lazy(start:0x95c8d000, end:0x95c8e000)
va_start:0xf08a1000 va_end:0xf08a5000 size:0x00004000 gap:0x5ac13000 (371731 pages)
va_start:0xf08a5000 va_end:0xf08a9000 size:0x00004000 gap:0x00000000 ( 0 pages)
va_start:0xf08a9000 va_end:0xf08ad000 size:0x00004000 gap:0x00000000 ( 0 pages)
va_start:0xf08ad000 va_end:0xf08b1000 size:0x00004000 gap:0x00000000 ( 0 pages)
va_start:0xf08b3000 va_end:0xf08b7000 size:0x00004000 gap:0x00002000 ( 2 pages)
va_start:0xf08b7000 va_end:0xf08bb000 size:0x00004000 gap:0x00000000 ( 0 pages)
va_start:0xf08bb000 va_end:0xf08bf000 size:0x00004000 gap:0x00000000 ( 0 pages)
va_start:0xf0a15000 va_end:0xf0a17000 size:0x00002000 gap:0x00156000 ( 342 pages)
flush_tlb_kernel_range(start:0x95c8d000, end:0xf0a17000)
Does 372106 flush operations where only 31 are useful
So for all architectures which lack a mechanism to do a full TLB flush
in flush_tlb_kernel_range() this takes ages (4-8ms) and slows down
realtime processes on the other CPUs by a factor of two and larger.
So while ARM32, CSKY, NIOS, PPC (some variants), _should_ arguably have
a fallback to tlb_flush_all() when the range is too large, there is
another issue. I've seen a couple of instances where _vm_unmap_aliases()
collects one page and the actual va list has only 2 pages, which might
be eventually worth to flush one by one.
I'm not sure whether that's worth it as checking for those gaps might be
too expensive for the case where a large number of va entries needs to
be flushed.
We'll experiment with a tlb_flush_all() fallback on that ARM32 system in
the next days and see how that works out.
Thanks,
tglx
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Excessive TLB flush ranges
2023-05-15 16:43 Excessive TLB flush ranges Thomas Gleixner
@ 2023-05-15 16:59 ` Russell King (Oracle)
2023-05-15 19:46 ` Thomas Gleixner
2023-05-15 18:17 ` Uladzislau Rezki
2023-05-15 20:02 ` Nadav Amit
2 siblings, 1 reply; 75+ messages in thread
From: Russell King (Oracle) @ 2023-05-15 16:59 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Andrew Morton, linux-mm, Christoph Hellwig, Uladzislau Rezki,
Lorenzo Stoakes, Peter Zijlstra, Baoquan He, John Ogness,
linux-arm-kernel, Mark Rutland, Marc Zyngier
On Mon, May 15, 2023 at 06:43:40PM +0200, Thomas Gleixner wrote:
> Folks!
>
> We're observing massive latencies and slowdowns on ARM32 machines due to
> excessive TLB flush ranges.
>
> Those can be observed when tearing down a process, which has a seccomp
> BPF filter installed. ARM32 uses the vmalloc area for module space.
>
> bpf_prog_free_deferred()
> vfree()
> _vm_unmap_aliases()
> collect_per_cpu_vmap_blocks: start:0x95c8d000 end:0x95c8e000 size:0x1000
> __purge_vmap_area_lazy(start:0x95c8d000, end:0x95c8e000)
>
> va_start:0xf08a1000 va_end:0xf08a5000 size:0x00004000 gap:0x5ac13000 (371731 pages)
> va_start:0xf08a5000 va_end:0xf08a9000 size:0x00004000 gap:0x00000000 ( 0 pages)
> va_start:0xf08a9000 va_end:0xf08ad000 size:0x00004000 gap:0x00000000 ( 0 pages)
> va_start:0xf08ad000 va_end:0xf08b1000 size:0x00004000 gap:0x00000000 ( 0 pages)
> va_start:0xf08b3000 va_end:0xf08b7000 size:0x00004000 gap:0x00002000 ( 2 pages)
> va_start:0xf08b7000 va_end:0xf08bb000 size:0x00004000 gap:0x00000000 ( 0 pages)
> va_start:0xf08bb000 va_end:0xf08bf000 size:0x00004000 gap:0x00000000 ( 0 pages)
> va_start:0xf0a15000 va_end:0xf0a17000 size:0x00002000 gap:0x00156000 ( 342 pages)
>
> flush_tlb_kernel_range(start:0x95c8d000, end:0xf0a17000)
>
> Does 372106 flush operations where only 31 are useful
So, you asked the architecture to flush a large range, and are then
surprised if it takes a long time. There is no way to know how many
of those are useful.
Now, while using the sledge hammer of flushing all TLB entries may
sound like a good answer, if we're only evicting 31 entries, the
other entries are probably useful to have, no?
I think that you'd only run into this if you had a huge BPF
program and you tore it down, no?
--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Excessive TLB flush ranges
2023-05-15 16:43 Excessive TLB flush ranges Thomas Gleixner
2023-05-15 16:59 ` Russell King (Oracle)
@ 2023-05-15 18:17 ` Uladzislau Rezki
2023-05-16 2:26 ` Baoquan He
2023-05-15 20:02 ` Nadav Amit
2 siblings, 1 reply; 75+ messages in thread
From: Uladzislau Rezki @ 2023-05-15 18:17 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Andrew Morton, linux-mm, Christoph Hellwig, Uladzislau Rezki,
Lorenzo Stoakes, Peter Zijlstra, Baoquan He, John Ogness,
linux-arm-kernel, Russell King, Mark Rutland, Marc Zyngier
On Mon, May 15, 2023 at 06:43:40PM +0200, Thomas Gleixner wrote:
> Folks!
>
> We're observing massive latencies and slowdowns on ARM32 machines due to
> excessive TLB flush ranges.
>
> Those can be observed when tearing down a process, which has a seccomp
> BPF filter installed. ARM32 uses the vmalloc area for module space.
>
> bpf_prog_free_deferred()
> vfree()
> _vm_unmap_aliases()
> collect_per_cpu_vmap_blocks: start:0x95c8d000 end:0x95c8e000 size:0x1000
> __purge_vmap_area_lazy(start:0x95c8d000, end:0x95c8e000)
>
> va_start:0xf08a1000 va_end:0xf08a5000 size:0x00004000 gap:0x5ac13000 (371731 pages)
> va_start:0xf08a5000 va_end:0xf08a9000 size:0x00004000 gap:0x00000000 ( 0 pages)
> va_start:0xf08a9000 va_end:0xf08ad000 size:0x00004000 gap:0x00000000 ( 0 pages)
> va_start:0xf08ad000 va_end:0xf08b1000 size:0x00004000 gap:0x00000000 ( 0 pages)
> va_start:0xf08b3000 va_end:0xf08b7000 size:0x00004000 gap:0x00002000 ( 2 pages)
> va_start:0xf08b7000 va_end:0xf08bb000 size:0x00004000 gap:0x00000000 ( 0 pages)
> va_start:0xf08bb000 va_end:0xf08bf000 size:0x00004000 gap:0x00000000 ( 0 pages)
> va_start:0xf0a15000 va_end:0xf0a17000 size:0x00002000 gap:0x00156000 ( 342 pages)
>
> flush_tlb_kernel_range(start:0x95c8d000, end:0xf0a17000)
>
> Does 372106 flush operations where only 31 are useful
>
> So for all architectures which lack a mechanism to do a full TLB flush
> in flush_tlb_kernel_range() this takes ages (4-8ms) and slows down
> realtime processes on the other CPUs by a factor of two and larger.
>
> So while ARM32, CSKY, NIOS, PPC (some variants), _should_ arguably have
> a fallback to tlb_flush_all() when the range is too large, there is
> another issue. I've seen a couple of instances where _vm_unmap_aliases()
> collects one page and the actual va list has only 2 pages, which might
> be eventually worth to flush one by one.
>
> I'm not sure whether that's worth it as checking for those gaps might be
> too expensive for the case where a large number of va entries needs to
> be flushed.
>
> We'll experiment with a tlb_flush_all() fallback on that ARM32 system in
> the next days and see how that works out.
>
For systems which lack a full TLB flush and to flush a long range is
a problem(it takes time), probably we can flush VA one by one. Because
currently we calculate a flush range [min:max] and that range includes
the space that might not be mapped at all. Like below:
VA_1 VA_2
|....|-------------------------|............|
10 12 60 68
. mapped;
- not mapped.
so we flush from 10 until 68. Instead, probably we can do a flush of VA_1
range and VA_2 range. On modern systems with many CPUs, it could be a big
slow down.
Just some thoughts.
--
Uladzislau Rezki
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Excessive TLB flush ranges
2023-05-15 16:59 ` Russell King (Oracle)
@ 2023-05-15 19:46 ` Thomas Gleixner
2023-05-15 21:11 ` Thomas Gleixner
0 siblings, 1 reply; 75+ messages in thread
From: Thomas Gleixner @ 2023-05-15 19:46 UTC (permalink / raw)
To: Russell King (Oracle)
Cc: Andrew Morton, linux-mm, Christoph Hellwig, Uladzislau Rezki,
Lorenzo Stoakes, Peter Zijlstra, Baoquan He, John Ogness,
linux-arm-kernel, Mark Rutland, Marc Zyngier, x86
On Mon, May 15 2023 at 17:59, Russell King wrote:
> On Mon, May 15, 2023 at 06:43:40PM +0200, Thomas Gleixner wrote:
>> bpf_prog_free_deferred()
>> vfree()
>> _vm_unmap_aliases()
>> collect_per_cpu_vmap_blocks: start:0x95c8d000 end:0x95c8e000 size:0x1000
>> __purge_vmap_area_lazy(start:0x95c8d000, end:0x95c8e000)
>>
>> va_start:0xf08a1000 va_end:0xf08a5000 size:0x00004000 gap:0x5ac13000 (371731 pages)
>> va_start:0xf08a5000 va_end:0xf08a9000 size:0x00004000 gap:0x00000000 ( 0 pages)
>> va_start:0xf08a9000 va_end:0xf08ad000 size:0x00004000 gap:0x00000000 ( 0 pages)
>> va_start:0xf08ad000 va_end:0xf08b1000 size:0x00004000 gap:0x00000000 ( 0 pages)
>> va_start:0xf08b3000 va_end:0xf08b7000 size:0x00004000 gap:0x00002000 ( 2 pages)
>> va_start:0xf08b7000 va_end:0xf08bb000 size:0x00004000 gap:0x00000000 ( 0 pages)
>> va_start:0xf08bb000 va_end:0xf08bf000 size:0x00004000 gap:0x00000000 ( 0 pages)
>> va_start:0xf0a15000 va_end:0xf0a17000 size:0x00002000 gap:0x00156000 ( 342 pages)
>>
>> flush_tlb_kernel_range(start:0x95c8d000, end:0xf0a17000)
>>
>> Does 372106 flush operations where only 31 are useful
>
> So, you asked the architecture to flush a large range, and are then
> surprised if it takes a long time. There is no way to know how many
> of those are useful.
I did not ask for that. That's the merge ranges logic in
__purge_vmap_area_lazy() which decides that the one page at 0x95c8d000
should build a flush range with the rest. I'm just the messenger :)
> Now, while using the sledge hammer of flushing all TLB entries may
> sound like a good answer, if we're only evicting 31 entries, the
> other entries are probably useful to have, no?
That's what I was asking already in the part you removed from the reply,
no?
> I think that you'd only run into this if you had a huge BPF
> program and you tore it down, no?
There was no huge BPF program. Some default seccomp muck.
I have another trace which shows that seccomp creates 10 BPF programs
for one process where each allocates 8K vmalloc memory in the
0xf0a.... address range.
On teardown this is even more horrible than the above. Every allocation
is deleted separately, i.e. 8k at a time and the pattern is always the
same. One extra page in the 0xca6..... address range is handed in via
_vm_unmap_aliases(), which expands the range insanely.
So this means ~1.5M flush operations to flush a total of 30 TLB entries.
That reproduces in a VM easily and has exactly the same behaviour:
Extra page[s] via The actual allocation
_vm_unmap_aliases() Pages Pages Flush start Pages
alloc: ffffc9000058e000 2
free : ffff888144751000 1 ffffc9000058e000 2 ffff888144751000 17312759359
alloc: ffffc90000595000 2
free : ffff8881424f0000 1 ffffc90000595000 2 ffff8881424f0000 17312768167
.....
seccomp seems to install 29 BPF programs for that process. So on exit()
this results in 29 full TLB flushes on x86, where each of them is used
to flush exactly three TLB entries.
The actual two page allocation (ffffc9...) is in the vmalloc space, the
extra page (ffff88...) is in the direct mapping.
This is a plain debian install with the a 6.4-rc1 kernel. The reproducer
is: # systemctl start logrotate
Thanks,
tglx
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Excessive TLB flush ranges
2023-05-15 16:43 Excessive TLB flush ranges Thomas Gleixner
2023-05-15 16:59 ` Russell King (Oracle)
2023-05-15 18:17 ` Uladzislau Rezki
@ 2023-05-15 20:02 ` Nadav Amit
2 siblings, 0 replies; 75+ messages in thread
From: Nadav Amit @ 2023-05-15 20:02 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Andrew Morton, linux-mm, Christoph Hellwig, Uladzislau Rezki,
Lorenzo Stoakes, Peter Zijlstra, Baoquan He, John Ogness,
linux-arm-kernel, Russell King, Mark Rutland, Marc Zyngier
> On May 15, 2023, at 9:43 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>
>
> So while ARM32, CSKY, NIOS, PPC (some variants), _should_ arguably have
> a fallback to tlb_flush_all() when the range is too large, there is
> another issue. I've seen a couple of instances where _vm_unmap_aliases()
> collects one page and the actual va list has only 2 pages, which might
> be eventually worth to flush one by one.
>
> I'm not sure whether that's worth it as checking for those gaps might be
> too expensive for the case where a large number of va entries needs to
> be flushed.
I know you were cc’d, but for the record, here's a related discussion:
https://lore.kernel.org/all/CA+55aFwVUkdaf0_rBk7uJHQjWXu+OcLTHc6FKuCn0Cb2Kvg9NA@mail.gmail.com/
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Excessive TLB flush ranges
2023-05-15 19:46 ` Thomas Gleixner
@ 2023-05-15 21:11 ` Thomas Gleixner
2023-05-15 21:31 ` Russell King (Oracle)
0 siblings, 1 reply; 75+ messages in thread
From: Thomas Gleixner @ 2023-05-15 21:11 UTC (permalink / raw)
To: Russell King (Oracle)
Cc: Andrew Morton, linux-mm, Christoph Hellwig, Uladzislau Rezki,
Lorenzo Stoakes, Peter Zijlstra, Baoquan He, John Ogness,
linux-arm-kernel, Mark Rutland, Marc Zyngier, x86
On Mon, May 15 2023 at 21:46, Thomas Gleixner wrote:
> On Mon, May 15 2023 at 17:59, Russell King wrote:
>> On Mon, May 15, 2023 at 06:43:40PM +0200, Thomas Gleixner wrote:
> That reproduces in a VM easily and has exactly the same behaviour:
>
> Extra page[s] via The actual allocation
> _vm_unmap_aliases() Pages Pages Flush start Pages
> alloc: ffffc9000058e000 2
> free : ffff888144751000 1 ffffc9000058e000 2 ffff888144751000 17312759359
>
> alloc: ffffc90000595000 2
> free : ffff8881424f0000 1 ffffc90000595000 2 ffff8881424f0000 17312768167
>
> .....
>
> seccomp seems to install 29 BPF programs for that process. So on exit()
> this results in 29 full TLB flushes on x86, where each of them is used
> to flush exactly three TLB entries.
>
> The actual two page allocation (ffffc9...) is in the vmalloc space, the
> extra page (ffff88...) is in the direct mapping.
I tried to flush them one by one, which is actually slightly slower.
That's not surprising as there are 3 * 29 instead of 29 IPIs and the
IPIs dominate the picture.
But that's not necessarily true for ARM32 as there are no IPIs involved
on the machine we are using, which is a dual-core Cortex-A9.
So I came up with the hack below, which is equally fast as the full
flush variant while the performance impact on the other CPUs is minimally
lower according to perf.
That probably should have another argument which tells how many TLBs
this flush affects, i.e. 3 in this example, so an architecture can
sensibly decide whether it wants to use flush all or not.
Thanks,
tglx
---
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1728,6 +1728,7 @@ static bool __purge_vmap_area_lazy(unsig
unsigned int num_purged_areas = 0;
struct list_head local_purge_list;
struct vmap_area *va, *n_va;
+ struct vmap_area tmp = { .va_start = start, .va_end = end };
lockdep_assert_held(&vmap_purge_lock);
@@ -1747,7 +1748,12 @@ static bool __purge_vmap_area_lazy(unsig
list_last_entry(&local_purge_list,
struct vmap_area, list)->va_end);
- flush_tlb_kernel_range(start, end);
+ if (tmp.va_end > tmp.va_start)
+ list_add(&tmp.list, &local_purge_list);
+ flush_tlb_kernel_vas(&local_purge_list);
+ if (tmp.va_end > tmp.va_start)
+ list_del(&tmp.list);
+
resched_threshold = lazy_max_pages() << 1;
spin_lock(&free_vmap_area_lock);
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -10,6 +10,7 @@
#include <linux/debugfs.h>
#include <linux/sched/smt.h>
#include <linux/task_work.h>
+#include <linux/vmalloc.h>
#include <asm/tlbflush.h>
#include <asm/mmu_context.h>
@@ -1081,6 +1082,24 @@ void flush_tlb_kernel_range(unsigned lon
}
}
+static void do_flush_vas(void *arg)
+{
+ struct list_head *list = arg;
+ struct vmap_area *va;
+ unsigned long addr;
+
+ list_for_each_entry(va, list, list) {
+ /* flush range by one by one 'invlpg' */
+ for (addr = va->va_start; addr < va->va_end; addr += PAGE_SIZE)
+ flush_tlb_one_kernel(addr);
+ }
+}
+
+void flush_tlb_kernel_vas(struct list_head *list)
+{
+ on_each_cpu(do_flush_vas, list, 1);
+}
+
/*
* This can be used from process context to figure out what the value of
* CR3 is without needing to do a (slow) __read_cr3().
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -295,4 +295,6 @@ bool vmalloc_dump_obj(void *object);
static inline bool vmalloc_dump_obj(void *object) { return false; }
#endif
+void flush_tlb_kernel_vas(struct list_head *list);
+
#endif /* _LINUX_VMALLOC_H */
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Excessive TLB flush ranges
2023-05-15 21:11 ` Thomas Gleixner
@ 2023-05-15 21:31 ` Russell King (Oracle)
2023-05-16 6:37 ` Thomas Gleixner
0 siblings, 1 reply; 75+ messages in thread
From: Russell King (Oracle) @ 2023-05-15 21:31 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Andrew Morton, linux-mm, Christoph Hellwig, Uladzislau Rezki,
Lorenzo Stoakes, Peter Zijlstra, Baoquan He, John Ogness,
linux-arm-kernel, Mark Rutland, Marc Zyngier, x86
On Mon, May 15, 2023 at 11:11:45PM +0200, Thomas Gleixner wrote:
> On Mon, May 15 2023 at 21:46, Thomas Gleixner wrote:
> > On Mon, May 15 2023 at 17:59, Russell King wrote:
> >> On Mon, May 15, 2023 at 06:43:40PM +0200, Thomas Gleixner wrote:
> > That reproduces in a VM easily and has exactly the same behaviour:
> >
> > Extra page[s] via The actual allocation
> > _vm_unmap_aliases() Pages Pages Flush start Pages
> > alloc: ffffc9000058e000 2
> > free : ffff888144751000 1 ffffc9000058e000 2 ffff888144751000 17312759359
> >
> > alloc: ffffc90000595000 2
> > free : ffff8881424f0000 1 ffffc90000595000 2 ffff8881424f0000 17312768167
> >
> > .....
> >
> > seccomp seems to install 29 BPF programs for that process. So on exit()
> > this results in 29 full TLB flushes on x86, where each of them is used
> > to flush exactly three TLB entries.
> >
> > The actual two page allocation (ffffc9...) is in the vmalloc space, the
> > extra page (ffff88...) is in the direct mapping.
>
> I tried to flush them one by one, which is actually slightly slower.
> That's not surprising as there are 3 * 29 instead of 29 IPIs and the
> IPIs dominate the picture.
>
> But that's not necessarily true for ARM32 as there are no IPIs involved
> on the machine we are using, which is a dual-core Cortex-A9.
>
> So I came up with the hack below, which is equally fast as the full
> flush variant while the performance impact on the other CPUs is minimally
> lower according to perf.
>
> That probably should have another argument which tells how many TLBs
> this flush affects, i.e. 3 in this example, so an architecture can
> sensibly decide whether it wants to use flush all or not.
>
> Thanks,
>
> tglx
> ---
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -1728,6 +1728,7 @@ static bool __purge_vmap_area_lazy(unsig
> unsigned int num_purged_areas = 0;
> struct list_head local_purge_list;
> struct vmap_area *va, *n_va;
> + struct vmap_area tmp = { .va_start = start, .va_end = end };
>
> lockdep_assert_held(&vmap_purge_lock);
>
> @@ -1747,7 +1748,12 @@ static bool __purge_vmap_area_lazy(unsig
> list_last_entry(&local_purge_list,
> struct vmap_area, list)->va_end);
>
> - flush_tlb_kernel_range(start, end);
> + if (tmp.va_end > tmp.va_start)
> + list_add(&tmp.list, &local_purge_list);
> + flush_tlb_kernel_vas(&local_purge_list);
> + if (tmp.va_end > tmp.va_start)
> + list_del(&tmp.list);
So basically we end up iterating over each VA range, which seems
sensible if the range is large and we have to iterate over it page
by page.
In the case you have, are "start" and "end" set on function entry
to a range, or are they set to ULONG_MAX,0 ? What I'm wondering is
whether we could get away with just having flush_tlb_kernel_vas().
Whether that's acceptable to others is a different question :)
> +
> resched_threshold = lazy_max_pages() << 1;
>
> spin_lock(&free_vmap_area_lock);
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -10,6 +10,7 @@
> #include <linux/debugfs.h>
> #include <linux/sched/smt.h>
> #include <linux/task_work.h>
> +#include <linux/vmalloc.h>
>
> #include <asm/tlbflush.h>
> #include <asm/mmu_context.h>
> @@ -1081,6 +1082,24 @@ void flush_tlb_kernel_range(unsigned lon
> }
> }
>
> +static void do_flush_vas(void *arg)
> +{
> + struct list_head *list = arg;
> + struct vmap_area *va;
> + unsigned long addr;
> +
> + list_for_each_entry(va, list, list) {
> + /* flush range by one by one 'invlpg' */
> + for (addr = va->va_start; addr < va->va_end; addr += PAGE_SIZE)
> + flush_tlb_one_kernel(addr);
Isn't this just the same as:
flush_tlb_kernel_range(va->va_start, va->va_end);
at least on ARM32, it should be - the range will be iterated over in
assembly instead of C, although it'll be out of line but should be
slightly faster.
Thanks.
--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Excessive TLB flush ranges
2023-05-15 18:17 ` Uladzislau Rezki
@ 2023-05-16 2:26 ` Baoquan He
2023-05-16 6:40 ` Thomas Gleixner
0 siblings, 1 reply; 75+ messages in thread
From: Baoquan He @ 2023-05-16 2:26 UTC (permalink / raw)
To: Uladzislau Rezki
Cc: Thomas Gleixner, Andrew Morton, linux-mm, Christoph Hellwig,
Lorenzo Stoakes, Peter Zijlstra, John Ogness, linux-arm-kernel,
Russell King, Mark Rutland, Marc Zyngier
On 05/15/23 at 08:17pm, Uladzislau Rezki wrote:
> On Mon, May 15, 2023 at 06:43:40PM +0200, Thomas Gleixner wrote:
> > Folks!
> >
> > We're observing massive latencies and slowdowns on ARM32 machines due to
> > excessive TLB flush ranges.
> >
> > Those can be observed when tearing down a process, which has a seccomp
> > BPF filter installed. ARM32 uses the vmalloc area for module space.
> >
> > bpf_prog_free_deferred()
> > vfree()
> > _vm_unmap_aliases()
> > collect_per_cpu_vmap_blocks: start:0x95c8d000 end:0x95c8e000 size:0x1000
> > __purge_vmap_area_lazy(start:0x95c8d000, end:0x95c8e000)
> >
> > va_start:0xf08a1000 va_end:0xf08a5000 size:0x00004000 gap:0x5ac13000 (371731 pages)
> > va_start:0xf08a5000 va_end:0xf08a9000 size:0x00004000 gap:0x00000000 ( 0 pages)
> > va_start:0xf08a9000 va_end:0xf08ad000 size:0x00004000 gap:0x00000000 ( 0 pages)
> > va_start:0xf08ad000 va_end:0xf08b1000 size:0x00004000 gap:0x00000000 ( 0 pages)
> > va_start:0xf08b3000 va_end:0xf08b7000 size:0x00004000 gap:0x00002000 ( 2 pages)
> > va_start:0xf08b7000 va_end:0xf08bb000 size:0x00004000 gap:0x00000000 ( 0 pages)
> > va_start:0xf08bb000 va_end:0xf08bf000 size:0x00004000 gap:0x00000000 ( 0 pages)
> > va_start:0xf0a15000 va_end:0xf0a17000 size:0x00002000 gap:0x00156000 ( 342 pages)
> >
> > flush_tlb_kernel_range(start:0x95c8d000, end:0xf0a17000)
> >
> > Does 372106 flush operations where only 31 are useful
> >
> > So for all architectures which lack a mechanism to do a full TLB flush
> > in flush_tlb_kernel_range() this takes ages (4-8ms) and slows down
> > realtime processes on the other CPUs by a factor of two and larger.
> >
> > So while ARM32, CSKY, NIOS, PPC (some variants), _should_ arguably have
> > a fallback to tlb_flush_all() when the range is too large, there is
> > another issue. I've seen a couple of instances where _vm_unmap_aliases()
> > collects one page and the actual va list has only 2 pages, which might
> > be eventually worth to flush one by one.
> >
> > I'm not sure whether that's worth it as checking for those gaps might be
> > too expensive for the case where a large number of va entries needs to
> > be flushed.
> >
> > We'll experiment with a tlb_flush_all() fallback on that ARM32 system in
> > the next days and see how that works out.
> >
> For systems which lack a full TLB flush and to flush a long range is
> a problem(it takes time), probably we can flush VA one by one. Because
> currently we calculate a flush range [min:max] and that range includes
> the space that might not be mapped at all. Like below:
It's fine if we only calculate a flush range of [min:max] with VA. In
vm_reset_perms(), it calculates the flush range with the impacted direct
mapping range, then merge it with VA's range. That looks really strange
and surprising. If the vm->pages[] are got from a lower part of physical
memory, the final merged flush will span tremendous range. Wondering why
we need merge the direct map range with VA range, then do flush. Not
sure if I misunderstand it.
>
>
> VA_1 VA_2
> |....|-------------------------|............|
> 10 12 60 68
>
> . mapped;
> - not mapped.
>
> so we flush from 10 until 68. Instead, probably we can do a flush of VA_1
> range and VA_2 range. On modern systems with many CPUs, it could be a big
> slow down.
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Excessive TLB flush ranges
2023-05-15 21:31 ` Russell King (Oracle)
@ 2023-05-16 6:37 ` Thomas Gleixner
2023-05-16 6:46 ` Thomas Gleixner
` (2 more replies)
0 siblings, 3 replies; 75+ messages in thread
From: Thomas Gleixner @ 2023-05-16 6:37 UTC (permalink / raw)
To: Russell King (Oracle)
Cc: Andrew Morton, linux-mm, Christoph Hellwig, Uladzislau Rezki,
Lorenzo Stoakes, Peter Zijlstra, Baoquan He, John Ogness,
linux-arm-kernel, Mark Rutland, Marc Zyngier, x86
On Mon, May 15 2023 at 22:31, Russell King wrote:
> On Mon, May 15, 2023 at 11:11:45PM +0200, Thomas Gleixner wrote:
>> But that's not necessarily true for ARM32 as there are no IPIs involved
>> on the machine we are using, which is a dual-core Cortex-A9.
>>
>> So I came up with the hack below, which is equally fast as the full
>> flush variant while the performance impact on the other CPUs is minimally
>> lower according to perf.
>>
>> That probably should have another argument which tells how many TLBs
>> this flush affects, i.e. 3 in this example, so an architecture can
>> sensibly decide whether it wants to use flush all or not.
>> @@ -1747,7 +1748,12 @@ static bool __purge_vmap_area_lazy(unsig
>> list_last_entry(&local_purge_list,
>> struct vmap_area, list)->va_end);
>>
>> - flush_tlb_kernel_range(start, end);
>> + if (tmp.va_end > tmp.va_start)
>> + list_add(&tmp.list, &local_purge_list);
>> + flush_tlb_kernel_vas(&local_purge_list);
>> + if (tmp.va_end > tmp.va_start)
>> + list_del(&tmp.list);
>
> So basically we end up iterating over each VA range, which seems
> sensible if the range is large and we have to iterate over it page
> by page.
Right.
> In the case you have, are "start" and "end" set on function entry
> to a range, or are they set to ULONG_MAX,0 ? What I'm wondering is
> whether we could get away with just having flush_tlb_kernel_vas().
>
> Whether that's acceptable to others is a different question :)
As I said flush_tlb_kernel_vas() should be
void flush_tlb_kernel_vas(struct list_head *list, unsigned int num_entries):
So that an architecture can decide whether it's worth to do walk the
entries or whether it resorts to a flush all.
>> +static void do_flush_vas(void *arg)
>> +{
>> + struct list_head *list = arg;
>> + struct vmap_area *va;
>> + unsigned long addr;
>> +
>> + list_for_each_entry(va, list, list) {
>> + /* flush range by one by one 'invlpg' */
>> + for (addr = va->va_start; addr < va->va_end; addr += PAGE_SIZE)
>> + flush_tlb_one_kernel(addr);
>
> Isn't this just the same as:
> flush_tlb_kernel_range(va->va_start, va->va_end);
Indeed.
Thanks,
tglx
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Excessive TLB flush ranges
2023-05-16 2:26 ` Baoquan He
@ 2023-05-16 6:40 ` Thomas Gleixner
2023-05-16 8:07 ` Baoquan He
0 siblings, 1 reply; 75+ messages in thread
From: Thomas Gleixner @ 2023-05-16 6:40 UTC (permalink / raw)
To: Baoquan He, Uladzislau Rezki
Cc: Andrew Morton, linux-mm, Christoph Hellwig, Lorenzo Stoakes,
Peter Zijlstra, John Ogness, linux-arm-kernel, Russell King,
Mark Rutland, Marc Zyngier
On Tue, May 16 2023 at 10:26, Baoquan He wrote:
> On 05/15/23 at 08:17pm, Uladzislau Rezki wrote:
>> For systems which lack a full TLB flush and to flush a long range is
>> a problem(it takes time), probably we can flush VA one by one. Because
>> currently we calculate a flush range [min:max] and that range includes
>> the space that might not be mapped at all. Like below:
>
> It's fine if we only calculate a flush range of [min:max] with VA. In
> vm_reset_perms(), it calculates the flush range with the impacted direct
> mapping range, then merge it with VA's range. That looks really strange
> and surprising. If the vm->pages[] are got from a lower part of physical
> memory, the final merged flush will span tremendous range. Wondering why
> we need merge the direct map range with VA range, then do flush. Not
> sure if I misunderstand it.
So what happens on this BPF teardown is:
The vfree(8k) ends up flushing 3 entries. The actual vmalloc part (2) and
one extra which is in the direct map. I haven't verified that yet, but I
assume it's the alias of one of the vmalloc'ed pages.
Thanks,
tglx
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Excessive TLB flush ranges
2023-05-16 6:37 ` Thomas Gleixner
@ 2023-05-16 6:46 ` Thomas Gleixner
2023-05-16 8:18 ` Thomas Gleixner
2023-05-16 8:19 ` Russell King (Oracle)
2 siblings, 0 replies; 75+ messages in thread
From: Thomas Gleixner @ 2023-05-16 6:46 UTC (permalink / raw)
To: Russell King (Oracle)
Cc: Andrew Morton, linux-mm, Christoph Hellwig, Uladzislau Rezki,
Lorenzo Stoakes, Peter Zijlstra, Baoquan He, John Ogness,
linux-arm-kernel, Mark Rutland, Marc Zyngier, x86
On Tue, May 16 2023 at 08:37, Thomas Gleixner wrote:
> On Mon, May 15 2023 at 22:31, Russell King wrote:
>> In the case you have, are "start" and "end" set on function entry
>> to a range, or are they set to ULONG_MAX,0 ? What I'm wondering is
>> whether we could get away with just having flush_tlb_kernel_vas().
>>
>> Whether that's acceptable to others is a different question :)
>
> As I said flush_tlb_kernel_vas() should be
>
> void flush_tlb_kernel_vas(struct list_head *list, unsigned int num_entries):
>
> So that an architecture can decide whether it's worth to do walk the
> entries or whether it resorts to a flush all.
The only issue is that the flush range which is handed in from
_vm_unmap_aliases(), i.e. the conglomorate of to be mopped up TLBs is an
aggregate too. In this particular BPF case it's always one page, but
that obviously might end up being a horrible large range too.
Though there is no way to do that fake vmap_area trick I used in
__purge_vmap_area_lazy(). Bah!
Thanks,
tglx
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Excessive TLB flush ranges
2023-05-16 6:40 ` Thomas Gleixner
@ 2023-05-16 8:07 ` Baoquan He
2023-05-16 8:10 ` Baoquan He
` (2 more replies)
0 siblings, 3 replies; 75+ messages in thread
From: Baoquan He @ 2023-05-16 8:07 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Uladzislau Rezki, Andrew Morton, linux-mm, Christoph Hellwig,
Lorenzo Stoakes, Peter Zijlstra, John Ogness, linux-arm-kernel,
Russell King, Mark Rutland, Marc Zyngier
On 05/16/23 at 08:40am, Thomas Gleixner wrote:
> On Tue, May 16 2023 at 10:26, Baoquan He wrote:
> > On 05/15/23 at 08:17pm, Uladzislau Rezki wrote:
> >> For systems which lack a full TLB flush and to flush a long range is
> >> a problem(it takes time), probably we can flush VA one by one. Because
> >> currently we calculate a flush range [min:max] and that range includes
> >> the space that might not be mapped at all. Like below:
> >
> > It's fine if we only calculate a flush range of [min:max] with VA. In
> > vm_reset_perms(), it calculates the flush range with the impacted direct
> > mapping range, then merge it with VA's range. That looks really strange
> > and surprising. If the vm->pages[] are got from a lower part of physical
> > memory, the final merged flush will span tremendous range. Wondering why
> > we need merge the direct map range with VA range, then do flush. Not
> > sure if I misunderstand it.
>
> So what happens on this BPF teardown is:
>
> The vfree(8k) ends up flushing 3 entries. The actual vmalloc part (2) and
> one extra which is in the direct map. I haven't verified that yet, but I
> assume it's the alias of one of the vmalloc'ed pages.
It looks like the reason. As Uladzislau pointed out, ARCH-es may
have full TLB flush, so won't get trouble from the merged flush
in the calculated [min:max] way, e.g arm64 and x86's flush_tlb_kernel_range().
However, arm32 seems lacking the ability of full TLB flash. If agreed, I
can make a draft patch to do the flush for direct map and VA seperately,
see if it works.
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Excessive TLB flush ranges
2023-05-16 8:07 ` Baoquan He
@ 2023-05-16 8:10 ` Baoquan He
2023-05-16 8:45 ` Russell King (Oracle)
2023-05-16 8:54 ` Thomas Gleixner
2 siblings, 0 replies; 75+ messages in thread
From: Baoquan He @ 2023-05-16 8:10 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Uladzislau Rezki, Andrew Morton, linux-mm, Christoph Hellwig,
Lorenzo Stoakes, Peter Zijlstra, John Ogness, linux-arm-kernel,
Russell King, Mark Rutland, Marc Zyngier
On 05/16/23 at 04:07pm, Baoquan He wrote:
> On 05/16/23 at 08:40am, Thomas Gleixner wrote:
> > On Tue, May 16 2023 at 10:26, Baoquan He wrote:
> > > On 05/15/23 at 08:17pm, Uladzislau Rezki wrote:
> > >> For systems which lack a full TLB flush and to flush a long range is
> > >> a problem(it takes time), probably we can flush VA one by one. Because
> > >> currently we calculate a flush range [min:max] and that range includes
> > >> the space that might not be mapped at all. Like below:
> > >
> > > It's fine if we only calculate a flush range of [min:max] with VA. In
> > > vm_reset_perms(), it calculates the flush range with the impacted direct
> > > mapping range, then merge it with VA's range. That looks really strange
> > > and surprising. If the vm->pages[] are got from a lower part of physical
> > > memory, the final merged flush will span tremendous range. Wondering why
> > > we need merge the direct map range with VA range, then do flush. Not
> > > sure if I misunderstand it.
> >
> > So what happens on this BPF teardown is:
> >
> > The vfree(8k) ends up flushing 3 entries. The actual vmalloc part (2) and
> > one extra which is in the direct map. I haven't verified that yet, but I
> > assume it's the alias of one of the vmalloc'ed pages.
>
> It looks like the reason. As Uladzislau pointed out, ARCH-es may
> have full TLB flush, so won't get trouble from the merged flush
> in the calculated [min:max] way, e.g arm64 and x86's flush_tlb_kernel_range().
> However, arm32 seems lacking the ability of full TLB flash. If agreed, I
> can make a draft patch to do the flush for direct map and VA seperately,
> see if it works.
Ah, didn't notice there's draft patch under discussing in another
sub-thread, please ignore this mail. Will check that sub-thread.
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Excessive TLB flush ranges
2023-05-16 6:37 ` Thomas Gleixner
2023-05-16 6:46 ` Thomas Gleixner
@ 2023-05-16 8:18 ` Thomas Gleixner
2023-05-16 8:20 ` Thomas Gleixner
2023-05-16 8:21 ` Russell King (Oracle)
2023-05-16 8:19 ` Russell King (Oracle)
2 siblings, 2 replies; 75+ messages in thread
From: Thomas Gleixner @ 2023-05-16 8:18 UTC (permalink / raw)
To: Russell King (Oracle)
Cc: Andrew Morton, linux-mm, Christoph Hellwig, Uladzislau Rezki,
Lorenzo Stoakes, Peter Zijlstra, Baoquan He, John Ogness,
linux-arm-kernel, Mark Rutland, Marc Zyngier, x86
On Tue, May 16 2023 at 08:37, Thomas Gleixner wrote:
> On Mon, May 15 2023 at 22:31, Russell King wrote:
>>> + list_for_each_entry(va, list, list) {
>>> + /* flush range by one by one 'invlpg' */
>>> + for (addr = va->va_start; addr < va->va_end; addr += PAGE_SIZE)
>>> + flush_tlb_one_kernel(addr);
>>
>> Isn't this just the same as:
>> flush_tlb_kernel_range(va->va_start, va->va_end);
>
> Indeed.
Actually not. At least not on x86 where it'd end up with 3 IPIs for that
case again, instead of having one which walks the list on each CPU.
Thanks,
tglx
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Excessive TLB flush ranges
2023-05-16 6:37 ` Thomas Gleixner
2023-05-16 6:46 ` Thomas Gleixner
2023-05-16 8:18 ` Thomas Gleixner
@ 2023-05-16 8:19 ` Russell King (Oracle)
2023-05-16 8:44 ` Thomas Gleixner
2 siblings, 1 reply; 75+ messages in thread
From: Russell King (Oracle) @ 2023-05-16 8:19 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Andrew Morton, linux-mm, Christoph Hellwig, Uladzislau Rezki,
Lorenzo Stoakes, Peter Zijlstra, Baoquan He, John Ogness,
linux-arm-kernel, Mark Rutland, Marc Zyngier, x86
On Tue, May 16, 2023 at 08:37:18AM +0200, Thomas Gleixner wrote:
> On Mon, May 15 2023 at 22:31, Russell King wrote:
> > On Mon, May 15, 2023 at 11:11:45PM +0200, Thomas Gleixner wrote:
> >> But that's not necessarily true for ARM32 as there are no IPIs involved
> >> on the machine we are using, which is a dual-core Cortex-A9.
> >>
> >> So I came up with the hack below, which is equally fast as the full
> >> flush variant while the performance impact on the other CPUs is minimally
> >> lower according to perf.
> >>
> >> That probably should have another argument which tells how many TLBs
> >> this flush affects, i.e. 3 in this example, so an architecture can
> >> sensibly decide whether it wants to use flush all or not.
> >> @@ -1747,7 +1748,12 @@ static bool __purge_vmap_area_lazy(unsig
> >> list_last_entry(&local_purge_list,
> >> struct vmap_area, list)->va_end);
> >>
> >> - flush_tlb_kernel_range(start, end);
> >> + if (tmp.va_end > tmp.va_start)
> >> + list_add(&tmp.list, &local_purge_list);
> >> + flush_tlb_kernel_vas(&local_purge_list);
> >> + if (tmp.va_end > tmp.va_start)
> >> + list_del(&tmp.list);
> >
> > So basically we end up iterating over each VA range, which seems
> > sensible if the range is large and we have to iterate over it page
> > by page.
>
> Right.
>
> > In the case you have, are "start" and "end" set on function entry
> > to a range, or are they set to ULONG_MAX,0 ? What I'm wondering is
> > whether we could get away with just having flush_tlb_kernel_vas().
> >
> > Whether that's acceptable to others is a different question :)
>
> As I said flush_tlb_kernel_vas() should be
>
> void flush_tlb_kernel_vas(struct list_head *list, unsigned int num_entries):
>
> So that an architecture can decide whether it's worth to do walk the
> entries or whether it resorts to a flush all.
Is "num_entries" what an arch would want to use? How would it use that?
It doesn't tell an arch whether there is a large range of many list
entries, or a single entry covering a large range.
Wouldn't passing "start" and "end" allow an arch to check for a small
range, and if the range is small, just call flush_tlb_kernel_range().
If the range is larger, then it can walk the list to decide whether
it makes sense to flush by list entry. If it deems that it would be
too much, then it could then flush all tlb entries.
The down-side to that approach is its more arch-side code of course.
--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Excessive TLB flush ranges
2023-05-16 8:18 ` Thomas Gleixner
@ 2023-05-16 8:20 ` Thomas Gleixner
2023-05-16 8:27 ` Russell King (Oracle)
2023-05-16 8:21 ` Russell King (Oracle)
1 sibling, 1 reply; 75+ messages in thread
From: Thomas Gleixner @ 2023-05-16 8:20 UTC (permalink / raw)
To: Russell King (Oracle)
Cc: Andrew Morton, linux-mm, Christoph Hellwig, Uladzislau Rezki,
Lorenzo Stoakes, Peter Zijlstra, Baoquan He, John Ogness,
linux-arm-kernel, Mark Rutland, Marc Zyngier, x86
On Tue, May 16 2023 at 10:18, Thomas Gleixner wrote:
> On Tue, May 16 2023 at 08:37, Thomas Gleixner wrote:
>> On Mon, May 15 2023 at 22:31, Russell King wrote:
>>>> + list_for_each_entry(va, list, list) {
>>>> + /* flush range by one by one 'invlpg' */
>>>> + for (addr = va->va_start; addr < va->va_end; addr += PAGE_SIZE)
>>>> + flush_tlb_one_kernel(addr);
>>>
>>> Isn't this just the same as:
>>> flush_tlb_kernel_range(va->va_start, va->va_end);
>>
>> Indeed.
>
> Actually not. At least not on x86 where it'd end up with 3 IPIs for that
> case again, instead of having one which walks the list on each CPU.
ARM32 has the same problem when tlb_ops_need_broadcast() is true.
Thanks,
tglx
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Excessive TLB flush ranges
2023-05-16 8:18 ` Thomas Gleixner
2023-05-16 8:20 ` Thomas Gleixner
@ 2023-05-16 8:21 ` Russell King (Oracle)
1 sibling, 0 replies; 75+ messages in thread
From: Russell King (Oracle) @ 2023-05-16 8:21 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Andrew Morton, linux-mm, Christoph Hellwig, Uladzislau Rezki,
Lorenzo Stoakes, Peter Zijlstra, Baoquan He, John Ogness,
linux-arm-kernel, Mark Rutland, Marc Zyngier, x86
On Tue, May 16, 2023 at 10:18:36AM +0200, Thomas Gleixner wrote:
> On Tue, May 16 2023 at 08:37, Thomas Gleixner wrote:
> > On Mon, May 15 2023 at 22:31, Russell King wrote:
> >>> + list_for_each_entry(va, list, list) {
> >>> + /* flush range by one by one 'invlpg' */
> >>> + for (addr = va->va_start; addr < va->va_end; addr += PAGE_SIZE)
> >>> + flush_tlb_one_kernel(addr);
> >>
> >> Isn't this just the same as:
> >> flush_tlb_kernel_range(va->va_start, va->va_end);
> >
> > Indeed.
>
> Actually not. At least not on x86 where it'd end up with 3 IPIs for that
> case again, instead of having one which walks the list on each CPU.
So what the best thing to do here is up to the arch implementation.
Maybe generic code shouldn't be making the decision here about
whether to call flush_tlb_one_kernel() or flush_tlb_kernel_range().
--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Excessive TLB flush ranges
2023-05-16 8:20 ` Thomas Gleixner
@ 2023-05-16 8:27 ` Russell King (Oracle)
2023-05-16 9:03 ` Thomas Gleixner
0 siblings, 1 reply; 75+ messages in thread
From: Russell King (Oracle) @ 2023-05-16 8:27 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Andrew Morton, linux-mm, Christoph Hellwig, Uladzislau Rezki,
Lorenzo Stoakes, Peter Zijlstra, Baoquan He, John Ogness,
linux-arm-kernel, Mark Rutland, Marc Zyngier, x86
On Tue, May 16, 2023 at 10:20:37AM +0200, Thomas Gleixner wrote:
> On Tue, May 16 2023 at 10:18, Thomas Gleixner wrote:
>
> > On Tue, May 16 2023 at 08:37, Thomas Gleixner wrote:
> >> On Mon, May 15 2023 at 22:31, Russell King wrote:
> >>>> + list_for_each_entry(va, list, list) {
> >>>> + /* flush range by one by one 'invlpg' */
> >>>> + for (addr = va->va_start; addr < va->va_end; addr += PAGE_SIZE)
> >>>> + flush_tlb_one_kernel(addr);
> >>>
> >>> Isn't this just the same as:
> >>> flush_tlb_kernel_range(va->va_start, va->va_end);
> >>
> >> Indeed.
> >
> > Actually not. At least not on x86 where it'd end up with 3 IPIs for that
> > case again, instead of having one which walks the list on each CPU.
>
> ARM32 has the same problem when tlb_ops_need_broadcast() is true.
If tlb_ops_need_broadcast() is true, then isn't it one IPI to other
CPUs to flush the range, and possibly another for the Cortex-A15
erratum?
I've no idea what flush_tlb_one_kernel() is. I can find no such
implementation, there is flush_tlb_kernel_page() though, which I
think is what you're referring to above. On ARM32, that will issue
one IPI each time it's called, and possibly another IPI for the
Cortex-A15 erratum.
Given that, flush_tlb_kernel_range() is still going to be more
efficient on ARM32 when tlb_ops_need_broadcast() is true than doing
it page by page.
--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Excessive TLB flush ranges
2023-05-16 8:19 ` Russell King (Oracle)
@ 2023-05-16 8:44 ` Thomas Gleixner
2023-05-16 8:48 ` Russell King (Oracle)
0 siblings, 1 reply; 75+ messages in thread
From: Thomas Gleixner @ 2023-05-16 8:44 UTC (permalink / raw)
To: Russell King (Oracle)
Cc: Andrew Morton, linux-mm, Christoph Hellwig, Uladzislau Rezki,
Lorenzo Stoakes, Peter Zijlstra, Baoquan He, John Ogness,
linux-arm-kernel, Mark Rutland, Marc Zyngier, x86
On Tue, May 16 2023 at 09:19, Russell King wrote:
> On Tue, May 16, 2023 at 08:37:18AM +0200, Thomas Gleixner wrote:
>> void flush_tlb_kernel_vas(struct list_head *list, unsigned int num_entries):
>>
>> So that an architecture can decide whether it's worth to do walk the
>> entries or whether it resorts to a flush all.
>
> Is "num_entries" what an arch would want to use? How would it use that?
> It doesn't tell an arch whether there is a large range of many list
> entries, or a single entry covering a large range.
Does it matter?
The total number of entries to flush is what accumulates and at some
architecture specific threshold that becomes more expensive than a full
flush, independent of the range of the individual list entries, no?
Thanks,
tglx
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Excessive TLB flush ranges
2023-05-16 8:07 ` Baoquan He
2023-05-16 8:10 ` Baoquan He
@ 2023-05-16 8:45 ` Russell King (Oracle)
2023-05-16 9:13 ` Thomas Gleixner
2023-05-16 8:54 ` Thomas Gleixner
2 siblings, 1 reply; 75+ messages in thread
From: Russell King (Oracle) @ 2023-05-16 8:45 UTC (permalink / raw)
To: Baoquan He
Cc: Thomas Gleixner, Uladzislau Rezki, Andrew Morton, linux-mm,
Christoph Hellwig, Lorenzo Stoakes, Peter Zijlstra, John Ogness,
linux-arm-kernel, Mark Rutland, Marc Zyngier
On Tue, May 16, 2023 at 04:07:17PM +0800, Baoquan He wrote:
> On 05/16/23 at 08:40am, Thomas Gleixner wrote:
> > On Tue, May 16 2023 at 10:26, Baoquan He wrote:
> > > On 05/15/23 at 08:17pm, Uladzislau Rezki wrote:
> > >> For systems which lack a full TLB flush and to flush a long range is
> > >> a problem(it takes time), probably we can flush VA one by one. Because
> > >> currently we calculate a flush range [min:max] and that range includes
> > >> the space that might not be mapped at all. Like below:
> > >
> > > It's fine if we only calculate a flush range of [min:max] with VA. In
> > > vm_reset_perms(), it calculates the flush range with the impacted direct
> > > mapping range, then merge it with VA's range. That looks really strange
> > > and surprising. If the vm->pages[] are got from a lower part of physical
> > > memory, the final merged flush will span tremendous range. Wondering why
> > > we need merge the direct map range with VA range, then do flush. Not
> > > sure if I misunderstand it.
> >
> > So what happens on this BPF teardown is:
> >
> > The vfree(8k) ends up flushing 3 entries. The actual vmalloc part (2) and
> > one extra which is in the direct map. I haven't verified that yet, but I
> > assume it's the alias of one of the vmalloc'ed pages.
>
> It looks like the reason. As Uladzislau pointed out, ARCH-es may
> have full TLB flush, so won't get trouble from the merged flush
> in the calculated [min:max] way, e.g arm64 and x86's flush_tlb_kernel_range().
> However, arm32 seems lacking the ability of full TLB flash. If agreed, I
> can make a draft patch to do the flush for direct map and VA seperately,
> see if it works.
The question IMHO is not so much whether there's a full-TLB flush
available, but whether it is appropriate to use it. If we're only
wanting to flush a small number of TLB entries but over a sparse
range (which seems to be Thomas' situation), does it make any sense
to flush all TLB entries? I don't think it does, but it depends
how often this occurs. If we're doing it on a regular basis because
of some workload, then that workload suffers. If it's a rare event
then maybe that's okay to do.
--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Excessive TLB flush ranges
2023-05-16 8:44 ` Thomas Gleixner
@ 2023-05-16 8:48 ` Russell King (Oracle)
2023-05-16 12:09 ` Thomas Gleixner
0 siblings, 1 reply; 75+ messages in thread
From: Russell King (Oracle) @ 2023-05-16 8:48 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Andrew Morton, linux-mm, Christoph Hellwig, Uladzislau Rezki,
Lorenzo Stoakes, Peter Zijlstra, Baoquan He, John Ogness,
linux-arm-kernel, Mark Rutland, Marc Zyngier, x86
On Tue, May 16, 2023 at 10:44:07AM +0200, Thomas Gleixner wrote:
> On Tue, May 16 2023 at 09:19, Russell King wrote:
> > On Tue, May 16, 2023 at 08:37:18AM +0200, Thomas Gleixner wrote:
> >> void flush_tlb_kernel_vas(struct list_head *list, unsigned int num_entries):
> >>
> >> So that an architecture can decide whether it's worth to do walk the
> >> entries or whether it resorts to a flush all.
> >
> > Is "num_entries" what an arch would want to use? How would it use that?
> > It doesn't tell an arch whether there is a large range of many list
> > entries, or a single entry covering a large range.
>
> Does it matter?
>
> The total number of entries to flush is what accumulates and at some
> architecture specific threshold that becomes more expensive than a full
> flush, independent of the range of the individual list entries, no?
It depends what you mean by "num_entries" - is that the number of
pages to be flushed in total in the range?
If so, what does a valid "start" and "end" range passed to
__purge_vmap_area_lazy() mean for num_entries - does that go to
(end - start) / PAGE_SIZE, or would it still be restricted to the
sum of that per list entry? If so, what's the point of passing in
"start" and "end" to this function?
I'm not familiar with this code...
--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Excessive TLB flush ranges
2023-05-16 8:07 ` Baoquan He
2023-05-16 8:10 ` Baoquan He
2023-05-16 8:45 ` Russell King (Oracle)
@ 2023-05-16 8:54 ` Thomas Gleixner
2023-05-16 9:48 ` Baoquan He
2 siblings, 1 reply; 75+ messages in thread
From: Thomas Gleixner @ 2023-05-16 8:54 UTC (permalink / raw)
To: Baoquan He
Cc: Uladzislau Rezki, Andrew Morton, linux-mm, Christoph Hellwig,
Lorenzo Stoakes, Peter Zijlstra, John Ogness, linux-arm-kernel,
Russell King, Mark Rutland, Marc Zyngier
On Tue, May 16 2023 at 16:07, Baoquan He wrote:
> On 05/16/23 at 08:40am, Thomas Gleixner wrote:
>> On Tue, May 16 2023 at 10:26, Baoquan He wrote:
>> > On 05/15/23 at 08:17pm, Uladzislau Rezki wrote:
>> >> For systems which lack a full TLB flush and to flush a long range is
>> >> a problem(it takes time), probably we can flush VA one by one. Because
>> >> currently we calculate a flush range [min:max] and that range includes
>> >> the space that might not be mapped at all. Like below:
>> >
>> > It's fine if we only calculate a flush range of [min:max] with VA. In
>> > vm_reset_perms(), it calculates the flush range with the impacted direct
>> > mapping range, then merge it with VA's range. That looks really strange
>> > and surprising. If the vm->pages[] are got from a lower part of physical
>> > memory, the final merged flush will span tremendous range. Wondering why
>> > we need merge the direct map range with VA range, then do flush. Not
>> > sure if I misunderstand it.
>>
>> So what happens on this BPF teardown is:
>>
>> The vfree(8k) ends up flushing 3 entries. The actual vmalloc part (2) and
>> one extra which is in the direct map. I haven't verified that yet, but I
>> assume it's the alias of one of the vmalloc'ed pages.
>
> It looks like the reason. As Uladzislau pointed out, ARCH-es may
> have full TLB flush, so won't get trouble from the merged flush
> in the calculated [min:max] way, e.g arm64 and x86's flush_tlb_kernel_range().
> However, arm32 seems lacking the ability of full TLB flash.
ARM has a full flush, but it does not check for that in
flush_tlb_kernel_range().
> If agreed, I can make a draft patch to do the flush for direct map and
> VA seperately, see if it works.
Of course it works. Already done that.
But you are missing the point. Look at the examples I provided.
The current implementation ends up doing a full flush on x86 just to
flush 3 TLB entries. For the very same reason because the flush range
(start..end) becomes insanely large due to the direct map and vmalloc
parts.
But doing indivudual flushes for direct map and vmalloc space is silly
too because then it ends up doing two IPIs instead of one. IPIs are
expensive and the whole point of coalescing the flushes is to spare
IPIs, no?
So with my hacked up flush_tlb_kernel_vas() I end up having exactly
_one_ IPI which walks the list and flushes the 3 TLB entries.
Thanks,
tglx
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Excessive TLB flush ranges
2023-05-16 8:27 ` Russell King (Oracle)
@ 2023-05-16 9:03 ` Thomas Gleixner
2023-05-16 10:05 ` Baoquan He
2023-05-19 13:49 ` Excessive TLB flush ranges Thomas Gleixner
0 siblings, 2 replies; 75+ messages in thread
From: Thomas Gleixner @ 2023-05-16 9:03 UTC (permalink / raw)
To: Russell King (Oracle)
Cc: Andrew Morton, linux-mm, Christoph Hellwig, Uladzislau Rezki,
Lorenzo Stoakes, Peter Zijlstra, Baoquan He, John Ogness,
linux-arm-kernel, Mark Rutland, Marc Zyngier, x86
On Tue, May 16 2023 at 09:27, Russell King wrote:
> On Tue, May 16, 2023 at 10:20:37AM +0200, Thomas Gleixner wrote:
>> On Tue, May 16 2023 at 10:18, Thomas Gleixner wrote:
>>
>> > On Tue, May 16 2023 at 08:37, Thomas Gleixner wrote:
>> >> On Mon, May 15 2023 at 22:31, Russell King wrote:
>> >>>> + list_for_each_entry(va, list, list) {
>> >>>> + /* flush range by one by one 'invlpg' */
>> >>>> + for (addr = va->va_start; addr < va->va_end; addr += PAGE_SIZE)
>> >>>> + flush_tlb_one_kernel(addr);
>> >>>
>> >>> Isn't this just the same as:
>> >>> flush_tlb_kernel_range(va->va_start, va->va_end);
>> >>
>> >> Indeed.
>> >
>> > Actually not. At least not on x86 where it'd end up with 3 IPIs for that
>> > case again, instead of having one which walks the list on each CPU.
>>
>> ARM32 has the same problem when tlb_ops_need_broadcast() is true.
>
> If tlb_ops_need_broadcast() is true, then isn't it one IPI to other
> CPUs to flush the range, and possibly another for the Cortex-A15
> erratum?
>
> I've no idea what flush_tlb_one_kernel() is. I can find no such
The patch is against x86 and that function exists there. At lease git
grep claims so. :)
> implementation, there is flush_tlb_kernel_page() though, which I
> think is what you're referring to above. On ARM32, that will issue
> one IPI each time it's called, and possibly another IPI for the
> Cortex-A15 erratum.
>
> Given that, flush_tlb_kernel_range() is still going to be more
> efficient on ARM32 when tlb_ops_need_broadcast() is true than doing
> it page by page.
Something like the untested below?
I did not attempt anything to decide whether a full flush might be worth
it, but that's a separate problem.
Thanks,
tglx
---
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -270,6 +270,10 @@ config ARCH_HAS_SET_MEMORY
config ARCH_HAS_SET_DIRECT_MAP
bool
+# Select if architecture provides flush_tlb_kernel_vas()
+config ARCH_HAS_FLUSH_TLB_KERNEL_VAS
+ bool
+
#
# Select if the architecture provides the arch_dma_set_uncached symbol to
# either provide an uncached segment alias for a DMA allocation, or
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -10,6 +10,7 @@ config ARM
select ARCH_HAS_DMA_WRITE_COMBINE if !ARM_DMA_MEM_BUFFERABLE
select ARCH_HAS_ELF_RANDOMIZE
select ARCH_HAS_FORTIFY_SOURCE
+ select ARCH_HAS_FLUSH_TLB_KERNEL_VAS
select ARCH_HAS_KEEPINITRD
select ARCH_HAS_KCOV
select ARCH_HAS_MEMBARRIER_SYNC_CORE
--- a/arch/arm/kernel/smp_tlb.c
+++ b/arch/arm/kernel/smp_tlb.c
@@ -7,6 +7,7 @@
#include <linux/preempt.h>
#include <linux/smp.h>
#include <linux/uaccess.h>
+#include <linux/vmalloc.h>
#include <asm/smp_plat.h>
#include <asm/tlbflush.h>
@@ -69,6 +70,19 @@ static inline void ipi_flush_tlb_kernel_
local_flush_tlb_kernel_range(ta->ta_start, ta->ta_end);
}
+static inline void local_flush_tlb_kernel_vas(struct list_head *vmap_list)
+{
+ struct vmap_area *va;
+
+ list_for_each_entry(va, vmap_list, list)
+ local_flush_tlb_kernel_range(va->va_start, va->va_end);
+}
+
+static inline void ipi_flush_tlb_kernel_vas(void *arg)
+{
+ local_flush_tlb_kernel_vas(arg);
+}
+
static inline void ipi_flush_bp_all(void *ignored)
{
local_flush_bp_all();
@@ -244,6 +258,15 @@ void flush_tlb_kernel_range(unsigned lon
broadcast_tlb_a15_erratum();
}
+void flush_tlb_kernel_vas(struct list_head *vmap_list, unsigned long num_entries)
+{
+ if (tlb_ops_need_broadcast()) {
+ on_each_cpu(ipi_flush_tlb_kernel_vas, vmap_list, 1);
+ } else
+ local_flush_tlb_kernel_vas(vmap_list);
+ broadcast_tlb_a15_erratum();
+}
+
void flush_bp_all(void)
{
if (tlb_ops_need_broadcast())
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -77,6 +77,7 @@ config X86
select ARCH_HAS_DEVMEM_IS_ALLOWED
select ARCH_HAS_EARLY_DEBUG if KGDB
select ARCH_HAS_ELF_RANDOMIZE
+ select ARCH_HAS_FLUSH_TLB_KERNEL_VAS
select ARCH_HAS_FAST_MULTIPLIER
select ARCH_HAS_FORTIFY_SOURCE
select ARCH_HAS_GCOV_PROFILE_ALL
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -10,6 +10,7 @@
#include <linux/debugfs.h>
#include <linux/sched/smt.h>
#include <linux/task_work.h>
+#include <linux/vmalloc.h>
#include <asm/tlbflush.h>
#include <asm/mmu_context.h>
@@ -1081,6 +1082,27 @@ void flush_tlb_kernel_range(unsigned lon
}
}
+static void do_flush_tlb_vas(void *arg)
+{
+ struct list_head *vmap_list = arg;
+ struct vmap_area *va;
+ unsigned long addr;
+
+ list_for_each_entry(va, vmap_list, list) {
+ /* flush range by one by one 'invlpg' */
+ for (addr = va->va_start; addr < va->va_end; addr += PAGE_SIZE)
+ flush_tlb_one_kernel(addr);
+ }
+}
+
+void flush_tlb_kernel_vas(struct list_head *vmap_list, unsigned long num_entries)
+{
+ if (num_entries > tlb_single_page_flush_ceiling)
+ on_each_cpu(do_flush_tlb_all, NULL, 1);
+ else
+ on_each_cpu(do_flush_tlb_vas, vmap_list, 1);
+}
+
/*
* This can be used from process context to figure out what the value of
* CR3 is without needing to do a (slow) __read_cr3().
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -295,4 +295,6 @@ bool vmalloc_dump_obj(void *object);
static inline bool vmalloc_dump_obj(void *object) { return false; }
#endif
+void flush_tlb_kernel_vas(struct list_head *list, unsigned long num_entries);
+
#endif /* _LINUX_VMALLOC_H */
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1724,7 +1724,8 @@ static void purge_fragmented_blocks_allc
*/
static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
{
- unsigned long resched_threshold;
+ unsigned long resched_threshold, num_entries = 0, num_alias_entries = 0;
+ struct vmap_area alias_va = { .va_start = start, .va_end = end };
unsigned int num_purged_areas = 0;
struct list_head local_purge_list;
struct vmap_area *va, *n_va;
@@ -1736,18 +1737,29 @@ static bool __purge_vmap_area_lazy(unsig
list_replace_init(&purge_vmap_area_list, &local_purge_list);
spin_unlock(&purge_vmap_area_lock);
- if (unlikely(list_empty(&local_purge_list)))
- goto out;
+ start = min(start, list_first_entry(&local_purge_list, struct vmap_area, list)->va_start);
+ end = max(end, list_last_entry(&local_purge_list, struct vmap_area, list)->va_end);
- start = min(start,
- list_first_entry(&local_purge_list,
- struct vmap_area, list)->va_start);
-
- end = max(end,
- list_last_entry(&local_purge_list,
- struct vmap_area, list)->va_end);
+ if (IS_ENABLED(CONFIG_HAVE_FLUSH_TLB_KERNEL_VAS)) {
+ list_for_each_entry(va, &local_purge_list, list)
+ num_entries += (va->va_end - va->va_start) >> PAGE_SHIFT;
+
+ if (unlikely(!num_entries))
+ goto out;
+
+ if (alias_va.va_end > alias_va.va_start) {
+ num_alias_entries = (alias_va.va_end - alias_va.va_start) >> PAGE_SHIFT;
+ list_add(&alias_va.list, &local_purge_list);
+ }
+
+ flush_tlb_kernel_vas(&local_purge_list, num_entries + num_alias_entries);
+
+ if (num_alias_entries)
+ list_del(&alias_va.list);
+ } else {
+ flush_tlb_kernel_range(start, end);
+ }
- flush_tlb_kernel_range(start, end);
resched_threshold = lazy_max_pages() << 1;
spin_lock(&free_vmap_area_lock);
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Excessive TLB flush ranges
2023-05-16 8:45 ` Russell King (Oracle)
@ 2023-05-16 9:13 ` Thomas Gleixner
0 siblings, 0 replies; 75+ messages in thread
From: Thomas Gleixner @ 2023-05-16 9:13 UTC (permalink / raw)
To: Russell King (Oracle), Baoquan He
Cc: Uladzislau Rezki, Andrew Morton, linux-mm, Christoph Hellwig,
Lorenzo Stoakes, Peter Zijlstra, John Ogness, linux-arm-kernel,
Mark Rutland, Marc Zyngier
On Tue, May 16 2023 at 09:45, Russell King wrote:
> On Tue, May 16, 2023 at 04:07:17PM +0800, Baoquan He wrote:
>> It looks like the reason. As Uladzislau pointed out, ARCH-es may
>> have full TLB flush, so won't get trouble from the merged flush
>> in the calculated [min:max] way, e.g arm64 and x86's flush_tlb_kernel_range().
>> However, arm32 seems lacking the ability of full TLB flash. If agreed, I
>> can make a draft patch to do the flush for direct map and VA seperately,
>> see if it works.
>
> The question IMHO is not so much whether there's a full-TLB flush
> available, but whether it is appropriate to use it. If we're only
> wanting to flush a small number of TLB entries but over a sparse
> range (which seems to be Thomas' situation), does it make any sense
> to flush all TLB entries? I don't think it does, but it depends
> how often this occurs. If we're doing it on a regular basis because
> of some workload, then that workload suffers. If it's a rare event
> then maybe that's okay to do.
It does not matter whether it's rare or not.
The scenario which made us look is that CPU0 is housekeeping and CPU1 is
isolated for RT.
Now CPU0 does that flush nonsense and the RT workload on CPU1 suffers
because the compute time is suddenly factor 2-3 larger, IOW, it misses
the deadline. That means a one off event is already a problem.
Thanks,
tglx
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Excessive TLB flush ranges
2023-05-16 8:54 ` Thomas Gleixner
@ 2023-05-16 9:48 ` Baoquan He
0 siblings, 0 replies; 75+ messages in thread
From: Baoquan He @ 2023-05-16 9:48 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Uladzislau Rezki, Andrew Morton, linux-mm, Christoph Hellwig,
Lorenzo Stoakes, Peter Zijlstra, John Ogness, linux-arm-kernel,
Russell King, Mark Rutland, Marc Zyngier
On 05/16/23 at 10:54am, Thomas Gleixner wrote:
> On Tue, May 16 2023 at 16:07, Baoquan He wrote:
> > On 05/16/23 at 08:40am, Thomas Gleixner wrote:
> >> On Tue, May 16 2023 at 10:26, Baoquan He wrote:
> >> > On 05/15/23 at 08:17pm, Uladzislau Rezki wrote:
> >> >> For systems which lack a full TLB flush and to flush a long range is
> >> >> a problem(it takes time), probably we can flush VA one by one. Because
> >> >> currently we calculate a flush range [min:max] and that range includes
> >> >> the space that might not be mapped at all. Like below:
> >> >
> >> > It's fine if we only calculate a flush range of [min:max] with VA. In
> >> > vm_reset_perms(), it calculates the flush range with the impacted direct
> >> > mapping range, then merge it with VA's range. That looks really strange
> >> > and surprising. If the vm->pages[] are got from a lower part of physical
> >> > memory, the final merged flush will span tremendous range. Wondering why
> >> > we need merge the direct map range with VA range, then do flush. Not
> >> > sure if I misunderstand it.
> >>
> >> So what happens on this BPF teardown is:
> >>
> >> The vfree(8k) ends up flushing 3 entries. The actual vmalloc part (2) and
> >> one extra which is in the direct map. I haven't verified that yet, but I
> >> assume it's the alias of one of the vmalloc'ed pages.
> >
> > It looks like the reason. As Uladzislau pointed out, ARCH-es may
> > have full TLB flush, so won't get trouble from the merged flush
> > in the calculated [min:max] way, e.g arm64 and x86's flush_tlb_kernel_range().
> > However, arm32 seems lacking the ability of full TLB flash.
>
> ARM has a full flush, but it does not check for that in
> flush_tlb_kernel_range().
>
> > If agreed, I can make a draft patch to do the flush for direct map and
> > VA seperately, see if it works.
>
> Of course it works. Already done that.
>
> But you are missing the point. Look at the examples I provided.
>
> The current implementation ends up doing a full flush on x86 just to
> flush 3 TLB entries. For the very same reason because the flush range
> (start..end) becomes insanely large due to the direct map and vmalloc
> parts.
>
> But doing indivudual flushes for direct map and vmalloc space is silly
> too because then it ends up doing two IPIs instead of one. IPIs are
> expensive and the whole point of coalescing the flushes is to spare
> IPIs, no?
>
> So with my hacked up flush_tlb_kernel_vas() I end up having exactly
> _one_ IPI which walks the list and flushes the 3 TLB entries.
Makes sense, thanks for telling. While your handling about alias_va may
not be right. I will add inline comment in your patch, please check
there.
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Excessive TLB flush ranges
2023-05-16 9:03 ` Thomas Gleixner
@ 2023-05-16 10:05 ` Baoquan He
2023-05-16 14:21 ` Thomas Gleixner
2023-05-19 13:49 ` Excessive TLB flush ranges Thomas Gleixner
1 sibling, 1 reply; 75+ messages in thread
From: Baoquan He @ 2023-05-16 10:05 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Russell King (Oracle), Andrew Morton, linux-mm, Christoph Hellwig,
Uladzislau Rezki, Lorenzo Stoakes, Peter Zijlstra, John Ogness,
linux-arm-kernel, Mark Rutland, Marc Zyngier, x86
On 05/16/23 at 11:03am, Thomas Gleixner wrote:
......
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -1724,7 +1724,8 @@ static void purge_fragmented_blocks_allc
> */
> static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
> {
> - unsigned long resched_threshold;
> + unsigned long resched_threshold, num_entries = 0, num_alias_entries = 0;
> + struct vmap_area alias_va = { .va_start = start, .va_end = end };
Note that the start and end passed in are not only direct map which is
alias of va. It is the range which has done merging on direct map range
and all dirty range of vbq in _vm_unmap_aliases(). We may need to append
below draft code on your patch to at least flush the direct map range
separately.
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 7672c2422f0c..beaaa2f983d3 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1722,13 +1722,15 @@ static void purge_fragmented_blocks_allcpus(void);
/*
* Purges all lazily-freed vmap areas.
*/
-static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
+static bool __purge_vmap_area_lazy(struct *range)
{
unsigned long resched_threshold, num_entries = 0, num_alias_entries = 0;
- struct vmap_area alias_va = { .va_start = start, .va_end = end };
+ struct vmap_area alias_va = { .va_start = range[0].start, .va_end = range[0].end };
+ struct vmap_area dirty_va = { .va_start = range[1].start, .va_end = range[1].end };
unsigned int num_purged_areas = 0;
struct list_head local_purge_list;
struct vmap_area *va, *n_va;
+ unsigned long start = ULONG_MAX, end = 0;
lockdep_assert_held(&vmap_purge_lock);
@@ -1737,6 +1739,10 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
list_replace_init(&purge_vmap_area_list, &local_purge_list);
spin_unlock(&purge_vmap_area_lock);
+ start = alias_va.va_start;
+ end = alias_va.va_end;
+ start = min(start, dirty_va.va_start);
+ end = min(start, dirty_va.va_end);
start = min(start, list_first_entry(&local_purge_list, struct vmap_area, list)->va_start);
end = max(end, list_last_entry(&local_purge_list, struct vmap_area, list)->va_end);
@@ -1752,6 +1758,10 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
list_add(&alias_va.list, &local_purge_list);
}
+ if (dirty_va.va_end > dirty_va.va_start) {
+ num_alias_entries = (dirty_va.va_end - dirty_va.va_start) >> PAGE_SHIFT;
+ list_add(&dirty_va.list, &local_purge_list);
+ }
flush_tlb_kernel_vas(&local_purge_list, num_entries + num_alias_entries);
if (num_alias_entries)
@@ -2236,15 +2246,18 @@ static void vb_free(unsigned long addr, unsigned long size)
spin_unlock(&vb->lock);
}
-static void _vm_unmap_aliases(unsigned long start, unsigned long end, int flush)
+static void _vm_unmap_aliases(unsigned long dm_start, unsigned long dm_end, int flush)
{
int cpu;
+ struct range range[2];
+ unsigned long start = ULONG_MAX, end = 0;
if (unlikely(!vmap_initialized))
return;
might_sleep();
+ range[0] = {dm_start, dm_end};
for_each_possible_cpu(cpu) {
struct vmap_block_queue *vbq = &per_cpu(vmap_block_queue, cpu);
struct vmap_block *vb;
@@ -2269,6 +2282,7 @@ static void _vm_unmap_aliases(unsigned long start, unsigned long end, int flush)
rcu_read_unlock();
}
+ range[1] = {start, end};
mutex_lock(&vmap_purge_lock);
purge_fragmented_blocks_allcpus();
if (!__purge_vmap_area_lazy(start, end) && flush)
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply related [flat|nested] 75+ messages in thread
* Re: Excessive TLB flush ranges
2023-05-16 8:48 ` Russell King (Oracle)
@ 2023-05-16 12:09 ` Thomas Gleixner
2023-05-16 13:42 ` Uladzislau Rezki
0 siblings, 1 reply; 75+ messages in thread
From: Thomas Gleixner @ 2023-05-16 12:09 UTC (permalink / raw)
To: Russell King (Oracle)
Cc: Andrew Morton, linux-mm, Christoph Hellwig, Uladzislau Rezki,
Lorenzo Stoakes, Peter Zijlstra, Baoquan He, John Ogness,
linux-arm-kernel, Mark Rutland, Marc Zyngier, x86
On Tue, May 16 2023 at 09:48, Russell King wrote:
> On Tue, May 16, 2023 at 10:44:07AM +0200, Thomas Gleixner wrote:
>> On Tue, May 16 2023 at 09:19, Russell King wrote:
>> > On Tue, May 16, 2023 at 08:37:18AM +0200, Thomas Gleixner wrote:
>> >> void flush_tlb_kernel_vas(struct list_head *list, unsigned int num_entries):
>> >>
>> >> So that an architecture can decide whether it's worth to do walk the
>> >> entries or whether it resorts to a flush all.
>> >
>> > Is "num_entries" what an arch would want to use? How would it use that?
>> > It doesn't tell an arch whether there is a large range of many list
>> > entries, or a single entry covering a large range.
>>
>> Does it matter?
>>
>> The total number of entries to flush is what accumulates and at some
>> architecture specific threshold that becomes more expensive than a full
>> flush, independent of the range of the individual list entries, no?
>
> It depends what you mean by "num_entries" - is that the number of
> pages to be flushed in total in the range?
Yes. The sum of all va list ranges.
> If so, what does a valid "start" and "end" range passed to
> __purge_vmap_area_lazy() mean for num_entries - does that go to
> (end - start) / PAGE_SIZE, or would it still be restricted to the
> sum of that per list entry? If so, what's the point of passing in
> "start" and "end" to this function?
_vm_unmap_aliases() collects dirty ranges from per cpu vmap_block_queue
(what ever that is) and hands a start..end range to
__purge_vmap_area_lazy().
As I pointed out already, this can also end up being an excessive range
because there is no guarantee that those individual collected ranges are
consecutive. Though I have no idea how to cure that right now.
AFAICT this was done to spare flush IPIs, but the mm folks should be
able to explain that properly.
In the problematic case at hand and what I've seen from tracing so far,
e.g. for module unload this looks always the same. A small range of
direct map collected and then a bunch of vmap entries from the purge
list. But have not yet tried hard to figure out whether that direct map
collection is ever going to cover a larger range for no reason.
Thanks,
tglx
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Excessive TLB flush ranges
2023-05-16 12:09 ` Thomas Gleixner
@ 2023-05-16 13:42 ` Uladzislau Rezki
2023-05-16 14:38 ` Thomas Gleixner
0 siblings, 1 reply; 75+ messages in thread
From: Uladzislau Rezki @ 2023-05-16 13:42 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Russell King (Oracle), Andrew Morton, linux-mm, Christoph Hellwig,
Uladzislau Rezki, Lorenzo Stoakes, Peter Zijlstra, Baoquan He,
John Ogness, linux-arm-kernel, Mark Rutland, Marc Zyngier, x86
> _vm_unmap_aliases() collects dirty ranges from per cpu vmap_block_queue
> (what ever that is) and hands a start..end range to
> __purge_vmap_area_lazy().
>
> As I pointed out already, this can also end up being an excessive range
> because there is no guarantee that those individual collected ranges are
> consecutive. Though I have no idea how to cure that right now.
>
> AFAICT this was done to spare flush IPIs, but the mm folks should be
> able to explain that properly.
>
This is done to prevent generating IPIs. That is why the whole range is
calculated once and a flush occurs only once for all lazily registered VAs.
--
Uladzislau Rezki
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Excessive TLB flush ranges
2023-05-16 10:05 ` Baoquan He
@ 2023-05-16 14:21 ` Thomas Gleixner
2023-05-16 19:03 ` Thomas Gleixner
0 siblings, 1 reply; 75+ messages in thread
From: Thomas Gleixner @ 2023-05-16 14:21 UTC (permalink / raw)
To: Baoquan He
Cc: Russell King (Oracle), Andrew Morton, linux-mm, Christoph Hellwig,
Uladzislau Rezki, Lorenzo Stoakes, Peter Zijlstra, John Ogness,
linux-arm-kernel, Mark Rutland, Marc Zyngier, x86
On Tue, May 16 2023 at 18:05, Baoquan He wrote:
> On 05/16/23 at 11:03am, Thomas Gleixner wrote:
>> static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
>> {
>> - unsigned long resched_threshold;
>> + unsigned long resched_threshold, num_entries = 0, num_alias_entries = 0;
>> + struct vmap_area alias_va = { .va_start = start, .va_end = end };
>
> Note that the start and end passed in are not only direct map which is
> alias of va. It is the range which has done merging on direct map range
> and all dirty range of vbq in _vm_unmap_aliases(). We may need to append
> below draft code on your patch to at least flush the direct map range
> separately.
Indeed. Missed that one. What a maze.
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Excessive TLB flush ranges
2023-05-16 13:42 ` Uladzislau Rezki
@ 2023-05-16 14:38 ` Thomas Gleixner
2023-05-16 15:01 ` Uladzislau Rezki
2023-05-16 17:56 ` Nadav Amit
0 siblings, 2 replies; 75+ messages in thread
From: Thomas Gleixner @ 2023-05-16 14:38 UTC (permalink / raw)
To: Uladzislau Rezki
Cc: Russell King (Oracle), Andrew Morton, linux-mm, Christoph Hellwig,
Uladzislau Rezki, Lorenzo Stoakes, Peter Zijlstra, Baoquan He,
John Ogness, linux-arm-kernel, Mark Rutland, Marc Zyngier, x86
On Tue, May 16 2023 at 15:42, Uladzislau Rezki wrote:
>> _vm_unmap_aliases() collects dirty ranges from per cpu vmap_block_queue
>> (what ever that is) and hands a start..end range to
>> __purge_vmap_area_lazy().
>>
>> As I pointed out already, this can also end up being an excessive range
>> because there is no guarantee that those individual collected ranges are
>> consecutive. Though I have no idea how to cure that right now.
>>
>> AFAICT this was done to spare flush IPIs, but the mm folks should be
>> able to explain that properly.
>>
> This is done to prevent generating IPIs. That is why the whole range is
> calculated once and a flush occurs only once for all lazily registered VAs.
Sure, but you pretty much enforced flush_tlb_all() by doing that, which
is not even close to correct.
This range calculation is only correct when the resulting coalesced
range is consecutive, but if the resulting coalesced range is huge with
large holes and only a few pages to flush, then it's actively wrong.
The architecture has zero chance to decide whether it wants to flush
single entries or all in one go.
There is a world outside of x86, but even on x86 it's borderline silly
to take the whole TLB out when you can flush 3 TLB entries one by one
with exactly the same number of IPIs, i.e. _one_. No?
Thanks,
tglx
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Excessive TLB flush ranges
2023-05-16 14:38 ` Thomas Gleixner
@ 2023-05-16 15:01 ` Uladzislau Rezki
2023-05-16 17:04 ` Thomas Gleixner
2023-05-16 17:56 ` Nadav Amit
1 sibling, 1 reply; 75+ messages in thread
From: Uladzislau Rezki @ 2023-05-16 15:01 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Uladzislau Rezki, Russell King (Oracle), Andrew Morton, linux-mm,
Christoph Hellwig, Lorenzo Stoakes, Peter Zijlstra, Baoquan He,
John Ogness, linux-arm-kernel, Mark Rutland, Marc Zyngier, x86
On Tue, May 16, 2023 at 04:38:58PM +0200, Thomas Gleixner wrote:
> On Tue, May 16 2023 at 15:42, Uladzislau Rezki wrote:
> >> _vm_unmap_aliases() collects dirty ranges from per cpu vmap_block_queue
> >> (what ever that is) and hands a start..end range to
> >> __purge_vmap_area_lazy().
> >>
> >> As I pointed out already, this can also end up being an excessive range
> >> because there is no guarantee that those individual collected ranges are
> >> consecutive. Though I have no idea how to cure that right now.
> >>
> >> AFAICT this was done to spare flush IPIs, but the mm folks should be
> >> able to explain that properly.
> >>
> > This is done to prevent generating IPIs. That is why the whole range is
> > calculated once and a flush occurs only once for all lazily registered VAs.
>
> Sure, but you pretty much enforced flush_tlb_all() by doing that, which
> is not even close to correct.
>
> This range calculation is only correct when the resulting coalesced
> range is consecutive, but if the resulting coalesced range is huge with
> large holes and only a few pages to flush, then it's actively wrong.
>
> The architecture has zero chance to decide whether it wants to flush
> single entries or all in one go.
>
Id depends what is a corner case what is not. Usually all allocations are
done sequentially. From the other hand it is not always true. A good
example is a module loading/unloading(it has a special place in vmap space).
In this scenario we are quite far in vmap space from for example VMALLOC_START
point. So it will require a flush_tlb_all, yes.
>
> There is a world outside of x86, but even on x86 it's borderline silly
> to take the whole TLB out when you can flush 3 TLB entries one by one
> with exactly the same number of IPIs, i.e. _one_. No?
>
I meant if we invoke flush_tlb_kernel_range() on each VA's individual
range:
<ARM>
void flush_tlb_kernel_range(unsigned long start, unsigned long end)
{
if (tlb_ops_need_broadcast()) {
struct tlb_args ta;
ta.ta_start = start;
ta.ta_end = end;
on_each_cpu(ipi_flush_tlb_kernel_range, &ta, 1);
} else
local_flush_tlb_kernel_range(start, end);
broadcast_tlb_a15_erratum();
}
<ARM>
we should IPI and wait, no?
--
Uladzislau Rezki
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Excessive TLB flush ranges
2023-05-16 15:01 ` Uladzislau Rezki
@ 2023-05-16 17:04 ` Thomas Gleixner
2023-05-17 11:26 ` Uladzislau Rezki
0 siblings, 1 reply; 75+ messages in thread
From: Thomas Gleixner @ 2023-05-16 17:04 UTC (permalink / raw)
To: Uladzislau Rezki
Cc: Uladzislau Rezki, Russell King (Oracle), Andrew Morton, linux-mm,
Christoph Hellwig, Lorenzo Stoakes, Peter Zijlstra, Baoquan He,
John Ogness, linux-arm-kernel, Mark Rutland, Marc Zyngier, x86
On Tue, May 16 2023 at 17:01, Uladzislau Rezki wrote:
> On Tue, May 16, 2023 at 04:38:58PM +0200, Thomas Gleixner wrote:
>> There is a world outside of x86, but even on x86 it's borderline silly
>> to take the whole TLB out when you can flush 3 TLB entries one by one
>> with exactly the same number of IPIs, i.e. _one_. No?
>>
> I meant if we invoke flush_tlb_kernel_range() on each VA's individual
> range:
>
> <ARM>
> void flush_tlb_kernel_range(unsigned long start, unsigned long end)
> {
> if (tlb_ops_need_broadcast()) {
> struct tlb_args ta;
> ta.ta_start = start;
> ta.ta_end = end;
> on_each_cpu(ipi_flush_tlb_kernel_range, &ta, 1);
> } else
> local_flush_tlb_kernel_range(start, end);
> broadcast_tlb_a15_erratum();
> }
> <ARM>
>
> we should IPI and wait, no?
The else clause does not do an IPI, but that's irrelevant.
The proposed flush_tlb_kernel_vas(list, num_pages) mechanism
achieves:
1) It batches multiple ranges to _one_ invocation
2) It lets the architecture decide based on the number of pages
whether it does a tlb_flush_all() or a flush of individual ranges.
Whether the architecture uses IPIs or flushes only locally and the
hardware propagates that is completely irrelevant.
Right now any coalesced range, which is huge due to massive holes, takes
decision #2 away.
If you want to flush individual VAs from the core vmalloc code then you
lose #1, as the aggregated number of pages might justify a tlb_flush_all().
That's a pure architecture decision and all the core code needs to do is
to provide appropriate information and not some completely bogus request
to flush 17312759359 pages, i.e. a ~64.5 TB range, while in reality
there are exactly _three_ distinct pages to flush.
Thanks,
tglx
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Excessive TLB flush ranges
2023-05-16 14:38 ` Thomas Gleixner
2023-05-16 15:01 ` Uladzislau Rezki
@ 2023-05-16 17:56 ` Nadav Amit
2023-05-16 19:32 ` Thomas Gleixner
1 sibling, 1 reply; 75+ messages in thread
From: Nadav Amit @ 2023-05-16 17:56 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Uladzislau Rezki, Russell King (Oracle), Andrew Morton, linux-mm,
Christoph Hellwig, Lorenzo Stoakes, Peter Zijlstra, Baoquan He,
John Ogness, linux-arm-kernel, Mark Rutland, Marc Zyngier, x86
> On May 16, 2023, at 7:38 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>
> There is a world outside of x86, but even on x86 it's borderline silly
> to take the whole TLB out when you can flush 3 TLB entries one by one
> with exactly the same number of IPIs, i.e. _one_. No?
I just want to re-raise points that were made in the past, including in
the discussion that I sent before and match my experience.
Feel free to reject them, but I think you should not ignore them.
In a nutshell, there is a tradeoff which is non-trivial. Controlling
the exact ranges that need to be flushed might require, especially in
IPI-based TLB invalidation systems, additional logic and more cache
lines that need to traverse between the caches.
The latter - the cache-lines that hold the ranges that need to be
flushed - are the main issue. They might induce overhead that negates
the benefits if in most cases it turns out that many pages are
flushed. Data structures such as linked-lists might therefore not be
suitable to hold the ranges that need to be flushed, as they are not
cache-friendly. The data that is transferred between the cores to
indicate which ranges should be flushed would ideally be cache line
aligned and fit into a single cache-line.
It is possible that for kernel ranges, where the stride is always a
base-page size (4KB on x86) you might come with more condense way
of communicating TLB flushing ranges of kernel pages than userspace
pages. Perhaps the workload characteristics are different. But it
should be noticed that major parts of the rationale behind the
changes that you suggest could also apply to TLB invalidations of
userspace mapping, as done in tlb_gather and UBC mechanisms. But in
those cases the rationale, at least for x86, was that since the CPU
knows to do TLB refills very efficiently, the extra complexity and
overheads are likely not to worth the trouble.
I hope my feedback is useful. Here is again a link to a discussion
from 2015 about this subject:
https://lore.kernel.org/all/CA+55aFwVUkdaf0_rBk7uJHQjWXu+OcLTHc6FKuCn0Cb2Kvg9NA@mail.gmail.com/
There are several patches that showed the benefit of reducing
cache contention during TLB shootdown. Here is one for example:
https://lore.kernel.org/all/20190423065706.15430-1-namit@vmware.com/
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Excessive TLB flush ranges
2023-05-16 14:21 ` Thomas Gleixner
@ 2023-05-16 19:03 ` Thomas Gleixner
2023-05-17 9:38 ` Thomas Gleixner
` (3 more replies)
0 siblings, 4 replies; 75+ messages in thread
From: Thomas Gleixner @ 2023-05-16 19:03 UTC (permalink / raw)
To: Baoquan He
Cc: Russell King (Oracle), Andrew Morton, linux-mm, Christoph Hellwig,
Uladzislau Rezki, Lorenzo Stoakes, Peter Zijlstra, John Ogness,
linux-arm-kernel, Mark Rutland, Marc Zyngier, x86
On Tue, May 16 2023 at 16:21, Thomas Gleixner wrote:
> On Tue, May 16 2023 at 18:05, Baoquan He wrote:
>> On 05/16/23 at 11:03am, Thomas Gleixner wrote:
>>> static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
>>> {
>>> - unsigned long resched_threshold;
>>> + unsigned long resched_threshold, num_entries = 0, num_alias_entries = 0;
>>> + struct vmap_area alias_va = { .va_start = start, .va_end = end };
>>
>> Note that the start and end passed in are not only direct map which is
>> alias of va. It is the range which has done merging on direct map range
>> and all dirty range of vbq in _vm_unmap_aliases(). We may need to append
>> below draft code on your patch to at least flush the direct map range
>> separately.
>
> Indeed. Missed that one. What a maze.
Staring more at this vbq handling in _vm_unmap_aliases():
It iterates over all possible CPUs and accounts a range when
if (vb->dirty && vb->dirty != VMAP_BBMAP_BITS)
so that it gets flushed, but it does not modify that data at all, so a
concurrent invocation on a different CPU will see the same ranges and
flush them too, right?
Now after this loop _vm_unmap_aliases() invokes
purge_fragmented_blocks_allcpus(), but this time under vmap_purge_lock.
purge_fragmented_blocks_allcpus() iterates over all possible CPUs once
again iterate the same per CPU queues in order to potentially free
blocks.
I'm probably missing something, but can't this be _one_ loop which
handles all of this under vmap_purge_lock?
Aside of that, if I read the code correctly then if there is an unmap
via vb_free() which does not cover the whole vmap block then vb->dirty
is set and every _vm_unmap_aliases() invocation flushes that dirty range
over and over until that vmap block is completely freed, no?
Thanks,
tglx
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Excessive TLB flush ranges
2023-05-16 17:56 ` Nadav Amit
@ 2023-05-16 19:32 ` Thomas Gleixner
2023-05-17 0:23 ` Thomas Gleixner
0 siblings, 1 reply; 75+ messages in thread
From: Thomas Gleixner @ 2023-05-16 19:32 UTC (permalink / raw)
To: Nadav Amit
Cc: Uladzislau Rezki, Russell King (Oracle), Andrew Morton, linux-mm,
Christoph Hellwig, Lorenzo Stoakes, Peter Zijlstra, Baoquan He,
John Ogness, linux-arm-kernel, Mark Rutland, Marc Zyngier, x86
On Tue, May 16 2023 at 10:56, Nadav Amit wrote:
>> On May 16, 2023, at 7:38 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>>
>> There is a world outside of x86, but even on x86 it's borderline silly
>> to take the whole TLB out when you can flush 3 TLB entries one by one
>> with exactly the same number of IPIs, i.e. _one_. No?
>
> I just want to re-raise points that were made in the past, including in
> the discussion that I sent before and match my experience.
>
> Feel free to reject them, but I think you should not ignore them.
I'm not ignoring them and I'm well aware of these issues. No need to
repeat them over and over. I'm old but not senile yet.
It might turn out that it's not the proper solution for x86, but the
generic vmalloc code as of today is written with an x86 centric view.
That actively hurts other architectures which do have different
constraints than x86. Any architecture which can do IPI-less TLB flushes
seriously wants to decide on their own whether a flush all is the better
option or not. Blindly coalescing random address ranges makes this
impossible as demonstrated.
Aside of that I just wrote the patch for x86 in the first place because
I couldn't be bothered to find and setup an ARM box to test on.
Thanks,
tglx
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Excessive TLB flush ranges
2023-05-16 19:32 ` Thomas Gleixner
@ 2023-05-17 0:23 ` Thomas Gleixner
2023-05-17 1:23 ` Nadav Amit
0 siblings, 1 reply; 75+ messages in thread
From: Thomas Gleixner @ 2023-05-17 0:23 UTC (permalink / raw)
To: Nadav Amit
Cc: Uladzislau Rezki, Russell King (Oracle), Andrew Morton, linux-mm,
Christoph Hellwig, Lorenzo Stoakes, Peter Zijlstra, Baoquan He,
John Ogness, linux-arm-kernel, Mark Rutland, Marc Zyngier, x86
On Tue, May 16 2023 at 21:32, Thomas Gleixner wrote:
> On Tue, May 16 2023 at 10:56, Nadav Amit wrote:
>>> On May 16, 2023, at 7:38 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>>>
>>> There is a world outside of x86, but even on x86 it's borderline silly
>>> to take the whole TLB out when you can flush 3 TLB entries one by one
>>> with exactly the same number of IPIs, i.e. _one_. No?
>>
>> I just want to re-raise points that were made in the past, including in
>> the discussion that I sent before and match my experience.
>>
>> Feel free to reject them, but I think you should not ignore them.
>
> I'm not ignoring them and I'm well aware of these issues. No need to
> repeat them over and over. I'm old but not senile yet.
Just to be clear. This works the other way round too.
It makes a whole lot of a difference whether you do 5 IPIs in a row
which all need to get a cache line updated or if you have _one_ which
needs a couple of cache lines updated.
INVLPG is not serializing so the CPU can pull in the next required cache
line(s) on the VA list during that. These cache lines are _not_
contended at that point because _all_ of these data structures are not
longer globally accessible (mis-speculation aside) and therefore not
exclusive (misalignment aside, but you have to prove that this is an
issue).
So just dismissing this on 10 years old experience is not really
helpful, though I'm happy to confirm your points once I had the time and
opportunity to actually run real testing over it, unless you beat me to
it.
What I can confirm is that it solves a real world problem on !x86
machines for the pathological case at hand
On the affected contemporary ARM32 machine, which does not require
IPIs, the selective flush is way better than:
- the silly 1.G range one page by one flush (which is silly on its
own as there is no range check)
- a full tlb flush just for 3 pages, which is the same on x86 albeit
the flush range is ~64GB there.
The point is that the generic vmalloc code is making assumptions which
are x86 centric on not even necessarily true on x86.
Whether or not this is benefitial on x86 that's a completey separate
debate.
There is also a debate required whether a wholesale "flush on _ALL_
CPUs' is justified when some of those CPUs are completely isolated and
have absolutely no chance to be affected by that. This process bound
seccomp/BPF muck clearly does not justify to kick isolated CPUs out of
their computation in user space just because...
Thanks,
tglx
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Excessive TLB flush ranges
2023-05-17 0:23 ` Thomas Gleixner
@ 2023-05-17 1:23 ` Nadav Amit
2023-05-17 10:31 ` Thomas Gleixner
2023-05-17 12:12 ` Russell King (Oracle)
0 siblings, 2 replies; 75+ messages in thread
From: Nadav Amit @ 2023-05-17 1:23 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Uladzislau Rezki, Russell King (Oracle), Andrew Morton, linux-mm,
Christoph Hellwig, Lorenzo Stoakes, Peter Zijlstra, Baoquan He,
John Ogness, linux-arm-kernel, Mark Rutland, Marc Zyngier, x86
> On May 16, 2023, at 5:23 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Tue, May 16 2023 at 21:32, Thomas Gleixner wrote:
>> On Tue, May 16 2023 at 10:56, Nadav Amit wrote:
>>>> On May 16, 2023, at 7:38 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>>>>
>>>> There is a world outside of x86, but even on x86 it's borderline silly
>>>> to take the whole TLB out when you can flush 3 TLB entries one by one
>>>> with exactly the same number of IPIs, i.e. _one_. No?
>>>
>>> I just want to re-raise points that were made in the past, including in
>>> the discussion that I sent before and match my experience.
>>>
>>> Feel free to reject them, but I think you should not ignore them.
>>
>> I'm not ignoring them and I'm well aware of these issues. No need to
>> repeat them over and over. I'm old but not senile yet.
Thomas, no disrespect was intended. I initially just sent the link and I
had a sense (based on my past experience) that nobody clicked on it.
>
> Just to be clear. This works the other way round too.
>
> It makes a whole lot of a difference whether you do 5 IPIs in a row
> which all need to get a cache line updated or if you have _one_ which
> needs a couple of cache lines updated.
Obviously, if the question is 5 IPIs or 1 IPI with more flushing data,
the 1 IPI wins. The question I was focusing on is whether 1 IPI with
potentially global flush or detailed list of ranges to flush.
>
> INVLPG is not serializing so the CPU can pull in the next required cache
> line(s) on the VA list during that.
Indeed, but ChatGPT says (yes, I see you making fun of me already):
“however, this doesn't mean INVLPG has no impact on the pipeline. INVLPG
can cause a pipeline stall because the TLB entry invalidation must be
completed before subsequent instructions that might rely on the TLB can
be executed correctly.”
So I am not sure that your claim is exactly correct.
> These cache lines are _not_
> contended at that point because _all_ of these data structures are not
> longer globally accessible (mis-speculation aside) and therefore not
> exclusive (misalignment aside, but you have to prove that this is an
> issue).
This is not entirely true. Indeed whether you have 1 remote core or N
remote core is not a whole issue (putting aside NUMA). But you will get
first a snoop to the initiator cache by the responding core, and then,
after the TLB invalidation is completed, an RFO by the initiator once
it writes to the cache again. If the invalidation data is on the stack
(as you did), this is even more likely to happen shortly after.
>
> So just dismissing this on 10 years old experience is not really
> helpful, though I'm happy to confirm your points once I had the time and
> opportunity to actually run real testing over it, unless you beat me to
> it.
I really don’t know what “dismissing” you are talking about. I do have
relatively recent experience with the overhead of caching effects on
TLB shootdown time. It can become very apparent. You can find some
numbers in, for instance, the patch of mine I quoted in my previous
email.
There are additional opportunities to reduce the caching effect for
x86, such as combining the SMP-code metadata with the TLB-invalidation
metadata (which is out of the scope) that I saw having performance
benefit. That’s all to say that caching effect is not something to
be considered obsolete.
>
> What I can confirm is that it solves a real world problem on !x86
> machines for the pathological case at hand
>
> On the affected contemporary ARM32 machine, which does not require
> IPIs, the selective flush is way better than:
>
> - the silly 1.G range one page by one flush (which is silly on its
> own as there is no range check)
>
> - a full tlb flush just for 3 pages, which is the same on x86 albeit
> the flush range is ~64GB there.
>
> The point is that the generic vmalloc code is making assumptions which
> are x86 centric on not even necessarily true on x86.
>
> Whether or not this is benefitial on x86 that's a completey separate
> debate.
I fully understand that if you reduce multiple TLB shootdowns (IPI-wise)
into 1, it is (pretty much) all benefit and there is no tradeoff. I was
focusing on the question of whether it is beneficial also to do precise
TLB flushing, and the tradeoff there is less clear (especially that the
kernel uses 2MB pages).
My experience with non-IPI based TLB invalidations is more limited. IIUC
the usage model is that the TLB shootdowns should be invoked ASAP
(perhaps each range can be batched, but there is no sense of batching
multiple ranges), and then later you would issue some barrier to ensure
prior TLB shootdown invocations have been completed.
If that is the (use) case, I am not sure the abstraction you used in
your prototype is the best one.
> There is also a debate required whether a wholesale "flush on _ALL_
> CPUs' is justified when some of those CPUs are completely isolated and
> have absolutely no chance to be affected by that. This process bound
> seccomp/BPF muck clearly does not justify to kick isolated CPUs out of
> their computation in user space just because…
I hope you would excuse my ignorance (I am sure you won’t), but isn’t
the seccomp/BPF VMAP ranges are mapped on all processes (considering
PTI of course)? Are you suggesting you want a per-process kernel
address space? (which can make senes, I guess)
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Excessive TLB flush ranges
2023-05-16 19:03 ` Thomas Gleixner
@ 2023-05-17 9:38 ` Thomas Gleixner
2023-05-17 10:52 ` Baoquan He
2023-05-19 12:01 ` [RFC PATCH 1/3] mm/vmalloc.c: try to flush vmap_area one by one Baoquan He
` (2 subsequent siblings)
3 siblings, 1 reply; 75+ messages in thread
From: Thomas Gleixner @ 2023-05-17 9:38 UTC (permalink / raw)
To: Baoquan He
Cc: Russell King (Oracle), Andrew Morton, linux-mm, Christoph Hellwig,
Uladzislau Rezki, Lorenzo Stoakes, Peter Zijlstra, John Ogness,
linux-arm-kernel, Mark Rutland, Marc Zyngier, x86
On Tue, May 16 2023 at 21:03, Thomas Gleixner wrote:
>
> Aside of that, if I read the code correctly then if there is an unmap
> via vb_free() which does not cover the whole vmap block then vb->dirty
> is set and every _vm_unmap_aliases() invocation flushes that dirty range
> over and over until that vmap block is completely freed, no?
Something like the below would cure that.
While it prevents that this is flushed forever it does not cure the
eventually overly broad flush when the block is completely dirty and
purged:
Assume a block with 1024 pages, where 1022 pages are already freed and
TLB flushed. Now the last 2 pages are freed and the block is purged,
which results in a flush of 1024 pages where 1022 are already done,
right?
Thanks,
tglx
---
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2211,7 +2211,7 @@ static void vb_free(unsigned long addr,
spin_lock(&vb->lock);
- /* Expand dirty range */
+ /* Expand the not yet TLB flushed dirty range */
vb->dirty_min = min(vb->dirty_min, offset);
vb->dirty_max = max(vb->dirty_max, offset + (1UL << order));
@@ -2240,13 +2240,17 @@ static void _vm_unmap_aliases(unsigned l
rcu_read_lock();
list_for_each_entry_rcu(vb, &vbq->free, free_list) {
spin_lock(&vb->lock);
- if (vb->dirty && vb->dirty != VMAP_BBMAP_BITS) {
+ if (vb->dirty_max && vb->dirty != VMAP_BBMAP_BITS) {
unsigned long va_start = vb->va->va_start;
unsigned long s, e;
s = va_start + (vb->dirty_min << PAGE_SHIFT);
e = va_start + (vb->dirty_max << PAGE_SHIFT);
+ /* Prevent that this is flushed more than once */
+ vb->dirty_min = VMAP_BBMAP_BITS;
+ vb->dirty_max = 0;
+
start = min(s, start);
end = max(e, end);
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Excessive TLB flush ranges
2023-05-17 1:23 ` Nadav Amit
@ 2023-05-17 10:31 ` Thomas Gleixner
2023-05-17 11:47 ` Thomas Gleixner
` (2 more replies)
2023-05-17 12:12 ` Russell King (Oracle)
1 sibling, 3 replies; 75+ messages in thread
From: Thomas Gleixner @ 2023-05-17 10:31 UTC (permalink / raw)
To: Nadav Amit
Cc: Uladzislau Rezki, Russell King (Oracle), Andrew Morton, linux-mm,
Christoph Hellwig, Lorenzo Stoakes, Peter Zijlstra, Baoquan He,
John Ogness, linux-arm-kernel, Mark Rutland, Marc Zyngier, x86
Nadav!
On Tue, May 16 2023 at 18:23, Nadav Amit wrote:
>> On May 16, 2023, at 5:23 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>>> I'm not ignoring them and I'm well aware of these issues. No need to
>>> repeat them over and over. I'm old but not senile yet.
>
> Thomas, no disrespect was intended. I initially just sent the link and I
> had a sense (based on my past experience) that nobody clicked on it.
All good.
>> It makes a whole lot of a difference whether you do 5 IPIs in a row
>> which all need to get a cache line updated or if you have _one_ which
>> needs a couple of cache lines updated.
>
> Obviously, if the question is 5 IPIs or 1 IPI with more flushing data,
> the 1 IPI wins. The question I was focusing on is whether 1 IPI with
> potentially global flush or detailed list of ranges to flush.
Correct and there is obviously a tradeoff too, which has yet to be
determined.
>> INVLPG is not serializing so the CPU can pull in the next required cache
>> line(s) on the VA list during that.
>
> Indeed, but ChatGPT says (yes, I see you making fun of me already):
> “however, this doesn't mean INVLPG has no impact on the pipeline. INVLPG
> can cause a pipeline stall because the TLB entry invalidation must be
> completed before subsequent instructions that might rely on the TLB can
> be executed correctly.”
>
> So I am not sure that your claim is exactly correct.
Key is a subsequent instruction which might depend on the to be flushed
TLB entry. That's obvious, but I'm having a hard time to construct that
dependent intruction in this case.
>> These cache lines are _not_
>> contended at that point because _all_ of these data structures are not
>> longer globally accessible (mis-speculation aside) and therefore not
>> exclusive (misalignment aside, but you have to prove that this is an
>> issue).
>
> This is not entirely true. Indeed whether you have 1 remote core or N
> remote core is not a whole issue (putting aside NUMA). But you will get
> first a snoop to the initiator cache by the responding core, and then,
> after the TLB invalidation is completed, an RFO by the initiator once
> it writes to the cache again. If the invalidation data is on the stack
> (as you did), this is even more likely to happen shortly after.
That's correct and there might be smarter ways to handle that list muck.
>> So just dismissing this on 10 years old experience is not really
>> helpful, though I'm happy to confirm your points once I had the time and
>> opportunity to actually run real testing over it, unless you beat me to
>> it.
>
> I really don’t know what “dismissing” you are talking about.
Sorry, I was overreacting due to increased grumpiness.
> I do have relatively recent experience with the overhead of caching
> effects on TLB shootdown time. It can become very apparent. You can
> find some numbers in, for instance, the patch of mine I quoted in my
> previous email.
>
> There are additional opportunities to reduce the caching effect for
> x86, such as combining the SMP-code metadata with the TLB-invalidation
> metadata (which is out of the scope) that I saw having performance
> benefit. That’s all to say that caching effect is not something to
> be considered obsolete.
I never claimed that it does not matter. That's surely part of a
decision making to investigate that.
>> The point is that the generic vmalloc code is making assumptions which
>> are x86 centric on not even necessarily true on x86.
>>
>> Whether or not this is benefitial on x86 that's a completey separate
>> debate.
>
> I fully understand that if you reduce multiple TLB shootdowns (IPI-wise)
> into 1, it is (pretty much) all benefit and there is no tradeoff. I was
> focusing on the question of whether it is beneficial also to do precise
> TLB flushing, and the tradeoff there is less clear (especially that the
> kernel uses 2MB pages).
For the vmalloc() area mappings? Not really.
> My experience with non-IPI based TLB invalidations is more limited. IIUC
> the usage model is that the TLB shootdowns should be invoked ASAP
> (perhaps each range can be batched, but there is no sense of batching
> multiple ranges), and then later you would issue some barrier to ensure
> prior TLB shootdown invocations have been completed.
>
> If that is the (use) case, I am not sure the abstraction you used in
> your prototype is the best one.
The way how arm/arm64 implement that in software is:
magic_barrier1();
flush_range_with_magic_opcodes();
magic_barrier2();
And for that use case having the list with individual ranges is not
really wrong.
Maybe ARM[64] could do this smarter, but that would require to rewrite a
lot of code I assume.
>> There is also a debate required whether a wholesale "flush on _ALL_
>> CPUs' is justified when some of those CPUs are completely isolated and
>> have absolutely no chance to be affected by that. This process bound
>> seccomp/BPF muck clearly does not justify to kick isolated CPUs out of
>> their computation in user space just because…
>
> I hope you would excuse my ignorance (I am sure you won’t), but isn’t
> the seccomp/BPF VMAP ranges are mapped on all processes (considering
> PTI of course)? Are you suggesting you want a per-process kernel
> address space? (which can make senes, I guess)
Right. The BPF muck is mapped in the global kernel space, but e.g. the
seccomp filters are individual per process. At least that's how I
understand it, but I might be completely wrong.
Thanks,
tglx
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Excessive TLB flush ranges
2023-05-17 9:38 ` Thomas Gleixner
@ 2023-05-17 10:52 ` Baoquan He
2023-05-19 11:22 ` Thomas Gleixner
0 siblings, 1 reply; 75+ messages in thread
From: Baoquan He @ 2023-05-17 10:52 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Russell King (Oracle), Andrew Morton, linux-mm, Christoph Hellwig,
Uladzislau Rezki, Lorenzo Stoakes, Peter Zijlstra, John Ogness,
linux-arm-kernel, Mark Rutland, Marc Zyngier, x86
On 05/17/23 at 11:38am, Thomas Gleixner wrote:
> On Tue, May 16 2023 at 21:03, Thomas Gleixner wrote:
> >
> > Aside of that, if I read the code correctly then if there is an unmap
> > via vb_free() which does not cover the whole vmap block then vb->dirty
> > is set and every _vm_unmap_aliases() invocation flushes that dirty range
> > over and over until that vmap block is completely freed, no?
>
> Something like the below would cure that.
>
> While it prevents that this is flushed forever it does not cure the
> eventually overly broad flush when the block is completely dirty and
> purged:
>
> Assume a block with 1024 pages, where 1022 pages are already freed and
> TLB flushed. Now the last 2 pages are freed and the block is purged,
> which results in a flush of 1024 pages where 1022 are already done,
> right?
This is good idea, I am thinking how to reply to your last mail and how
to fix this. While your cure code may not work well. Please see below
inline comment.
One vmap block has 64 pages.
#define VMAP_MAX_ALLOC BITS_PER_LONG /* 256K with 4K pages */
>
> Thanks,
>
> tglx
> ---
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2211,7 +2211,7 @@ static void vb_free(unsigned long addr,
>
> spin_lock(&vb->lock);
>
> - /* Expand dirty range */
> + /* Expand the not yet TLB flushed dirty range */
> vb->dirty_min = min(vb->dirty_min, offset);
> vb->dirty_max = max(vb->dirty_max, offset + (1UL << order));
>
> @@ -2240,13 +2240,17 @@ static void _vm_unmap_aliases(unsigned l
> rcu_read_lock();
> list_for_each_entry_rcu(vb, &vbq->free, free_list) {
> spin_lock(&vb->lock);
> - if (vb->dirty && vb->dirty != VMAP_BBMAP_BITS) {
> + if (vb->dirty_max && vb->dirty != VMAP_BBMAP_BITS) {
> unsigned long va_start = vb->va->va_start;
> unsigned long s, e;
When vb_free() is invoked, it could cause three kinds of vmap_block as
below. Your code works well for the 2nd case, for the 1st one, it may be
not. And the 2nd one is the stuff that we reclaim and put into purge
list in purge_fragmented_blocks_allcpus().
1)
|-----|------------|-----------|-------|
|dirty|still mapped| dirty | free |
2)
|------------------------------|-------|
| dirty | free |
3) Handled by free_vmap_block(), and vb is put into purge list.
|--------------------------------------|
>
> s = va_start + (vb->dirty_min << PAGE_SHIFT);
> e = va_start + (vb->dirty_max << PAGE_SHIFT);
>
> + /* Prevent that this is flushed more than once */
> + vb->dirty_min = VMAP_BBMAP_BITS;
> + vb->dirty_max = 0;
> +
> start = min(s, start);
> end = max(e, end);
>
>
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Excessive TLB flush ranges
2023-05-16 17:04 ` Thomas Gleixner
@ 2023-05-17 11:26 ` Uladzislau Rezki
2023-05-17 11:58 ` Thomas Gleixner
0 siblings, 1 reply; 75+ messages in thread
From: Uladzislau Rezki @ 2023-05-17 11:26 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Uladzislau Rezki, Russell King (Oracle), Andrew Morton, linux-mm,
Christoph Hellwig, Lorenzo Stoakes, Peter Zijlstra, Baoquan He,
John Ogness, linux-arm-kernel, Mark Rutland, Marc Zyngier, x86
On Tue, May 16, 2023 at 07:04:34PM +0200, Thomas Gleixner wrote:
> On Tue, May 16 2023 at 17:01, Uladzislau Rezki wrote:
> > On Tue, May 16, 2023 at 04:38:58PM +0200, Thomas Gleixner wrote:
> >> There is a world outside of x86, but even on x86 it's borderline silly
> >> to take the whole TLB out when you can flush 3 TLB entries one by one
> >> with exactly the same number of IPIs, i.e. _one_. No?
> >>
> > I meant if we invoke flush_tlb_kernel_range() on each VA's individual
> > range:
> >
> > <ARM>
> > void flush_tlb_kernel_range(unsigned long start, unsigned long end)
> > {
> > if (tlb_ops_need_broadcast()) {
> > struct tlb_args ta;
> > ta.ta_start = start;
> > ta.ta_end = end;
> > on_each_cpu(ipi_flush_tlb_kernel_range, &ta, 1);
> > } else
> > local_flush_tlb_kernel_range(start, end);
> > broadcast_tlb_a15_erratum();
> > }
> > <ARM>
> >
> > we should IPI and wait, no?
>
> The else clause does not do an IPI, but that's irrelevant.
>
> The proposed flush_tlb_kernel_vas(list, num_pages) mechanism
> achieves:
>
> 1) It batches multiple ranges to _one_ invocation
>
> 2) It lets the architecture decide based on the number of pages
> whether it does a tlb_flush_all() or a flush of individual ranges.
>
> Whether the architecture uses IPIs or flushes only locally and the
> hardware propagates that is completely irrelevant.
>
> Right now any coalesced range, which is huge due to massive holes, takes
> decision #2 away.
>
> If you want to flush individual VAs from the core vmalloc code then you
> lose #1, as the aggregated number of pages might justify a tlb_flush_all().
>
> That's a pure architecture decision and all the core code needs to do is
> to provide appropriate information and not some completely bogus request
> to flush 17312759359 pages, i.e. a ~64.5 TB range, while in reality
> there are exactly _three_ distinct pages to flush.
>
1.
I think, all two cases(logic) should be moved into ARCH code, so a decision
is made _not_ by vmalloc code how to flush, either fully, if it supported or
page by page that require list chasing.
As for vmalloc interace, we can provide the list(we keep it short, because of
merging property) + number of pages to flush.
2.
It looks like your problem is because of
void vfree(const void *addr)
{
...
if (unlikely(vm->flags & VM_FLUSH_RESET_PERMS))
vm_reset_perms(vm); <----
...
}
so, all purged areas are drained in a caller context, so it is blocked
until the drain is done including flushing. I am not sure why it is done
from a caller context.
IMHO, it should be deferred same way as we do in:
static void free_vmap_area_noflush(struct vmap_area *va)
if do not miss the point why vfree() has to do it directly.
--
Uladzislau Rezki
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Excessive TLB flush ranges
2023-05-17 10:31 ` Thomas Gleixner
@ 2023-05-17 11:47 ` Thomas Gleixner
2023-05-17 22:41 ` Nadav Amit
2023-05-17 14:43 ` Mark Rutland
2023-05-17 22:57 ` Nadav Amit
2 siblings, 1 reply; 75+ messages in thread
From: Thomas Gleixner @ 2023-05-17 11:47 UTC (permalink / raw)
To: Nadav Amit
Cc: Uladzislau Rezki, Russell King (Oracle), Andrew Morton, linux-mm,
Christoph Hellwig, Lorenzo Stoakes, Peter Zijlstra, Baoquan He,
John Ogness, linux-arm-kernel, Mark Rutland, Marc Zyngier, x86
On Wed, May 17 2023 at 12:31, Thomas Gleixner wrote:
> On Tue, May 16 2023 at 18:23, Nadav Amit wrote:
>>> INVLPG is not serializing so the CPU can pull in the next required cache
>>> line(s) on the VA list during that.
>>
>> Indeed, but ChatGPT says (yes, I see you making fun of me already):
>> “however, this doesn't mean INVLPG has no impact on the pipeline. INVLPG
>> can cause a pipeline stall because the TLB entry invalidation must be
>> completed before subsequent instructions that might rely on the TLB can
>> be executed correctly.”
>>
>> So I am not sure that your claim is exactly correct.
>
> Key is a subsequent instruction which might depend on the to be flushed
> TLB entry. That's obvious, but I'm having a hard time to construct that
> dependent intruction in this case.
But obviously a full TLB flush _is_ guaranteed to stall the pipeline,
right?
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Excessive TLB flush ranges
2023-05-17 11:26 ` Uladzislau Rezki
@ 2023-05-17 11:58 ` Thomas Gleixner
2023-05-17 12:15 ` Uladzislau Rezki
0 siblings, 1 reply; 75+ messages in thread
From: Thomas Gleixner @ 2023-05-17 11:58 UTC (permalink / raw)
To: Uladzislau Rezki
Cc: Uladzislau Rezki, Russell King (Oracle), Andrew Morton, linux-mm,
Christoph Hellwig, Lorenzo Stoakes, Peter Zijlstra, Baoquan He,
John Ogness, linux-arm-kernel, Mark Rutland, Marc Zyngier, x86
On Wed, May 17 2023 at 13:26, Uladzislau Rezki wrote:
> On Tue, May 16, 2023 at 07:04:34PM +0200, Thomas Gleixner wrote:
>> The proposed flush_tlb_kernel_vas(list, num_pages) mechanism
>> achieves:
>>
>> 1) It batches multiple ranges to _one_ invocation
>>
>> 2) It lets the architecture decide based on the number of pages
>> whether it does a tlb_flush_all() or a flush of individual ranges.
>>
>> Whether the architecture uses IPIs or flushes only locally and the
>> hardware propagates that is completely irrelevant.
>>
>> Right now any coalesced range, which is huge due to massive holes, takes
>> decision #2 away.
>>
>> If you want to flush individual VAs from the core vmalloc code then you
>> lose #1, as the aggregated number of pages might justify a tlb_flush_all().
>>
>> That's a pure architecture decision and all the core code needs to do is
>> to provide appropriate information and not some completely bogus request
>> to flush 17312759359 pages, i.e. a ~64.5 TB range, while in reality
>> there are exactly _three_ distinct pages to flush.
>>
> 1.
>
> I think, all two cases(logic) should be moved into ARCH code, so a decision
> is made _not_ by vmalloc code how to flush, either fully, if it supported or
> page by page that require list chasing.
Which is exactly what my patch does, no?
> 2.
> void vfree(const void *addr)
> {
> ...
> if (unlikely(vm->flags & VM_FLUSH_RESET_PERMS))
> vm_reset_perms(vm); <----
> ...
> }
>
> so, all purged areas are drained in a caller context, so it is blocked
> until the drain is done including flushing. I am not sure why it is done
> from a caller context.
>
> IMHO, it should be deferred same way as we do in:
>
> static void free_vmap_area_noflush(struct vmap_area *va)
How is that avoiding the problem? It just deferres it to some point in
the future. There is no guarantee that batching will be large enough to
justify a full flush.
> if do not miss the point why vfree() has to do it directly.
Keeping executable mappings around until some other flush happens is
obviously neither a brilliant idea nor correct.
Thanks
tglx
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Excessive TLB flush ranges
2023-05-17 1:23 ` Nadav Amit
2023-05-17 10:31 ` Thomas Gleixner
@ 2023-05-17 12:12 ` Russell King (Oracle)
2023-05-17 23:14 ` Nadav Amit
1 sibling, 1 reply; 75+ messages in thread
From: Russell King (Oracle) @ 2023-05-17 12:12 UTC (permalink / raw)
To: Nadav Amit
Cc: Thomas Gleixner, Uladzislau Rezki, Andrew Morton, linux-mm,
Christoph Hellwig, Lorenzo Stoakes, Peter Zijlstra, Baoquan He,
John Ogness, linux-arm-kernel, Mark Rutland, Marc Zyngier, x86
On Tue, May 16, 2023 at 06:23:27PM -0700, Nadav Amit wrote:
> Indeed, but ChatGPT says (yes, I see you making fun of me already):
> “however, this doesn't mean INVLPG has no impact on the pipeline. INVLPG
> can cause a pipeline stall because the TLB entry invalidation must be
> completed before subsequent instructions that might rely on the TLB can
> be executed correctly.”
>
> So I am not sure that your claim is exactly correct.
Sorry, but chatgpt has no place in serious discussions. You don't know
where its getting its information from or how reliable it is, and you'll
only make yourself look silly by quoting it.
Someone recently asked ChatGPT about me. Apparently I died a few years
ago. This is news to me. However, this illustrates precisely my point
that chatgpt can spew utter rubbish that you've no idea where it's got
it from, and by quoting its output in a serious discussion, you only
make yourself look silly.
--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Excessive TLB flush ranges
2023-05-17 11:58 ` Thomas Gleixner
@ 2023-05-17 12:15 ` Uladzislau Rezki
2023-05-17 16:32 ` Thomas Gleixner
0 siblings, 1 reply; 75+ messages in thread
From: Uladzislau Rezki @ 2023-05-17 12:15 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Uladzislau Rezki, Russell King (Oracle), Andrew Morton, linux-mm,
Christoph Hellwig, Lorenzo Stoakes, Peter Zijlstra, Baoquan He,
John Ogness, linux-arm-kernel, Mark Rutland, Marc Zyngier, x86
On Wed, May 17, 2023 at 01:58:44PM +0200, Thomas Gleixner wrote:
> On Wed, May 17 2023 at 13:26, Uladzislau Rezki wrote:
> > On Tue, May 16, 2023 at 07:04:34PM +0200, Thomas Gleixner wrote:
> >> The proposed flush_tlb_kernel_vas(list, num_pages) mechanism
> >> achieves:
> >>
> >> 1) It batches multiple ranges to _one_ invocation
> >>
> >> 2) It lets the architecture decide based on the number of pages
> >> whether it does a tlb_flush_all() or a flush of individual ranges.
> >>
> >> Whether the architecture uses IPIs or flushes only locally and the
> >> hardware propagates that is completely irrelevant.
> >>
> >> Right now any coalesced range, which is huge due to massive holes, takes
> >> decision #2 away.
> >>
> >> If you want to flush individual VAs from the core vmalloc code then you
> >> lose #1, as the aggregated number of pages might justify a tlb_flush_all().
> >>
> >> That's a pure architecture decision and all the core code needs to do is
> >> to provide appropriate information and not some completely bogus request
> >> to flush 17312759359 pages, i.e. a ~64.5 TB range, while in reality
> >> there are exactly _three_ distinct pages to flush.
> >>
> > 1.
> >
> > I think, all two cases(logic) should be moved into ARCH code, so a decision
> > is made _not_ by vmalloc code how to flush, either fully, if it supported or
> > page by page that require list chasing.
>
> Which is exactly what my patch does, no?
>
> > 2.
> > void vfree(const void *addr)
> > {
> > ...
> > if (unlikely(vm->flags & VM_FLUSH_RESET_PERMS))
> > vm_reset_perms(vm); <----
> > ...
> > }
> >
> > so, all purged areas are drained in a caller context, so it is blocked
> > until the drain is done including flushing. I am not sure why it is done
> > from a caller context.
> >
> > IMHO, it should be deferred same way as we do in:
> >
> > static void free_vmap_area_noflush(struct vmap_area *va)
>
> How is that avoiding the problem? It just deferres it to some point in
> the future. There is no guarantee that batching will be large enough to
> justify a full flush.
>
> > if do not miss the point why vfree() has to do it directly.
>
> Keeping executable mappings around until some other flush happens is
> obviously neither a brilliant idea nor correct.
>
It avoids of blocking a caller on vfree() by deferring the freeing into
a workqueue context. At least i got the filling that "your task" that
does vfree() blocks for unacceptable time. It can happen only if it
performs VM_FLUSH_RESET_PERMS freeing(other freeing are deferred):
<snip>
if (unlikely(vm->flags & VM_FLUSH_RESET_PERMS))
vm_reset_perms(vm);
<snip>
in this case the vfree() can take some time instead of returning back to
a user asap. Is that your issue? I am not talking that TLB flushing takes
time, in this case holding on mutex also can take time.
--
Uladzislau Rezki
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Excessive TLB flush ranges
2023-05-17 10:31 ` Thomas Gleixner
2023-05-17 11:47 ` Thomas Gleixner
@ 2023-05-17 14:43 ` Mark Rutland
2023-05-17 16:41 ` Thomas Gleixner
2023-05-17 22:57 ` Nadav Amit
2 siblings, 1 reply; 75+ messages in thread
From: Mark Rutland @ 2023-05-17 14:43 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Nadav Amit, Uladzislau Rezki, Russell King (Oracle),
Andrew Morton, linux-mm, Christoph Hellwig, Lorenzo Stoakes,
Peter Zijlstra, Baoquan He, John Ogness, linux-arm-kernel,
Marc Zyngier, x86
On Wed, May 17, 2023 at 12:31:04PM +0200, Thomas Gleixner wrote:
> On Tue, May 16 2023 at 18:23, Nadav Amit wrote:
> >> On May 16, 2023, at 5:23 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > My experience with non-IPI based TLB invalidations is more limited. IIUC
> > the usage model is that the TLB shootdowns should be invoked ASAP
> > (perhaps each range can be batched, but there is no sense of batching
> > multiple ranges), and then later you would issue some barrier to ensure
> > prior TLB shootdown invocations have been completed.
> >
> > If that is the (use) case, I am not sure the abstraction you used in
> > your prototype is the best one.
>
> The way how arm/arm64 implement that in software is:
>
> magic_barrier1();
> flush_range_with_magic_opcodes();
> magic_barrier2();
FWIW, on arm64 that sequence (for leaf entries only) is:
/*
* Make sure prior writes to the page table entries are visible to all
* CPUs, so that *subsequent* page table walks will see the latest
* values.
*
* This is roughly __smp_wmb().
*/
dsb(ishst) // AKA magic_barrier1()
/*
* The "TLBI *IS, <addr>" instructions send a message to all other
* CPUs, essentially saying "please start invalidating entries for
* <addr>"
*
* The "TLBI *ALL*IS" instructions send a message to all other CPUs,
* essentially saying "please start invalidating all entries".
*
* In theory, this could be for discontiguous ranges.
*/
flush_range_with_magic_opcodes()
/*
* Wait for acknowledgement that all prior TLBIs have completed. This
* also ensures that all accesses using those translations have also
* completed.
*
* This waits for all relevant CPUs to acknowledge completion of any
* prior TLBIs sent by this CPU.
*/
dsb(ish) // AKA magic_barrier2()
isb()
So you can batch a bunch of "TLBI *IS, <addr>" with a single barrier for
completion, or you can use a single "TLBI *ALL*IS" to invalidate everything.
It can still be worth using the latter, as arm64 has done since commit:
05ac65305437e8ef ("arm64: fix soft lockup due to large tlb flush range")
... as for a large range, issuing a bunch of "TLBI *IS, <addr>" can take a
while, and can require the recipient CPUs to do more work than they might have
to do for a single "TLBI *ALL*IS".
The point at which invalidating everything is better depends on a number of
factors (e.g. the impact of all CPUs needing to make new page table walks), and
currently we have an arbitrary boundary where we choose to invalidate
everything (which has been tweaked a bit over time); there isn't really a
one-size-fits-all best answer.
Thanks,
Mark.
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Excessive TLB flush ranges
2023-05-17 12:15 ` Uladzislau Rezki
@ 2023-05-17 16:32 ` Thomas Gleixner
2023-05-19 10:01 ` Uladzislau Rezki
0 siblings, 1 reply; 75+ messages in thread
From: Thomas Gleixner @ 2023-05-17 16:32 UTC (permalink / raw)
To: Uladzislau Rezki
Cc: Uladzislau Rezki, Russell King (Oracle), Andrew Morton, linux-mm,
Christoph Hellwig, Lorenzo Stoakes, Peter Zijlstra, Baoquan He,
John Ogness, linux-arm-kernel, Mark Rutland, Marc Zyngier, x86
On Wed, May 17 2023 at 14:15, Uladzislau Rezki wrote:
> On Wed, May 17, 2023 at 01:58:44PM +0200, Thomas Gleixner wrote:
>> Keeping executable mappings around until some other flush happens is
>> obviously neither a brilliant idea nor correct.
>>
> It avoids of blocking a caller on vfree() by deferring the freeing into
> a workqueue context. At least i got the filling that "your task" that
> does vfree() blocks for unacceptable time. It can happen only if it
> performs VM_FLUSH_RESET_PERMS freeing(other freeing are deferred):
>
> <snip>
> if (unlikely(vm->flags & VM_FLUSH_RESET_PERMS))
> vm_reset_perms(vm);
> <snip>
>
> in this case the vfree() can take some time instead of returning back to
> a user asap. Is that your issue? I am not talking that TLB flushing takes
> time, in this case holding on mutex also can take time.
This is absolutely not the problem at all. This comes via do_exit() and
I explained already here:
https://lore.kernel.org/all/871qjg8wqe.ffs@tglx
what made us look into this and I'm happy to quote myself for your
conveniance:
"The scenario which made us look is that CPU0 is housekeeping and CPU1 is
isolated for RT.
Now CPU0 does that flush nonsense and the RT workload on CPU1 suffers
because the compute time is suddenly factor 2-3 larger, IOW, it misses
the deadline. That means a one off event is already a problem."
So it does not matter at all how long the operations on CPU0 take. The
only thing which matters is how much these operations affect the
workload on CPU1.
That made me look into this coalescing code. I understand why you want
to batch and coalesce and rather do a rare full tlb flush than sending
gazillions of IPIs.
But that creates a policy at the core code which does not leave any
decision to make for the architecture, whether it's worth to do full or
single flushes. That's what I worried about and not about the question
whether that free takes 1ms or 10us. That's a completely different
debate.
Whether that list based flush turns out to be the better solution or
not, has still to be decided by deeper analysis.
Thanks,
tglx
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Excessive TLB flush ranges
2023-05-17 14:43 ` Mark Rutland
@ 2023-05-17 16:41 ` Thomas Gleixner
0 siblings, 0 replies; 75+ messages in thread
From: Thomas Gleixner @ 2023-05-17 16:41 UTC (permalink / raw)
To: Mark Rutland
Cc: Nadav Amit, Uladzislau Rezki, Russell King (Oracle),
Andrew Morton, linux-mm, Christoph Hellwig, Lorenzo Stoakes,
Peter Zijlstra, Baoquan He, John Ogness, linux-arm-kernel,
Marc Zyngier, x86
On Wed, May 17 2023 at 15:43, Mark Rutland wrote:
> On Wed, May 17, 2023 at 12:31:04PM +0200, Thomas Gleixner wrote:
>> The way how arm/arm64 implement that in software is:
>>
>> magic_barrier1();
>> flush_range_with_magic_opcodes();
>> magic_barrier2();
>
> FWIW, on arm64 that sequence (for leaf entries only) is:
>
> /*
> * Make sure prior writes to the page table entries are visible to all
> * CPUs, so that *subsequent* page table walks will see the latest
> * values.
> *
> * This is roughly __smp_wmb().
> */
> dsb(ishst) // AKA magic_barrier1()
>
> /*
> * The "TLBI *IS, <addr>" instructions send a message to all other
> * CPUs, essentially saying "please start invalidating entries for
> * <addr>"
> *
> * The "TLBI *ALL*IS" instructions send a message to all other CPUs,
> * essentially saying "please start invalidating all entries".
> *
> * In theory, this could be for discontiguous ranges.
> */
> flush_range_with_magic_opcodes()
>
> /*
> * Wait for acknowledgement that all prior TLBIs have completed. This
> * also ensures that all accesses using those translations have also
> * completed.
> *
> * This waits for all relevant CPUs to acknowledge completion of any
> * prior TLBIs sent by this CPU.
> */
> dsb(ish) // AKA magic_barrier2()
> isb()
>
> So you can batch a bunch of "TLBI *IS, <addr>" with a single barrier for
> completion, or you can use a single "TLBI *ALL*IS" to invalidate everything.
>
> It can still be worth using the latter, as arm64 has done since commit:
>
> 05ac65305437e8ef ("arm64: fix soft lockup due to large tlb flush range")
>
> ... as for a large range, issuing a bunch of "TLBI *IS, <addr>" can take a
> while, and can require the recipient CPUs to do more work than they might have
> to do for a single "TLBI *ALL*IS".
And looking at the changelog and backtrace:
PC is at __cpu_flush_kern_tlb_range+0xc/0x40
LR is at __purge_vmap_area_lazy+0x28c/0x3ac
I'm willing to bet that this is exactly the same scenario of a direct
map + module area flush. That's the only one we found so far which
creates insanely large ranges.
The other effects of coalescing can still result in seriously oversized
flushs for just a couple of pages. The worst I've seen aside of that BPF
muck was a 'flush 2 pages' with an resulting range of ~3.8MB.
> The point at which invalidating everything is better depends on a number of
> factors (e.g. the impact of all CPUs needing to make new page table walks), and
> currently we have an arbitrary boundary where we choose to invalidate
> everything (which has been tweaked a bit over time); there isn't really a
> one-size-fits-all best answer.
I'm well aware of that :)
Thanks,
tglx
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Excessive TLB flush ranges
2023-05-17 11:47 ` Thomas Gleixner
@ 2023-05-17 22:41 ` Nadav Amit
0 siblings, 0 replies; 75+ messages in thread
From: Nadav Amit @ 2023-05-17 22:41 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Uladzislau Rezki, Russell King (Oracle), Andrew Morton, linux-mm,
Christoph Hellwig, Lorenzo Stoakes, Peter Zijlstra, Baoquan He,
John Ogness, linux-arm-kernel, Mark Rutland, Marc Zyngier, x86
> On May 17, 2023, at 4:47 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Wed, May 17 2023 at 12:31, Thomas Gleixner wrote:
>> On Tue, May 16 2023 at 18:23, Nadav Amit wrote:
>>>> INVLPG is not serializing so the CPU can pull in the next required cache
>>>> line(s) on the VA list during that.
>>>
>>> Indeed, but ChatGPT says (yes, I see you making fun of me already):
>>> “however, this doesn't mean INVLPG has no impact on the pipeline. INVLPG
>>> can cause a pipeline stall because the TLB entry invalidation must be
>>> completed before subsequent instructions that might rely on the TLB can
>>> be executed correctly.”
>>>
>>> So I am not sure that your claim is exactly correct.
>>
>> Key is a subsequent instruction which might depend on the to be flushed
>> TLB entry. That's obvious, but I'm having a hard time to construct that
>> dependent intruction in this case.
>
> But obviously a full TLB flush _is_ guaranteed to stall the pipeline,
> right?
Right. I had a discussion about it with ChatGPT but it started to say BS,
so here is my understanding of the matter.
IIUC, when you flush a TLB entry the CPU might have a problem figuring out
the translation of which memory addresses is affected. All the RAW conflict
detection mechanisms in such a case are probably useless, since even the
granularity of the invalidation (e.g., 4KB/2MB/1GB) might be unknown during
decoding. It is not that it is impossible to obtain this information, it is
just that I am doubtful the CPU architects optimized this flow.
As a result I think the pieces of code as the following ones are affected
(taken from from flush_tlb_func() ):
while (addr < f->end) {
flush_tlb_one_user(addr);
addr += 1UL << f->stride_shift;
}
flush_tlb_one_user has a memory clobber on the INVLPG. As a result f->end
and f->stride_shift would need to be reread from memory and their loading
would be stalled.
As they reside in L1 cache already, the impact is low. Having said that,
the fact that flush_tlb_one_user() flushes the “PTI” (userspace mappings)
as well does introduce small overhead relatively to the alternative of
having two separate loops to flush kernel/userspace mappings when PTI is
enabled.
Long story short, I think that prefetching the entries that you want to
flush - assuming they do not fit on a single cacheline - might be needed.
Linked list would therefore not be very friendly for something like that.
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Excessive TLB flush ranges
2023-05-17 10:31 ` Thomas Gleixner
2023-05-17 11:47 ` Thomas Gleixner
2023-05-17 14:43 ` Mark Rutland
@ 2023-05-17 22:57 ` Nadav Amit
2023-05-19 11:49 ` Thomas Gleixner
2 siblings, 1 reply; 75+ messages in thread
From: Nadav Amit @ 2023-05-17 22:57 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Uladzislau Rezki, Russell King (Oracle), Andrew Morton, linux-mm,
Christoph Hellwig, Lorenzo Stoakes, Peter Zijlstra, Baoquan He,
John Ogness, linux-arm-kernel, Mark Rutland, Marc Zyngier, x86
> On May 17, 2023, at 3:31 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>
>>> The point is that the generic vmalloc code is making assumptions which
>>> are x86 centric on not even necessarily true on x86.
>>>
>>> Whether or not this is benefitial on x86 that's a completey separate
>>> debate.
>>
>> I fully understand that if you reduce multiple TLB shootdowns (IPI-wise)
>> into 1, it is (pretty much) all benefit and there is no tradeoff. I was
>> focusing on the question of whether it is beneficial also to do precise
>> TLB flushing, and the tradeoff there is less clear (especially that the
>> kernel uses 2MB pages).
>
> For the vmalloc() area mappings? Not really.
The main penalty of doing a global flush are the innocent bystanders TLB
translations. These are likely the regular mappings, not the malloc-ones.
>
>> My experience with non-IPI based TLB invalidations is more limited. IIUC
>> the usage model is that the TLB shootdowns should be invoked ASAP
>> (perhaps each range can be batched, but there is no sense of batching
>> multiple ranges), and then later you would issue some barrier to ensure
>> prior TLB shootdown invocations have been completed.
>>
>> If that is the (use) case, I am not sure the abstraction you used in
>> your prototype is the best one.
>
> The way how arm/arm64 implement that in software is:
>
> magic_barrier1();
> flush_range_with_magic_opcodes();
> magic_barrier2();
>
> And for that use case having the list with individual ranges is not
> really wrong.
>
> Maybe ARM[64] could do this smarter, but that would require to rewrite a
> lot of code I assume.
What you say makes sense - and I actually see that flush_tlb_page_nosync()
needs a memory barrier.
I just encountered recent patches that did the flushing on ARM in an
async manner as I described. That is the reason I assumed it is more efficient.
https://lore.kernel.org/linux-mm/20230410134352.4519-3-yangyicong@huawei.com/
>
>>> There is also a debate required whether a wholesale "flush on _ALL_
>>> CPUs' is justified when some of those CPUs are completely isolated and
>>> have absolutely no chance to be affected by that. This process bound
>>> seccomp/BPF muck clearly does not justify to kick isolated CPUs out of
>>> their computation in user space just because…
>>
>> I hope you would excuse my ignorance (I am sure you won’t), but isn’t
>> the seccomp/BPF VMAP ranges are mapped on all processes (considering
>> PTI of course)? Are you suggesting you want a per-process kernel
>> address space? (which can make senes, I guess)
>
> Right. The BPF muck is mapped in the global kernel space, but e.g. the
> seccomp filters are individual per process. At least that's how I
> understand it, but I might be completely wrong.
After rehashing the seccomp man page, they are not entirely “private” for
each process, as they are maintained after fork/exec. Yet, one can imagine
that it is possible to create non-global kernel mappings that would be
mapped per-process and would hold the seccomp filters. This would remove
the need to do system-wide flushes when the process dies.
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Excessive TLB flush ranges
2023-05-17 12:12 ` Russell King (Oracle)
@ 2023-05-17 23:14 ` Nadav Amit
0 siblings, 0 replies; 75+ messages in thread
From: Nadav Amit @ 2023-05-17 23:14 UTC (permalink / raw)
To: Russell King (Oracle)
Cc: Thomas Gleixner, Uladzislau Rezki, Andrew Morton, linux-mm,
Christoph Hellwig, Lorenzo Stoakes, Peter Zijlstra, Baoquan He,
John Ogness, linux-arm-kernel, Mark Rutland, Marc Zyngier, x86
> On May 17, 2023, at 5:12 AM, Russell King (Oracle) <linux@armlinux.org.uk> wrote:
>
> On Tue, May 16, 2023 at 06:23:27PM -0700, Nadav Amit wrote:
>> Indeed, but ChatGPT says (yes, I see you making fun of me already):
>> “however, this doesn't mean INVLPG has no impact on the pipeline. INVLPG
>> can cause a pipeline stall because the TLB entry invalidation must be
>> completed before subsequent instructions that might rely on the TLB can
>> be executed correctly.”
>>
>> So I am not sure that your claim is exactly correct.
>
> Sorry, but chatgpt has no place in serious discussions. You don't know
> where its getting its information from or how reliable it is, and you'll
> only make yourself look silly by quoting it.
>
> Someone recently asked ChatGPT about me. Apparently I died a few years
> ago. This is news to me. However, this illustrates precisely my point
> that chatgpt can spew utter rubbish that you've no idea where it's got
> it from, and by quoting its output in a serious discussion, you only
> make yourself look silly.
Take it easy Russel, I am not throwing unfiltered stuff. It makes perfect
sense that you get a dependency due to INVLPG with the following load
instructions. Intel SDM does not regard something like that to explicitly
describe when such dependency can occur (flush of the same page or
any page?), but at least in some cases there is no question there is an
impact on the pipeline as otherwise correctness would be violated.
If you have a different experience or think it is nonsensical, let me
know.
Anyhow, I am happy to hear that you didn’t die. :)
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Excessive TLB flush ranges
2023-05-17 16:32 ` Thomas Gleixner
@ 2023-05-19 10:01 ` Uladzislau Rezki
2023-05-19 14:56 ` Thomas Gleixner
0 siblings, 1 reply; 75+ messages in thread
From: Uladzislau Rezki @ 2023-05-19 10:01 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Uladzislau Rezki, Russell King (Oracle), Andrew Morton, linux-mm,
Christoph Hellwig, Lorenzo Stoakes, Peter Zijlstra, Baoquan He,
John Ogness, linux-arm-kernel, Mark Rutland, Marc Zyngier, x86
On Wed, May 17, 2023 at 06:32:25PM +0200, Thomas Gleixner wrote:
> On Wed, May 17 2023 at 14:15, Uladzislau Rezki wrote:
> > On Wed, May 17, 2023 at 01:58:44PM +0200, Thomas Gleixner wrote:
> >> Keeping executable mappings around until some other flush happens is
> >> obviously neither a brilliant idea nor correct.
> >>
> > It avoids of blocking a caller on vfree() by deferring the freeing into
> > a workqueue context. At least i got the filling that "your task" that
> > does vfree() blocks for unacceptable time. It can happen only if it
> > performs VM_FLUSH_RESET_PERMS freeing(other freeing are deferred):
> >
> > <snip>
> > if (unlikely(vm->flags & VM_FLUSH_RESET_PERMS))
> > vm_reset_perms(vm);
> > <snip>
> >
> > in this case the vfree() can take some time instead of returning back to
> > a user asap. Is that your issue? I am not talking that TLB flushing takes
> > time, in this case holding on mutex also can take time.
>
> This is absolutely not the problem at all. This comes via do_exit() and
> I explained already here:
>
> https://lore.kernel.org/all/871qjg8wqe.ffs@tglx
>
> what made us look into this and I'm happy to quote myself for your
> conveniance:
>
> "The scenario which made us look is that CPU0 is housekeeping and CPU1 is
> isolated for RT.
>
> Now CPU0 does that flush nonsense and the RT workload on CPU1 suffers
> because the compute time is suddenly factor 2-3 larger, IOW, it misses
> the deadline. That means a one off event is already a problem."
>
> So it does not matter at all how long the operations on CPU0 take. The
> only thing which matters is how much these operations affect the
> workload on CPU1.
>
Thanks. I focused on your first email, where you have not mentioned your
second part, explaining that you have a housekeeping CPU and another for
RT activity.
>
> That made me look into this coalescing code. I understand why you want
> to batch and coalesce and rather do a rare full tlb flush than sending
> gazillions of IPIs.
>
Your issues has no connections with merging. But the place you looked
was correct :)
>
> But that creates a policy at the core code which does not leave any
> decision to make for the architecture, whether it's worth to do full or
> single flushes. That's what I worried about and not about the question
> whether that free takes 1ms or 10us. That's a completely different
> debate.
>
> Whether that list based flush turns out to be the better solution or
> not, has still to be decided by deeper analysis.
>
I had a look how per-VA TLB flushing behaves on x86_64 under heavy load:
<snip>
commit 776a33ed63f0f15b5b3f6254bcb927a45e37298d (HEAD -> master)
Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
Date: Fri May 19 11:35:35 2023 +0200
mm: vmalloc: Flush TLB per-va
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 9683573f1225..6ff95f3d1fa1 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1739,15 +1739,14 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
if (unlikely(list_empty(&local_purge_list)))
goto out;
- start = min(start,
- list_first_entry(&local_purge_list,
- struct vmap_area, list)->va_start);
+ /* OK. A per-cpu wants to flush an exact range. */
+ if (start != ULONG_MAX)
+ flush_tlb_kernel_range(start, end);
- end = max(end,
- list_last_entry(&local_purge_list,
- struct vmap_area, list)->va_end);
+ /* Flush per-VA. */
+ list_for_each_entry(va, &local_purge_list, list)
+ flush_tlb_kernel_range(va->va_start, va->va_end);
- flush_tlb_kernel_range(start, end);
resched_threshold = lazy_max_pages() << 1;
spin_lock(&free_vmap_area_lock);
<snip>
There are at least two observation:
1. asm_sysvec_call_function adds extra 12% in therms of cycles
# per-VA TLB flush
- 12.00% native_queued_spin_lock_slowpath ▒
- 11.90% asm_sysvec_call_function ▒
- sysvec_call_function ▒
__sysvec_call_function ▒
- __flush_smp_call_function_queue ▒
- 1.64% __flush_tlb_all ▒
native_flush_tlb_global ▒
native_write_cr4 ▒
# default
0.18% 0.16% [kernel] [k] asm_sysvec_call_function
2. Memory footprint grows(under heavy load) because the TLB-flush + extra lazy-list scan
take longer time.
Hope it could be somehow useful for you.
--
Uladzislau Rezki
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply related [flat|nested] 75+ messages in thread
* Re: Excessive TLB flush ranges
2023-05-17 10:52 ` Baoquan He
@ 2023-05-19 11:22 ` Thomas Gleixner
2023-05-19 11:49 ` Baoquan He
0 siblings, 1 reply; 75+ messages in thread
From: Thomas Gleixner @ 2023-05-19 11:22 UTC (permalink / raw)
To: Baoquan He
Cc: Russell King (Oracle), Andrew Morton, linux-mm, Christoph Hellwig,
Uladzislau Rezki, Lorenzo Stoakes, Peter Zijlstra, John Ogness,
linux-arm-kernel, Mark Rutland, Marc Zyngier, x86
On Wed, May 17 2023 at 18:52, Baoquan He wrote:
> On 05/17/23 at 11:38am, Thomas Gleixner wrote:
>> On Tue, May 16 2023 at 21:03, Thomas Gleixner wrote:
>> >
>> > Aside of that, if I read the code correctly then if there is an unmap
>> > via vb_free() which does not cover the whole vmap block then vb->dirty
>> > is set and every _vm_unmap_aliases() invocation flushes that dirty range
>> > over and over until that vmap block is completely freed, no?
>>
>> Something like the below would cure that.
>>
>> While it prevents that this is flushed forever it does not cure the
>> eventually overly broad flush when the block is completely dirty and
>> purged:
>>
>> Assume a block with 1024 pages, where 1022 pages are already freed and
>> TLB flushed. Now the last 2 pages are freed and the block is purged,
>> which results in a flush of 1024 pages where 1022 are already done,
>> right?
>
> This is good idea, I am thinking how to reply to your last mail and how
> to fix this. While your cure code may not work well. Please see below
> inline comment.
See below.
> One vmap block has 64 pages.
> #define VMAP_MAX_ALLOC BITS_PER_LONG /* 256K with 4K pages */
No, VMAP_MAX_ALLOC is the allocation limit for a single vb_alloc().
On 64bit it has at least 128 pages, but can have up to 1024:
#define VMAP_BBMAP_BITS_MAX 1024 /* 4MB with 4K pages */
#define VMAP_BBMAP_BITS_MIN (VMAP_MAX_ALLOC*2)
and then some magic happens to calculate the actual size
#define VMAP_BBMAP_BITS \
VMAP_MIN(VMAP_BBMAP_BITS_MAX, \
VMAP_MAX(VMAP_BBMAP_BITS_MIN, \
VMALLOC_PAGES / roundup_pow_of_two(NR_CPUS) / 16))
which is in a range of (2*BITS_PER_LONG) ... 1024.
The actual vmap block size is:
#define VMAP_BLOCK_SIZE (VMAP_BBMAP_BITS * PAGE_SIZE)
Which is then obviously something between 512k and 4MB on 64bit and
between 256k and 4MB on 32bit.
>> @@ -2240,13 +2240,17 @@ static void _vm_unmap_aliases(unsigned l
>> rcu_read_lock();
>> list_for_each_entry_rcu(vb, &vbq->free, free_list) {
>> spin_lock(&vb->lock);
>> - if (vb->dirty && vb->dirty != VMAP_BBMAP_BITS) {
>> + if (vb->dirty_max && vb->dirty != VMAP_BBMAP_BITS) {
>> unsigned long va_start = vb->va->va_start;
>> unsigned long s, e;
>
> When vb_free() is invoked, it could cause three kinds of vmap_block as
> below. Your code works well for the 2nd case, for the 1st one, it may be
> not. And the 2nd one is the stuff that we reclaim and put into purge
> list in purge_fragmented_blocks_allcpus().
>
> 1)
> |-----|------------|-----------|-------|
> |dirty|still mapped| dirty | free |
>
> 2)
> |------------------------------|-------|
> | dirty | free |
You sure? The first one is put into the purge list too.
/* Expand dirty range */
vb->dirty_min = min(vb->dirty_min, offset);
vb->dirty_max = max(vb->dirty_max, offset + (1UL << order));
pages bits dirtymin dirtymax
vb_alloc(A) 2 0 - 1 VMAP_BBMAP_BITS 0
vb_alloc(B) 4 2 - 5
vb_alloc(C) 2 6 - 7
So you get three variants:
1) Flush after freeing A
vb_free(A) 2 0 - 1 0 1
Flush VMAP_BBMAP_BITS 0 <- correct
vb_free(C) 2 6 - 7 6 7
Flush VMAP_BBMAP_BITS 0 <- correct
2) No flush between freeing A and C
vb_free(A) 2 0 - 1 0 1
vb_free(C) 2 6 - 7 0 7
Flush VMAP_BBMAP_BITS 0 <- overbroad flush
3) No flush between freeing A, C, B
vb_free(A) 2 0 - 1 0 1
vb_free(C) 2 6 - 7 0 7
vb_free(C) 2 2 - 5 0 7
Flush VMAP_BBMAP_BITS 0 <- correct
So my quick hack makes it correct for #1 and #3 and prevents repeated
flushes of already flushed areas.
To prevent #2 you need a bitmap which keeps track of the flushed areas.
Thanks,
tglx
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Excessive TLB flush ranges
2023-05-19 11:22 ` Thomas Gleixner
@ 2023-05-19 11:49 ` Baoquan He
2023-05-19 14:13 ` Thomas Gleixner
0 siblings, 1 reply; 75+ messages in thread
From: Baoquan He @ 2023-05-19 11:49 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Russell King (Oracle), Andrew Morton, linux-mm, Christoph Hellwig,
Uladzislau Rezki, Lorenzo Stoakes, Peter Zijlstra, John Ogness,
linux-arm-kernel, Mark Rutland, Marc Zyngier, x86
On 05/19/23 at 01:22pm, Thomas Gleixner wrote:
> On Wed, May 17 2023 at 18:52, Baoquan He wrote:
> > On 05/17/23 at 11:38am, Thomas Gleixner wrote:
> >> On Tue, May 16 2023 at 21:03, Thomas Gleixner wrote:
> >> >
> >> > Aside of that, if I read the code correctly then if there is an unmap
> >> > via vb_free() which does not cover the whole vmap block then vb->dirty
> >> > is set and every _vm_unmap_aliases() invocation flushes that dirty range
> >> > over and over until that vmap block is completely freed, no?
> >>
> >> Something like the below would cure that.
> >>
> >> While it prevents that this is flushed forever it does not cure the
> >> eventually overly broad flush when the block is completely dirty and
> >> purged:
> >>
> >> Assume a block with 1024 pages, where 1022 pages are already freed and
> >> TLB flushed. Now the last 2 pages are freed and the block is purged,
> >> which results in a flush of 1024 pages where 1022 are already done,
> >> right?
> >
> > This is good idea, I am thinking how to reply to your last mail and how
> > to fix this. While your cure code may not work well. Please see below
> > inline comment.
>
> See below.
>
> > One vmap block has 64 pages.
> > #define VMAP_MAX_ALLOC BITS_PER_LONG /* 256K with 4K pages */
>
> No, VMAP_MAX_ALLOC is the allocation limit for a single vb_alloc().
>
> On 64bit it has at least 128 pages, but can have up to 1024:
>
> #define VMAP_BBMAP_BITS_MAX 1024 /* 4MB with 4K pages */
> #define VMAP_BBMAP_BITS_MIN (VMAP_MAX_ALLOC*2)
>
> and then some magic happens to calculate the actual size
>
> #define VMAP_BBMAP_BITS \
> VMAP_MIN(VMAP_BBMAP_BITS_MAX, \
> VMAP_MAX(VMAP_BBMAP_BITS_MIN, \
> VMALLOC_PAGES / roundup_pow_of_two(NR_CPUS) / 16))
>
> which is in a range of (2*BITS_PER_LONG) ... 1024.
>
> The actual vmap block size is:
>
> #define VMAP_BLOCK_SIZE (VMAP_BBMAP_BITS * PAGE_SIZE)
You are right, it's 1024. I was dizzy at that time.
>
> Which is then obviously something between 512k and 4MB on 64bit and
> between 256k and 4MB on 32bit.
>
> >> @@ -2240,13 +2240,17 @@ static void _vm_unmap_aliases(unsigned l
> >> rcu_read_lock();
> >> list_for_each_entry_rcu(vb, &vbq->free, free_list) {
> >> spin_lock(&vb->lock);
> >> - if (vb->dirty && vb->dirty != VMAP_BBMAP_BITS) {
> >> + if (vb->dirty_max && vb->dirty != VMAP_BBMAP_BITS) {
> >> unsigned long va_start = vb->va->va_start;
> >> unsigned long s, e;
> >
> > When vb_free() is invoked, it could cause three kinds of vmap_block as
> > below. Your code works well for the 2nd case, for the 1st one, it may be
> > not. And the 2nd one is the stuff that we reclaim and put into purge
> > list in purge_fragmented_blocks_allcpus().
> >
> > 1)
> > |-----|------------|-----------|-------|
> > |dirty|still mapped| dirty | free |
> >
> > 2)
> > |------------------------------|-------|
> > | dirty | free |
>
>
> You sure? The first one is put into the purge list too.
No way. You don't copy the essential code here. The key line is
calculation of vb->dirty. ->dirty_min and ->dirty_max only provides a
loose vlaue for calculating the flush range. Counting more or less page
of ->dirty_min or ->dirty_max won't impact much, just make flush do
some menningless operation. While counting ->dirty wrong will cause
serious problem. If you put case 1 into purge list, freeing it later
will fail because you can't find it in vmap_area_root tree. Please check
vfree() and remove_vm_area().
/* Expand dirty range */
vb->dirty_min = min(vb->dirty_min, offset);
vb->dirty_max = max(vb->dirty_max, offset + (1UL << order));
vb->dirty += 1UL << order;
Plesae note the judgement of the 2nd case as below:
Means there's only free and dirty, and diryt doesn't reach
VMAP_BBMAP_BITS, it's case 2.
(vb->free + vb->dirty == VMAP_BBMAP_BITS && vb->dirty != VMAP_BBMAP_BITS)
By the way, I made a RFC patchset based on your patch, and your earlier
mail in which you raised some questions. I will add it here, please help
check if it's worth posting for discussing and reviewing.
>
> /* Expand dirty range */
> vb->dirty_min = min(vb->dirty_min, offset);
> vb->dirty_max = max(vb->dirty_max, offset + (1UL << order));
>
> pages bits dirtymin dirtymax
> vb_alloc(A) 2 0 - 1 VMAP_BBMAP_BITS 0
> vb_alloc(B) 4 2 - 5
> vb_alloc(C) 2 6 - 7
>
> So you get three variants:
>
> 1) Flush after freeing A
>
> vb_free(A) 2 0 - 1 0 1
> Flush VMAP_BBMAP_BITS 0 <- correct
> vb_free(C) 2 6 - 7 6 7
> Flush VMAP_BBMAP_BITS 0 <- correct
>
>
> 2) No flush between freeing A and C
>
> vb_free(A) 2 0 - 1 0 1
> vb_free(C) 2 6 - 7 0 7
> Flush VMAP_BBMAP_BITS 0 <- overbroad flush
>
>
> 3) No flush between freeing A, C, B
>
> vb_free(A) 2 0 - 1 0 1
> vb_free(C) 2 6 - 7 0 7
> vb_free(C) 2 2 - 5 0 7
> Flush VMAP_BBMAP_BITS 0 <- correct
>
> So my quick hack makes it correct for #1 and #3 and prevents repeated
> flushes of already flushed areas.
>
> To prevent #2 you need a bitmap which keeps track of the flushed areas.
I made a draft patchset based on your earlier mail,
>
> Thanks,
>
> tglx
>
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Excessive TLB flush ranges
2023-05-17 22:57 ` Nadav Amit
@ 2023-05-19 11:49 ` Thomas Gleixner
0 siblings, 0 replies; 75+ messages in thread
From: Thomas Gleixner @ 2023-05-19 11:49 UTC (permalink / raw)
To: Nadav Amit
Cc: Uladzislau Rezki, Russell King (Oracle), Andrew Morton, linux-mm,
Christoph Hellwig, Lorenzo Stoakes, Peter Zijlstra, Baoquan He,
John Ogness, linux-arm-kernel, Mark Rutland, Marc Zyngier, x86
On Wed, May 17 2023 at 15:57, Nadav Amit wrote:
>> On May 17, 2023, at 3:31 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>> Maybe ARM[64] could do this smarter, but that would require to rewrite a
>> lot of code I assume.
>
> What you say makes sense - and I actually see that flush_tlb_page_nosync()
> needs a memory barrier.
>
> I just encountered recent patches that did the flushing on ARM in an
> async manner as I described. That is the reason I assumed it is more efficient.
>
> https://lore.kernel.org/linux-mm/20230410134352.4519-3-yangyicong@huawei.com/
This operates on a mm and is related to batched removal of user space
mappings.
That could be done for vmalloc too, but the life time tracking,
i.e. ensuring that a nosync flush has seen the final barrier which
ensures that the mapping is not longer visible might be slightly more
tricky. Especially with the full x86 centric code there.
Thanks,
tglx
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 75+ messages in thread
* [RFC PATCH 1/3] mm/vmalloc.c: try to flush vmap_area one by one
2023-05-16 19:03 ` Thomas Gleixner
2023-05-17 9:38 ` Thomas Gleixner
@ 2023-05-19 12:01 ` Baoquan He
2023-05-19 14:16 ` Thomas Gleixner
2023-05-19 12:02 ` [RFC PATCH 2/3] mm/vmalloc.c: Only flush VM_FLUSH_RESET_PERMS area immediately Baoquan He
2023-05-19 12:03 ` [RFC PATCH 3/3] mm/vmalloc.c: change _vm_unmap_aliases() to do purge firstly Baoquan He
3 siblings, 1 reply; 75+ messages in thread
From: Baoquan He @ 2023-05-19 12:01 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Russell King (Oracle), Andrew Morton, linux-mm, Christoph Hellwig,
Uladzislau Rezki, Lorenzo Stoakes, Peter Zijlstra, John Ogness,
linux-arm-kernel, Mark Rutland, Marc Zyngier, x86
In the current __purge_vmap_area_lazy(), when trying to flush TLB of
vmalloc area, it calculate the flushing the range with [min:max] of vas.
That calculated range could be big because of the gap between the vas.
E.g in below graph, there are only 12 (4 from va_1, 8 from va_2) pages.
While the calculated flush range is 58.
VA_1 VA_2
|....|-------------------------|............|
10 12 60 68
. mapped;
- not mapped.
Sometime the calculated flush range could be surprisingly huge because
the vas could cross two kernel virtual address area. E.g the vmalloc and
the kernel module area are very far away from each other on some
architectures.
So for systems which lack a full TLB flush, to flush a long range is
a big problem(it takes time). Flushing va one by one becomes necessary
in that case.
Hence, introduce flush_tlb_kernel_vas() to try to flush va one by one.
And add CONFIG_HAVE_FLUSH_TLB_KERNEL_VAS to indicate if a certain
architecture has provided a flush_tlb_kernel_vas() implementation.
Otherwise, take the old way to calculate and flush the whole range.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Baoquan He <bhe@redhat.com> #Fix error of 'undefined reference to `flush_tlb_kernel_vas''
---
arch/Kconfig | 4 ++++
arch/arm/Kconfig | 1 +
arch/arm/kernel/smp_tlb.c | 23 +++++++++++++++++++++++
arch/x86/Kconfig | 1 +
arch/x86/mm/tlb.c | 22 ++++++++++++++++++++++
include/linux/vmalloc.h | 8 ++++++++
mm/vmalloc.c | 32 ++++++++++++++++++++++----------
7 files changed, 81 insertions(+), 10 deletions(-)
diff --git a/arch/Kconfig b/arch/Kconfig
index 205fd23e0cad..ca5413f1e4e0 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -270,6 +270,10 @@ config ARCH_HAS_SET_MEMORY
config ARCH_HAS_SET_DIRECT_MAP
bool
+# Select if architecture provides flush_tlb_kernel_vas()
+config ARCH_HAS_FLUSH_TLB_KERNEL_VAS
+ bool
+
#
# Select if the architecture provides the arch_dma_set_uncached symbol to
# either provide an uncached segment alias for a DMA allocation, or
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 0fb4b218f665..c4de7f38f9a7 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -10,6 +10,7 @@ config ARM
select ARCH_HAS_DMA_WRITE_COMBINE if !ARM_DMA_MEM_BUFFERABLE
select ARCH_HAS_ELF_RANDOMIZE
select ARCH_HAS_FORTIFY_SOURCE
+ select ARCH_HAS_FLUSH_TLB_KERNEL_VAS
select ARCH_HAS_KEEPINITRD
select ARCH_HAS_KCOV
select ARCH_HAS_MEMBARRIER_SYNC_CORE
diff --git a/arch/arm/kernel/smp_tlb.c b/arch/arm/kernel/smp_tlb.c
index d4908b3736d8..22ec9b982cb1 100644
--- a/arch/arm/kernel/smp_tlb.c
+++ b/arch/arm/kernel/smp_tlb.c
@@ -7,6 +7,7 @@
#include <linux/preempt.h>
#include <linux/smp.h>
#include <linux/uaccess.h>
+#include <linux/vmalloc.h>
#include <asm/smp_plat.h>
#include <asm/tlbflush.h>
@@ -69,6 +70,19 @@ static inline void ipi_flush_tlb_kernel_range(void *arg)
local_flush_tlb_kernel_range(ta->ta_start, ta->ta_end);
}
+static inline void local_flush_tlb_kernel_vas(struct list_head *vmap_list)
+{
+ struct vmap_area *va;
+
+ list_for_each_entry(va, vmap_list, list)
+ local_flush_tlb_kernel_range(va->va_start, va->va_end);
+}
+
+static inline void ipi_flush_tlb_kernel_vas(void *arg)
+{
+ local_flush_tlb_kernel_vas(arg);
+}
+
static inline void ipi_flush_bp_all(void *ignored)
{
local_flush_bp_all();
@@ -244,6 +258,15 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end)
broadcast_tlb_a15_erratum();
}
+void flush_tlb_kernel_vas(struct list_head *vmap_list, unsigned long num_entries)
+{
+ if (tlb_ops_need_broadcast()) {
+ on_each_cpu(ipi_flush_tlb_kernel_vas, vmap_list, 1);
+ } else
+ local_flush_tlb_kernel_vas(vmap_list);
+ broadcast_tlb_a15_erratum();
+}
+
void flush_bp_all(void)
{
if (tlb_ops_need_broadcast())
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 53bab123a8ee..7d7a44810a0b 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -77,6 +77,7 @@ config X86
select ARCH_HAS_DEVMEM_IS_ALLOWED
select ARCH_HAS_EARLY_DEBUG if KGDB
select ARCH_HAS_ELF_RANDOMIZE
+ select ARCH_HAS_FLUSH_TLB_KERNEL_VAS
select ARCH_HAS_FAST_MULTIPLIER
select ARCH_HAS_FORTIFY_SOURCE
select ARCH_HAS_GCOV_PROFILE_ALL
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 267acf27480a..c39d77eb37e4 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -10,6 +10,7 @@
#include <linux/debugfs.h>
#include <linux/sched/smt.h>
#include <linux/task_work.h>
+#include <linux/vmalloc.h>
#include <asm/tlbflush.h>
#include <asm/mmu_context.h>
@@ -1081,6 +1082,27 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end)
}
}
+static void do_flush_tlb_vas(void *arg)
+{
+ struct list_head *vmap_list = arg;
+ struct vmap_area *va;
+ unsigned long addr;
+
+ list_for_each_entry(va, vmap_list, list) {
+ /* flush range by one by one 'invlpg' */
+ for (addr = va->va_start; addr < va->va_end; addr += PAGE_SIZE)
+ flush_tlb_one_kernel(addr);
+ }
+}
+
+void flush_tlb_kernel_vas(struct list_head *vmap_list, unsigned long num_entries)
+{
+ if (num_entries > tlb_single_page_flush_ceiling)
+ on_each_cpu(do_flush_tlb_all, NULL, 1);
+ else
+ on_each_cpu(do_flush_tlb_vas, vmap_list, 1);
+}
+
/*
* This can be used from process context to figure out what the value of
* CR3 is without needing to do a (slow) __read_cr3().
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index c720be70c8dd..a9a1e488261d 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -295,4 +295,12 @@ bool vmalloc_dump_obj(void *object);
static inline bool vmalloc_dump_obj(void *object) { return false; }
#endif
+#if defined(CONFIG_HAVE_FLUSH_TLB_KERNEL_VAS)
+void flush_tlb_kernel_vas(struct list_head *list, unsigned long num_entries);
+#else
+static inline void flush_tlb_kernel_vas(struct list_head *list, unsigned long num_entries)
+{
+}
+#endif
+
#endif /* _LINUX_VMALLOC_H */
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index c0f80982eb06..31e8d9e93650 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1724,7 +1724,8 @@ static void purge_fragmented_blocks_allcpus(void);
*/
static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
{
- unsigned long resched_threshold;
+ unsigned long resched_threshold, num_entries = 0, num_alias_entries = 0;
+ struct vmap_area alias_va = { .va_start = start, .va_end = end };
unsigned int num_purged_areas = 0;
struct list_head local_purge_list;
struct vmap_area *va, *n_va;
@@ -1736,18 +1737,29 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
list_replace_init(&purge_vmap_area_list, &local_purge_list);
spin_unlock(&purge_vmap_area_lock);
- if (unlikely(list_empty(&local_purge_list)))
- goto out;
+ start = min(start, list_first_entry(&local_purge_list, struct vmap_area, list)->va_start);
+ end = max(end, list_last_entry(&local_purge_list, struct vmap_area, list)->va_end);
+
+ if (IS_ENABLED(CONFIG_HAVE_FLUSH_TLB_KERNEL_VAS)) {
+ list_for_each_entry(va, &local_purge_list, list)
+ num_entries += (va->va_end - va->va_start) >> PAGE_SHIFT;
+
+ if (unlikely(!num_entries))
+ goto out;
+
+ if (alias_va.va_end > alias_va.va_start) {
+ num_alias_entries = (alias_va.va_end - alias_va.va_start) >> PAGE_SHIFT;
+ list_add(&alias_va.list, &local_purge_list);
+ }
- start = min(start,
- list_first_entry(&local_purge_list,
- struct vmap_area, list)->va_start);
+ flush_tlb_kernel_vas(&local_purge_list, num_entries + num_alias_entries);
- end = max(end,
- list_last_entry(&local_purge_list,
- struct vmap_area, list)->va_end);
+ if (num_alias_entries)
+ list_del(&alias_va.list);
+ } else {
+ flush_tlb_kernel_range(start, end);
+ }
- flush_tlb_kernel_range(start, end);
resched_threshold = lazy_max_pages() << 1;
spin_lock(&free_vmap_area_lock);
--
2.34.1
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply related [flat|nested] 75+ messages in thread
* [RFC PATCH 2/3] mm/vmalloc.c: Only flush VM_FLUSH_RESET_PERMS area immediately
2023-05-16 19:03 ` Thomas Gleixner
2023-05-17 9:38 ` Thomas Gleixner
2023-05-19 12:01 ` [RFC PATCH 1/3] mm/vmalloc.c: try to flush vmap_area one by one Baoquan He
@ 2023-05-19 12:02 ` Baoquan He
2023-05-19 12:03 ` [RFC PATCH 3/3] mm/vmalloc.c: change _vm_unmap_aliases() to do purge firstly Baoquan He
3 siblings, 0 replies; 75+ messages in thread
From: Baoquan He @ 2023-05-19 12:02 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Russell King (Oracle), Andrew Morton, linux-mm, Christoph Hellwig,
Uladzislau Rezki, Lorenzo Stoakes, Peter Zijlstra, John Ogness,
linux-arm-kernel, Mark Rutland, Marc Zyngier, x86
When freeing vmalloc range mapping, only unmapping the page tables is
done, TLB flush is lazily deferred to a late stage until
lazy_max_pages() is met or vmalloc() can't find available virtual memory
region.
However, to free VM_FLUSH_RESET_PERMS memory of vmalloc, TLB flushing
need be done immediately before freeing pages, and the direct map
needs resetting permissions and TLB flushing. Please see commit
868b104d7379 ("mm/vmalloc: Add flag for freeing of special permsissions")
for more details.
In the current code, when freeing VM_FLUSH_RESET_PERMS memory, lazy
purge is also done to try to save a TLB flush later. When doing that, it
merges the direct map range with the percpu vbq dirty range and all
purge ranges by calculating flush range of [min:max]. That will add the
huge gap between direct map range and vmalloc range into the final TLB
flush range. So here, only flush VM_FLUSH_RESET_PERMS area immediately,
and leave the lazy flush to the normal points, e.g when allocating
a new vmap_area, or lazy_max_pages() is met.
Signed-off-by: Baoquan He <bhe@redhat.com>
---
mm/vmalloc.c | 25 +++++++++++++++++++++----
1 file changed, 21 insertions(+), 4 deletions(-)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 31e8d9e93650..87134dd8abc3 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2690,9 +2690,10 @@ static inline void set_area_direct_map(const struct vm_struct *area,
*/
static void vm_reset_perms(struct vm_struct *area)
{
- unsigned long start = ULONG_MAX, end = 0;
+ unsigned long start = ULONG_MAX, end = 0, pages = 0;
unsigned int page_order = vm_area_page_order(area);
- int flush_dmap = 0;
+ struct list_head local_flush_list;
+ struct vmap_area alias_va, va;
int i;
/*
@@ -2708,17 +2709,33 @@ static void vm_reset_perms(struct vm_struct *area)
page_size = PAGE_SIZE << page_order;
start = min(addr, start);
end = max(addr + page_size, end);
- flush_dmap = 1;
}
}
+ va.va_start = (unsigned long)area->addr;
+ va.va_end = (unsigned long)(area->addr + area->size);
/*
* Set direct map to something invalid so that it won't be cached if
* there are any accesses after the TLB flush, then flush the TLB and
* reset the direct map permissions to the default.
*/
set_area_direct_map(area, set_direct_map_invalid_noflush);
- _vm_unmap_aliases(start, end, flush_dmap);
+ if (IS_ENABLED(CONFIG_HAVE_FLUSH_TLB_KERNEL_VAS)) {
+ if (end > start) {
+ pages = (end - start) >> PAGE_SHIFT;
+ alias_va.va_start = (unsigned long)start;
+ alias_va.va_end = (unsigned long)end;
+ list_add(&alias_va.list, &local_flush_list);
+ }
+
+ pages += area->size >> PAGE_SHIFT;
+ list_add(&va.list, &local_flush_list);
+
+ flush_tlb_kernel_vas(&local_flush_list, pages);
+ } else {
+ flush_tlb_kernel_range(start, end);
+ flush_tlb_kernel_range(va.va_start, va.va_end);
+ }
set_area_direct_map(area, set_direct_map_default_noflush);
}
--
2.34.1
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply related [flat|nested] 75+ messages in thread
* [RFC PATCH 3/3] mm/vmalloc.c: change _vm_unmap_aliases() to do purge firstly
2023-05-16 19:03 ` Thomas Gleixner
` (2 preceding siblings ...)
2023-05-19 12:02 ` [RFC PATCH 2/3] mm/vmalloc.c: Only flush VM_FLUSH_RESET_PERMS area immediately Baoquan He
@ 2023-05-19 12:03 ` Baoquan He
2023-05-19 14:17 ` Thomas Gleixner
2023-05-19 18:38 ` Thomas Gleixner
3 siblings, 2 replies; 75+ messages in thread
From: Baoquan He @ 2023-05-19 12:03 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Russell King (Oracle), Andrew Morton, linux-mm, Christoph Hellwig,
Uladzislau Rezki, Lorenzo Stoakes, Peter Zijlstra, John Ogness,
linux-arm-kernel, Mark Rutland, Marc Zyngier, x86
After vb_free() invocation, the va will be purged and put into purge
tree/list if the entire vmap_block is dirty. If not entirely dirty, the
vmap_block is still in percpu vmap_block_queue list, just like below two
graphs:
(1)
|-----|------------|-----------|-------|
|dirty|still mapped| dirty | free |
2)
|------------------------------|-------|
| dirty | free |
In the current _vm_unmap_aliases(), to reclaim those unmapped range and
flush, it will iterate percpu vbq to calculate the range from vmap_block
like above two cases. Then call purge_fragmented_blocks_allcpus()
to purge the vmap_block in case 2 since no mapping exists right now,
and put these purged vmap_block va into purge tree/list. Then in
__purge_vmap_area_lazy(), it will continue calculating the flush range
from purge list. Obviously, this will take vmap_block va in the 2nd case
into account twice.
So here just move purge_fragmented_blocks_allcpus() up to purge the
vmap_block va of case 2 firstly, then only need iterate and count in
the dirty range in above 1st case. With the change, counting in the
dirty region of vmap_block in 1st case is now inside vmap_purge_lock
protection region, which makes the flush range calculation more
reasonable and accurate by avoiding concurrent operation in other cpu.
And also rename _vm_unmap_aliases() to vm_unmap_aliases(), since no
other caller except of the old vm_unmap_aliases().
Signed-off-by: Baoquan He <bhe@redhat.com>
---
mm/vmalloc.c | 45 ++++++++++++++++++++-------------------------
1 file changed, 20 insertions(+), 25 deletions(-)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 87134dd8abc3..9f7cbd6182ad 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2236,8 +2236,23 @@ static void vb_free(unsigned long addr, unsigned long size)
spin_unlock(&vb->lock);
}
-static void _vm_unmap_aliases(unsigned long start, unsigned long end, int flush)
+/**
+ * vm_unmap_aliases - unmap outstanding lazy aliases in the vmap layer
+ *
+ * The vmap/vmalloc layer lazily flushes kernel virtual mappings primarily
+ * to amortize TLB flushing overheads. What this means is that any page you
+ * have now, may, in a former life, have been mapped into kernel virtual
+ * address by the vmap layer and so there might be some CPUs with TLB entries
+ * still referencing that page (additional to the regular 1:1 kernel mapping).
+ *
+ * vm_unmap_aliases flushes all such lazy mappings. After it returns, we can
+ * be sure that none of the pages we have control over will have any aliases
+ * from the vmap layer.
+ */
+void vm_unmap_aliases(void)
{
+ unsigned long start = ULONG_MAX, end = 0;
+ bool flush = false;
int cpu;
if (unlikely(!vmap_initialized))
@@ -2245,6 +2260,9 @@ static void _vm_unmap_aliases(unsigned long start, unsigned long end, int flush)
might_sleep();
+ mutex_lock(&vmap_purge_lock);
+ purge_fragmented_blocks_allcpus();
+
for_each_possible_cpu(cpu) {
struct vmap_block_queue *vbq = &per_cpu(vmap_block_queue, cpu);
struct vmap_block *vb;
@@ -2262,40 +2280,17 @@ static void _vm_unmap_aliases(unsigned long start, unsigned long end, int flush)
start = min(s, start);
end = max(e, end);
- flush = 1;
+ flush = true;
}
spin_unlock(&vb->lock);
}
rcu_read_unlock();
}
- mutex_lock(&vmap_purge_lock);
- purge_fragmented_blocks_allcpus();
if (!__purge_vmap_area_lazy(start, end) && flush)
flush_tlb_kernel_range(start, end);
mutex_unlock(&vmap_purge_lock);
}
-
-/**
- * vm_unmap_aliases - unmap outstanding lazy aliases in the vmap layer
- *
- * The vmap/vmalloc layer lazily flushes kernel virtual mappings primarily
- * to amortize TLB flushing overheads. What this means is that any page you
- * have now, may, in a former life, have been mapped into kernel virtual
- * address by the vmap layer and so there might be some CPUs with TLB entries
- * still referencing that page (additional to the regular 1:1 kernel mapping).
- *
- * vm_unmap_aliases flushes all such lazy mappings. After it returns, we can
- * be sure that none of the pages we have control over will have any aliases
- * from the vmap layer.
- */
-void vm_unmap_aliases(void)
-{
- unsigned long start = ULONG_MAX, end = 0;
- int flush = 0;
-
- _vm_unmap_aliases(start, end, flush);
-}
EXPORT_SYMBOL_GPL(vm_unmap_aliases);
/**
--
2.34.1
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply related [flat|nested] 75+ messages in thread
* Re: Excessive TLB flush ranges
2023-05-16 9:03 ` Thomas Gleixner
2023-05-16 10:05 ` Baoquan He
@ 2023-05-19 13:49 ` Thomas Gleixner
1 sibling, 0 replies; 75+ messages in thread
From: Thomas Gleixner @ 2023-05-19 13:49 UTC (permalink / raw)
To: Russell King (Oracle)
Cc: Andrew Morton, linux-mm, Christoph Hellwig, Uladzislau Rezki,
Lorenzo Stoakes, Peter Zijlstra, Baoquan He, John Ogness,
linux-arm-kernel, Mark Rutland, Marc Zyngier, x86, Nadav Amit
On Tue, May 16 2023 at 11:03, Thomas Gleixner wrote:
> On Tue, May 16 2023 at 09:27, Russell King wrote:
>> implementation, there is flush_tlb_kernel_page() though, which I
>> think is what you're referring to above. On ARM32, that will issue
>> one IPI each time it's called, and possibly another IPI for the
>> Cortex-A15 erratum.
>>
>> Given that, flush_tlb_kernel_range() is still going to be more
>> efficient on ARM32 when tlb_ops_need_broadcast() is true than doing
>> it page by page.
>
> Something like the untested below?
The list based individual flush wins over a tlb_flush_all() in that
particular scenario. It's almost in the noise while the TLB flush all
has an impact of ~1% on the other CPUs computation runtime.
That's just a quick check w/o deeper analysis, but that clearly shows
that there is potential.
Whether the list or as Nadav suggested some other storage turns out to
be also benefitial for IPI based flushing is still subject to
investigation.
Thanks,
tglx
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Excessive TLB flush ranges
2023-05-19 11:49 ` Baoquan He
@ 2023-05-19 14:13 ` Thomas Gleixner
0 siblings, 0 replies; 75+ messages in thread
From: Thomas Gleixner @ 2023-05-19 14:13 UTC (permalink / raw)
To: Baoquan He
Cc: Russell King (Oracle), Andrew Morton, linux-mm, Christoph Hellwig,
Uladzislau Rezki, Lorenzo Stoakes, Peter Zijlstra, John Ogness,
linux-arm-kernel, Mark Rutland, Marc Zyngier, x86
On Fri, May 19 2023 at 19:49, Baoquan He wrote:
> On 05/19/23 at 01:22pm, Thomas Gleixner wrote:
>> >
>> > When vb_free() is invoked, it could cause three kinds of vmap_block as
>> > below. Your code works well for the 2nd case, for the 1st one, it may be
>> > not. And the 2nd one is the stuff that we reclaim and put into purge
>> > list in purge_fragmented_blocks_allcpus().
>> >
>> > 1)
>> > |-----|------------|-----------|-------|
>> > |dirty|still mapped| dirty | free |
>> >
>> > 2)
>> > |------------------------------|-------|
>> > | dirty | free |
>>
>>
>> You sure? The first one is put into the purge list too.
>
> No way. You don't copy the essential code here. The key line is
> calculation of vb->dirty. ->dirty_min and ->dirty_max only provides a
> loose vlaue for calculating the flush range. Counting more or less page
> of ->dirty_min or ->dirty_max won't impact much, just make flush do
> some menningless operation. While counting ->dirty wrong will cause
> serious problem. If you put case 1 into purge list, freeing it later
> will fail because you can't find it in vmap_area_root tree. Please check
> vfree() and remove_vm_area().
I misread your mail. My point was soleley about the ranges which get
flushed via unmap_aliases() before the block is completely freed,
i.e. ->free = 0 and ->dirty = VMAP_BBMAP_BITS.
Thanks,
tglx
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC PATCH 1/3] mm/vmalloc.c: try to flush vmap_area one by one
2023-05-19 12:01 ` [RFC PATCH 1/3] mm/vmalloc.c: try to flush vmap_area one by one Baoquan He
@ 2023-05-19 14:16 ` Thomas Gleixner
0 siblings, 0 replies; 75+ messages in thread
From: Thomas Gleixner @ 2023-05-19 14:16 UTC (permalink / raw)
To: Baoquan He
Cc: Russell King (Oracle), Andrew Morton, linux-mm, Christoph Hellwig,
Uladzislau Rezki, Lorenzo Stoakes, Peter Zijlstra, John Ogness,
linux-arm-kernel, Mark Rutland, Marc Zyngier, x86, Nadav Amit
On Fri, May 19 2023 at 20:01, Baoquan He wrote:
> +
> + if (IS_ENABLED(CONFIG_HAVE_FLUSH_TLB_KERNEL_VAS)) {
> + list_for_each_entry(va, &local_purge_list, list)
> + num_entries += (va->va_end - va->va_start) >>
> PAGE_SHIFT;
That's still doing he list thing and as Nadav pointed out, we want
something else there. I'm working on it.
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC PATCH 3/3] mm/vmalloc.c: change _vm_unmap_aliases() to do purge firstly
2023-05-19 12:03 ` [RFC PATCH 3/3] mm/vmalloc.c: change _vm_unmap_aliases() to do purge firstly Baoquan He
@ 2023-05-19 14:17 ` Thomas Gleixner
2023-05-19 18:38 ` Thomas Gleixner
1 sibling, 0 replies; 75+ messages in thread
From: Thomas Gleixner @ 2023-05-19 14:17 UTC (permalink / raw)
To: Baoquan He
Cc: Russell King (Oracle), Andrew Morton, linux-mm, Christoph Hellwig,
Uladzislau Rezki, Lorenzo Stoakes, Peter Zijlstra, John Ogness,
linux-arm-kernel, Mark Rutland, Marc Zyngier, x86
On Fri, May 19 2023 at 20:03, Baoquan He wrote:
> @@ -2245,6 +2260,9 @@ static void _vm_unmap_aliases(unsigned long start, unsigned long end, int flush)
>
> might_sleep();
>
> + mutex_lock(&vmap_purge_lock);
> + purge_fragmented_blocks_allcpus();
> +
> for_each_possible_cpu(cpu) {
That still does TWO iterations over all possible CPUs, while this really
could be one.
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Excessive TLB flush ranges
2023-05-19 10:01 ` Uladzislau Rezki
@ 2023-05-19 14:56 ` Thomas Gleixner
2023-05-19 15:14 ` Uladzislau Rezki
0 siblings, 1 reply; 75+ messages in thread
From: Thomas Gleixner @ 2023-05-19 14:56 UTC (permalink / raw)
To: Uladzislau Rezki
Cc: Uladzislau Rezki, Russell King (Oracle), Andrew Morton, linux-mm,
Christoph Hellwig, Lorenzo Stoakes, Peter Zijlstra, Baoquan He,
John Ogness, linux-arm-kernel, Mark Rutland, Marc Zyngier, x86,
Nadav Amit
On Fri, May 19 2023 at 12:01, Uladzislau Rezki wrote:
> On Wed, May 17, 2023 at 06:32:25PM +0200, Thomas Gleixner wrote:
>> That made me look into this coalescing code. I understand why you want
>> to batch and coalesce and rather do a rare full tlb flush than sending
>> gazillions of IPIs.
>>
> Your issues has no connections with merging. But the place you looked
> was correct :)
I'm not talking about merging. I'm talking about coalescing ranges.
start = 0x95c8d000 end = 0x95c8e000
plus the VA from list which has
start = 0xf08a1000 end = 0xf08a5000
which results in a flush range of:
start = 0x95c8d000 end = 0xf08a5000
No?
> @@ -1739,15 +1739,14 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
> if (unlikely(list_empty(&local_purge_list)))
> goto out;
>
> - start = min(start,
> - list_first_entry(&local_purge_list,
> - struct vmap_area, list)->va_start);
> + /* OK. A per-cpu wants to flush an exact range. */
> + if (start != ULONG_MAX)
> + flush_tlb_kernel_range(start, end);
>
> - end = max(end,
> - list_last_entry(&local_purge_list,
> - struct vmap_area, list)->va_end);
> + /* Flush per-VA. */
> + list_for_each_entry(va, &local_purge_list, list)
> + flush_tlb_kernel_range(va->va_start, va->va_end);
>
> - flush_tlb_kernel_range(start, end);
> resched_threshold = lazy_max_pages() << 1;
That's completely wrong, really.
For the above case, which is easily enough to reproduce, this ends up
doing TWO IPIs on x86, which is worse than ONE IPI which ends up with a
flush all.
Aside of that if there are two VAs in the purge list and both are over
the threshold for doing a full flush then you end up with TWO flush all
IPIs in a row, which completely defeats the purpose of this whole
exercise.
As I demonstrated with the list approach for the above scenario this
avoids a full flush and needs only one IPI. Nadavs observation vs. the
list aside, this is clearly better than what you are proposing here.
The IPI cost on x86 is equally bad as the full barriers on arm[64].
Thanks,
tglx
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Excessive TLB flush ranges
2023-05-19 14:56 ` Thomas Gleixner
@ 2023-05-19 15:14 ` Uladzislau Rezki
2023-05-19 16:32 ` Thomas Gleixner
0 siblings, 1 reply; 75+ messages in thread
From: Uladzislau Rezki @ 2023-05-19 15:14 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Uladzislau Rezki, Russell King (Oracle), Andrew Morton, linux-mm,
Christoph Hellwig, Lorenzo Stoakes, Peter Zijlstra, Baoquan He,
John Ogness, linux-arm-kernel, Mark Rutland, Marc Zyngier, x86,
Nadav Amit
On Fri, May 19, 2023 at 04:56:53PM +0200, Thomas Gleixner wrote:
> On Fri, May 19 2023 at 12:01, Uladzislau Rezki wrote:
> > On Wed, May 17, 2023 at 06:32:25PM +0200, Thomas Gleixner wrote:
> >> That made me look into this coalescing code. I understand why you want
> >> to batch and coalesce and rather do a rare full tlb flush than sending
> >> gazillions of IPIs.
> >>
> > Your issues has no connections with merging. But the place you looked
> > was correct :)
>
> I'm not talking about merging. I'm talking about coalescing ranges.
>
> start = 0x95c8d000 end = 0x95c8e000
>
> plus the VA from list which has
>
> start = 0xf08a1000 end = 0xf08a5000
>
> which results in a flush range of:
>
> start = 0x95c8d000 end = 0xf08a5000
> No?
>
Correct. 0x95c8d000 is a min, 0xf08a5000 is a max.
> > @@ -1739,15 +1739,14 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
> > if (unlikely(list_empty(&local_purge_list)))
> > goto out;
> >
> > - start = min(start,
> > - list_first_entry(&local_purge_list,
> > - struct vmap_area, list)->va_start);
> > + /* OK. A per-cpu wants to flush an exact range. */
> > + if (start != ULONG_MAX)
> > + flush_tlb_kernel_range(start, end);
> >
> > - end = max(end,
> > - list_last_entry(&local_purge_list,
> > - struct vmap_area, list)->va_end);
> > + /* Flush per-VA. */
> > + list_for_each_entry(va, &local_purge_list, list)
> > + flush_tlb_kernel_range(va->va_start, va->va_end);
> >
> > - flush_tlb_kernel_range(start, end);
> > resched_threshold = lazy_max_pages() << 1;
>
> That's completely wrong, really.
>
Absolutely. That is why we do not flush a range per-VA ;-) I provided the
data just to show what happens if we do it! A per-VA flushing works when
a system is not capable of doing a full flush, so it has to do it page
by page. In this scenario we should bypass ranges(not mapped) which are
between VAs in a purge-list.
--
Uladzislau Rezki
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Excessive TLB flush ranges
2023-05-19 15:14 ` Uladzislau Rezki
@ 2023-05-19 16:32 ` Thomas Gleixner
2023-05-19 17:02 ` Uladzislau Rezki
0 siblings, 1 reply; 75+ messages in thread
From: Thomas Gleixner @ 2023-05-19 16:32 UTC (permalink / raw)
To: Uladzislau Rezki
Cc: Uladzislau Rezki, Russell King (Oracle), Andrew Morton, linux-mm,
Christoph Hellwig, Lorenzo Stoakes, Peter Zijlstra, Baoquan He,
John Ogness, linux-arm-kernel, Mark Rutland, Marc Zyngier, x86,
Nadav Amit
On Fri, May 19 2023 at 17:14, Uladzislau Rezki wrote:
> On Fri, May 19, 2023 at 04:56:53PM +0200, Thomas Gleixner wrote:
>> > + /* Flush per-VA. */
>> > + list_for_each_entry(va, &local_purge_list, list)
>> > + flush_tlb_kernel_range(va->va_start, va->va_end);
>> >
>> > - flush_tlb_kernel_range(start, end);
>> > resched_threshold = lazy_max_pages() << 1;
>>
>> That's completely wrong, really.
>>
> Absolutely. That is why we do not flush a range per-VA ;-) I provided the
> data just to show what happens if we do it!
Seriously, you think you need to demonstrate that to me? Did you
actually read what I wrote?
"I understand why you want to batch and coalesce and rather do a rare
full tlb flush than sending gazillions of IPIs."
> A per-VA flushing works when a system is not capable of doing a full
> flush, so it has to do it page by page. In this scenario we should
> bypass ranges(not mapped) which are between VAs in a purge-list.
ARM32 has a full flush as does x86. Just ARM32 does not have a cutoff
for a full flush in flush_tlb_kernel_range(). That's easily fixable, but
the underlying problem remains.
The point is that coalescing the VA ranges blindly is also fundamentally
wrong:
start1 = 0x95c8d000 end1 = 0x95c8e000
start2 = 0xf08a1000 end2 = 0xf08a5000
--> start = 0x95c8d000 end = 0xf08a5000
So this ends up with:
if (end - start > flush_all_threshold)
ipi_flush_all();
else
ipi_flush_range();
So with the above example this ends up with flush_all(), but a
flush_vas() as I demonstrated with the list approach (ignore the storage
problem which is fixable) this results in
if (total_nr_pages > flush_all_threshold)
ipi_flush_all();
else
ipi_flush_vas();
and that ipi flushes 3 pages instead of taking out the whole TLB, which
results in a 1% gain on that machine. Not massive, but still.
The blind coalescing is also wrong if the resulting range is not giantic
but below the flush_all_threshold. Lets assume a threshold of 32 pages.
start1 = 0xf0800000 end1 = 0xf0802000 2 pages
start2 = 0xf081e000 end2 = 0xf0820000 2 pages
--> start = 0xf0800000 end = 0xf0820000
So because this does not qualify for a full flush and it should not,
this ends up flushing 32 pages one by one instead of flushing exactly
four.
IOW, the existing code is fully biased towards full flushes which is
wrong.
Just because this does not show up in your performance numbers on some
enterprise workload does not make it more correct.
Thanks,
tglx
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Excessive TLB flush ranges
2023-05-19 16:32 ` Thomas Gleixner
@ 2023-05-19 17:02 ` Uladzislau Rezki
0 siblings, 0 replies; 75+ messages in thread
From: Uladzislau Rezki @ 2023-05-19 17:02 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Uladzislau Rezki, Russell King (Oracle), Andrew Morton, linux-mm,
Christoph Hellwig, Lorenzo Stoakes, Peter Zijlstra, Baoquan He,
John Ogness, linux-arm-kernel, Mark Rutland, Marc Zyngier, x86,
Nadav Amit
On Fri, May 19, 2023 at 06:32:42PM +0200, Thomas Gleixner wrote:
> On Fri, May 19 2023 at 17:14, Uladzislau Rezki wrote:
> > On Fri, May 19, 2023 at 04:56:53PM +0200, Thomas Gleixner wrote:
> >> > + /* Flush per-VA. */
> >> > + list_for_each_entry(va, &local_purge_list, list)
> >> > + flush_tlb_kernel_range(va->va_start, va->va_end);
> >> >
> >> > - flush_tlb_kernel_range(start, end);
> >> > resched_threshold = lazy_max_pages() << 1;
> >>
> >> That's completely wrong, really.
> >>
> > Absolutely. That is why we do not flush a range per-VA ;-) I provided the
> > data just to show what happens if we do it!
>
> Seriously, you think you need to demonstrate that to me? Did you
> actually read what I wrote?
>
> "I understand why you want to batch and coalesce and rather do a rare
> full tlb flush than sending gazillions of IPIs."
>
Yes i read it. Since i also mentioned about IPI and did not provide any
data, i did it later, just in case. I shared my observation and that is it.
> > A per-VA flushing works when a system is not capable of doing a full
> > flush, so it has to do it page by page. In this scenario we should
> > bypass ranges(not mapped) which are between VAs in a purge-list.
>
> ARM32 has a full flush as does x86. Just ARM32 does not have a cutoff
> for a full flush in flush_tlb_kernel_range(). That's easily fixable, but
> the underlying problem remains.
>
> The point is that coalescing the VA ranges blindly is also fundamentally
> wrong:
>
>
> start1 = 0x95c8d000 end1 = 0x95c8e000
> start2 = 0xf08a1000 end2 = 0xf08a5000
>
> --> start = 0x95c8d000 end = 0xf08a5000
>
> So this ends up with:
>
> if (end - start > flush_all_threshold)
> ipi_flush_all();
> else
> ipi_flush_range();
>
> So with the above example this ends up with flush_all(), but a
> flush_vas() as I demonstrated with the list approach (ignore the storage
> problem which is fixable) this results in
>
> if (total_nr_pages > flush_all_threshold)
> ipi_flush_all();
> else
> ipi_flush_vas();
>
> and that ipi flushes 3 pages instead of taking out the whole TLB, which
> results in a 1% gain on that machine. Not massive, but still.
>
> The blind coalescing is also wrong if the resulting range is not giantic
> but below the flush_all_threshold. Lets assume a threshold of 32 pages.
>
> start1 = 0xf0800000 end1 = 0xf0802000 2 pages
> start2 = 0xf081e000 end2 = 0xf0820000 2 pages
>
> --> start = 0xf0800000 end = 0xf0820000
>
> So because this does not qualify for a full flush and it should not,
> this ends up flushing 32 pages one by one instead of flushing exactly
> four.
>
> IOW, the existing code is fully biased towards full flushes which is
> wrong.
>
> Just because this does not show up in your performance numbers on some
> enterprise workload does not make it more correct.
>
Usually we do a flush of lazy-areas once the lazy_max_pages() threshold
is reached. There are exceptions. When an allocation fails, we drain the
areas(if there are any), second is a per-cpu allocator and last one is
vm_reset_perms() when "vm" is marked as VM_FLUSH_RESET_PERMS.
As for your description i totally see the problem.
--
Uladzislau Rezki
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC PATCH 3/3] mm/vmalloc.c: change _vm_unmap_aliases() to do purge firstly
2023-05-19 12:03 ` [RFC PATCH 3/3] mm/vmalloc.c: change _vm_unmap_aliases() to do purge firstly Baoquan He
2023-05-19 14:17 ` Thomas Gleixner
@ 2023-05-19 18:38 ` Thomas Gleixner
2023-05-19 23:46 ` Baoquan He
1 sibling, 1 reply; 75+ messages in thread
From: Thomas Gleixner @ 2023-05-19 18:38 UTC (permalink / raw)
To: Baoquan He
Cc: Russell King (Oracle), Andrew Morton, linux-mm, Christoph Hellwig,
Uladzislau Rezki, Lorenzo Stoakes, Peter Zijlstra, John Ogness,
linux-arm-kernel, Mark Rutland, Marc Zyngier, x86
On Fri, May 19 2023 at 20:03, Baoquan He wrote:
> After vb_free() invocation, the va will be purged and put into purge
> tree/list if the entire vmap_block is dirty. If not entirely dirty, the
> vmap_block is still in percpu vmap_block_queue list, just like below two
> graphs:
>
> (1)
> |-----|------------|-----------|-------|
> |dirty|still mapped| dirty | free |
>
> 2)
> |------------------------------|-------|
> | dirty | free |
>
> In the current _vm_unmap_aliases(), to reclaim those unmapped range and
> flush, it will iterate percpu vbq to calculate the range from vmap_block
> like above two cases. Then call purge_fragmented_blocks_allcpus()
> to purge the vmap_block in case 2 since no mapping exists right now,
> and put these purged vmap_block va into purge tree/list. Then in
> __purge_vmap_area_lazy(), it will continue calculating the flush range
> from purge list. Obviously, this will take vmap_block va in the 2nd case
> into account twice.
Which made me look deeper into purge_fragmented_blocks()
list_for_each_entry_rcu(vb, &vbq->free, free_list) {
if (!(vb->free + vb->dirty == VMAP_BBMAP_BITS && vb->dirty != VMAP_BBMAP_BITS))
continue;
spin_lock(&vb->lock);
if (vb->free + vb->dirty == VMAP_BBMAP_BITS && vb->dirty != VMAP_BBMAP_BITS) {
That means if an allocation does not find something in the free list
then this can happen:
vaddr = vb_alloc(size)
vaddr = new_vmap_block(order, gfp_mask);
vb_free(vaddr, size)
vb->dirty = 1ULL << order;
purge_fragmented_blocks()
purge(most_recently_allocated_block);
vaddr = vb_alloc(size)
vaddr = new_vmap_block(order, gfp_mask);
How does that make sense?
That block would have hundreds of pages left and is the most recently
allocated. So the next vb_alloc() has to reallocate one instead of using
the one which was allocated just before.
This clearly lacks a free space check so that blocks which have more
free space than a certain threshold are not thrown away prematurely.
Maybe it wants an age check too, so that blocks which are unused for a
long time can be recycled, but that's an orthogonal issue.
That aside your patch does still not address what I pointed out to you
and what my patch cures:
pages bits dirtymin dirtymax
vb_alloc(A) 255 0 - 254 VMAP_BBMAP_BITS 0
vb_alloc(B) 255 255 - 509 VMAP_BBMAP_BITS 0
vb_alloc(C) 255 510 - 764 VMAP_BBMAP_BITS 0
vb_alloc(D) 255 765 - 1020 VMAP_BBMAP_BITS 0
The block stays on the free list because there are still 4 pages left
and it stays there until either _all_ free space is used or _all_
allocated space is freed.
Now the first allocation gets freed:
vb_free(A) 255 0 - 254 0 254
From there on _every_ invocation of __purge_vmap_area_lazy() will see
this range as long as the block is on the free list:
list_for_each_entry_rcu(vb, &vbq->free, free_list) {
spin_lock(&vb->lock);
if (vb->dirty && vb->dirty != VMAP_BBMAP_BITS) {
because this condition is true. So this flushes the same range over and
over, no?
This flush range gets larger over time the more allocations are freed up
to the point where the block vanishes from the free list.
By resetting vb->dirty_min/max the freed range is only flushed _once_,
no? The resulting flush range might still be excessively large as I
pointed out before:
1) Flush after freeing A
vb_free(A) 2 0 - 1 0 1
Flush VMAP_BBMAP_BITS 0 <- correct
vb_free(C) 2 6 - 7 6 7
Flush VMAP_BBMAP_BITS 0 <- correct
2) No flush between freeing A and C
vb_free(A) 2 0 - 1 0 1
vb_free(C) 2 6 - 7 0 7
Flush VMAP_BBMAP_BITS 0 <- overbroad flush
3) No flush between freeing A, C, B
vb_free(A) 2 0 - 1 0 1
vb_free(B) 2 6 - 7 0 7
vb_free(C) 2 2 - 5 0 7
Flush VMAP_BBMAP_BITS 0 <- correct
Obviously case 2 could be
vb_free(A) 2 0 - 1 0 1
vb_free(X) 2 1000 - 1001 1000 1001
So that flush via purge_vmap_area_list() will ask to flush 1002 pages
instead of 4, right? Again, that does not make sense.
The other issue I pointed out:
Assume the block has (for simplicity) 255 allocations size of 4 pages,
again free space of 4 pages.
254 allocations are freed, which means there is one remaining
mapped. All 254 freed are flushed via __purge_vmap_area_lazy() over
time.
Now the last allocation is freed and the block is moved to the
purge_vmap_area_list, which then does a full flush of the complete area,
i.e. 4MB in that case, while in fact it only needs to flush 2 pages.
Also these intermediate flushes are inconsistent versus how fully
utilized blocks are handled:
vb_alloc()
if (vb->free == 0)
list_del(vb->free_list);
So all allocations which are freed after that point stay unflushed until
the last allocation is freed which moves the block to the
purge_vmap_area_list, where it gets a full VA range flush.
IOW, for blocks on the free list this cares about unflushed mappings of
freed spaces, but for fully utilized blocks with freed spaces it does
obviously not matter, right?
So either we care about flushing the mappings of freed spaces or we do
not, but caring in one case and ignoring it in the other case is
inconsistent at best.
Thanks,
tglx
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC PATCH 3/3] mm/vmalloc.c: change _vm_unmap_aliases() to do purge firstly
2023-05-19 18:38 ` Thomas Gleixner
@ 2023-05-19 23:46 ` Baoquan He
2023-05-21 23:10 ` Thomas Gleixner
0 siblings, 1 reply; 75+ messages in thread
From: Baoquan He @ 2023-05-19 23:46 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Russell King (Oracle), Andrew Morton, linux-mm, Christoph Hellwig,
Uladzislau Rezki, Lorenzo Stoakes, Peter Zijlstra, John Ogness,
linux-arm-kernel, Mark Rutland, Marc Zyngier, x86
On 05/19/23 at 08:38pm, Thomas Gleixner wrote:
> On Fri, May 19 2023 at 20:03, Baoquan He wrote:
> > After vb_free() invocation, the va will be purged and put into purge
> > tree/list if the entire vmap_block is dirty. If not entirely dirty, the
> > vmap_block is still in percpu vmap_block_queue list, just like below two
> > graphs:
> >
> > (1)
> > |-----|------------|-----------|-------|
> > |dirty|still mapped| dirty | free |
> >
> > 2)
> > |------------------------------|-------|
> > | dirty | free |
> >
> > In the current _vm_unmap_aliases(), to reclaim those unmapped range and
> > flush, it will iterate percpu vbq to calculate the range from vmap_block
> > like above two cases. Then call purge_fragmented_blocks_allcpus()
> > to purge the vmap_block in case 2 since no mapping exists right now,
> > and put these purged vmap_block va into purge tree/list. Then in
> > __purge_vmap_area_lazy(), it will continue calculating the flush range
> > from purge list. Obviously, this will take vmap_block va in the 2nd case
> > into account twice.
>
> Which made me look deeper into purge_fragmented_blocks()
>
> list_for_each_entry_rcu(vb, &vbq->free, free_list) {
>
> if (!(vb->free + vb->dirty == VMAP_BBMAP_BITS && vb->dirty != VMAP_BBMAP_BITS))
> continue;
>
> spin_lock(&vb->lock);
> if (vb->free + vb->dirty == VMAP_BBMAP_BITS && vb->dirty != VMAP_BBMAP_BITS) {
>
> That means if an allocation does not find something in the free list
> then this can happen:
>
> vaddr = vb_alloc(size)
> vaddr = new_vmap_block(order, gfp_mask);
>
> vb_free(vaddr, size)
> vb->dirty = 1ULL << order;
>
> purge_fragmented_blocks()
> purge(most_recently_allocated_block);
>
> vaddr = vb_alloc(size)
> vaddr = new_vmap_block(order, gfp_mask);
>
> How does that make sense?
>
> That block would have hundreds of pages left and is the most recently
> allocated. So the next vb_alloc() has to reallocate one instead of using
> the one which was allocated just before.
>
> This clearly lacks a free space check so that blocks which have more
> free space than a certain threshold are not thrown away prematurely.
> Maybe it wants an age check too, so that blocks which are unused for a
> long time can be recycled, but that's an orthogonal issue.
You are right, the vmap_block alloc/free does have the issue you pointed
out here. What I can defend is that it should be fine if
VM_FLUSH_RESET_PERMS memory doesn't upset the situation. As we see, the
lazy flush will only be triggered when lazy_max_pages() is met, or
alloc_vmap_area() can't find an available range. If these two happens,
means we really need to flush and reclaim the unmapped area into free
list/tree since the vmalloc address space has run out. Even though the
vmap_block has mach free space left, still need be purged to cope with
an emergency.
So, if we pick VM_FLUSH_RESET_PERMS memory out and flush it alone, and
set a threshold for vmap_block purging, is it better?
>
>
> That aside your patch does still not address what I pointed out to you
> and what my patch cures:
>
> pages bits dirtymin dirtymax
> vb_alloc(A) 255 0 - 254 VMAP_BBMAP_BITS 0
> vb_alloc(B) 255 255 - 509 VMAP_BBMAP_BITS 0
> vb_alloc(C) 255 510 - 764 VMAP_BBMAP_BITS 0
> vb_alloc(D) 255 765 - 1020 VMAP_BBMAP_BITS 0
>
> The block stays on the free list because there are still 4 pages left
> and it stays there until either _all_ free space is used or _all_
> allocated space is freed.
>
> Now the first allocation gets freed:
>
> vb_free(A) 255 0 - 254 0 254
>
> From there on _every_ invocation of __purge_vmap_area_lazy() will see
> this range as long as the block is on the free list:
>
> list_for_each_entry_rcu(vb, &vbq->free, free_list) {
> spin_lock(&vb->lock);
> if (vb->dirty && vb->dirty != VMAP_BBMAP_BITS) {
>
> because this condition is true. So this flushes the same range over and
> over, no?
>
> This flush range gets larger over time the more allocations are freed up
> to the point where the block vanishes from the free list.
>
> By resetting vb->dirty_min/max the freed range is only flushed _once_,
> no? The resulting flush range might still be excessively large as I
> pointed out before:
>
> 1) Flush after freeing A
>
> vb_free(A) 2 0 - 1 0 1
> Flush VMAP_BBMAP_BITS 0 <- correct
> vb_free(C) 2 6 - 7 6 7
> Flush VMAP_BBMAP_BITS 0 <- correct
>
> 2) No flush between freeing A and C
>
> vb_free(A) 2 0 - 1 0 1
> vb_free(C) 2 6 - 7 0 7
> Flush VMAP_BBMAP_BITS 0 <- overbroad flush
>
> 3) No flush between freeing A, C, B
>
> vb_free(A) 2 0 - 1 0 1
> vb_free(B) 2 6 - 7 0 7
> vb_free(C) 2 2 - 5 0 7
> Flush VMAP_BBMAP_BITS 0 <- correct
>
> Obviously case 2 could be
>
> vb_free(A) 2 0 - 1 0 1
> vb_free(X) 2 1000 - 1001 1000 1001
>
> So that flush via purge_vmap_area_list() will ask to flush 1002 pages
> instead of 4, right? Again, that does not make sense.
Yes, I got your point now. I didn't read your cure code carefully, sorry
for that.
>
> The other issue I pointed out:
>
> Assume the block has (for simplicity) 255 allocations size of 4 pages,
> again free space of 4 pages.
>
> 254 allocations are freed, which means there is one remaining
> mapped. All 254 freed are flushed via __purge_vmap_area_lazy() over
> time.
>
> Now the last allocation is freed and the block is moved to the
> purge_vmap_area_list, which then does a full flush of the complete area,
> i.e. 4MB in that case, while in fact it only needs to flush 2 pages.
It's easy to fix. For vmap_block, I have marked it in va->flag with
VMAP_RAM|VMAP_BLOCK. When flushing va in purge list, we can skip
vmap_block va. I don't know how you will tackle the per va flush
Nadav pointed out, so I will not give a dtaft code.
>
>
> Also these intermediate flushes are inconsistent versus how fully
> utilized blocks are handled:
>
> vb_alloc()
> if (vb->free == 0)
> list_del(vb->free_list);
>
> So all allocations which are freed after that point stay unflushed until
> the last allocation is freed which moves the block to the
> purge_vmap_area_list, where it gets a full VA range flush.
That may be risky if stay unflushed until the last allocation is freed.
We use vm_map_ram() interface to map passed in pages into vmalloc area.
If vb_free() is called, the sub-region has been unmapped and user maybe
have released the pages. user of vm_unmap_aliases() may be impacted if
we don't flush those area freed with vb_free(). In reality, those areas
have been unmapped, while there's still TLB existing. Not very sure
about that.
If we can hold the vmap_block flush until purging it w/o risk, it will
save us many troubles.
>
> IOW, for blocks on the free list this cares about unflushed mappings of
> freed spaces, but for fully utilized blocks with freed spaces it does
> obviously not matter, right?
Yes, while depends on how we flush them. Flush them each time if there's
dirty, or hold the flush until purged if holding is allowed.
>
> So either we care about flushing the mappings of freed spaces or we do
> not, but caring in one case and ignoring it in the other case is
> inconsistent at best.
>
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC PATCH 3/3] mm/vmalloc.c: change _vm_unmap_aliases() to do purge firstly
2023-05-19 23:46 ` Baoquan He
@ 2023-05-21 23:10 ` Thomas Gleixner
2023-05-22 11:21 ` Baoquan He
0 siblings, 1 reply; 75+ messages in thread
From: Thomas Gleixner @ 2023-05-21 23:10 UTC (permalink / raw)
To: Baoquan He
Cc: Russell King (Oracle), Andrew Morton, linux-mm, Christoph Hellwig,
Uladzislau Rezki, Lorenzo Stoakes, Peter Zijlstra, John Ogness,
linux-arm-kernel, Mark Rutland, Marc Zyngier, x86, Nadav Amit
On Sat, May 20 2023 at 07:46, Baoquan He wrote:
> On 05/19/23 at 08:38pm, Thomas Gleixner wrote:
>> That block would have hundreds of pages left and is the most recently
>> allocated. So the next vb_alloc() has to reallocate one instead of using
>> the one which was allocated just before.
>>
>> This clearly lacks a free space check so that blocks which have more
>> free space than a certain threshold are not thrown away prematurely.
>> Maybe it wants an age check too, so that blocks which are unused for a
>> long time can be recycled, but that's an orthogonal issue.
>
> You are right, the vmap_block alloc/free does have the issue you pointed
> out here. What I can defend is that it should be fine if
> VM_FLUSH_RESET_PERMS memory doesn't upset the situation. As we see, the
> lazy flush will only be triggered when lazy_max_pages() is met, or
> alloc_vmap_area() can't find an available range. If these two happens,
> means we really need to flush and reclaim the unmapped area into free
> list/tree since the vmalloc address space has run out. Even though the
> vmap_block has mach free space left, still need be purged to cope with
> an emergency.
>
> So, if we pick VM_FLUSH_RESET_PERMS memory out and flush it alone, and
That might be counterproductive depending on the usage scenario
especially as that BPF filtering is gaining traction.
> set a threshold for vmap_block purging, is it better?
The point is that there is no differentiation between:
1) Purging freed areas, i.e. when the vmap_layz_nr hits the threshold,
from drain_vmap_area_work()
2) Reclaiming vmalloc address space from pcpu_get_vm_areas()
3) _unmap_aliases()
#1 You really don't want to reclaim vmap blocks from the per cpu free
lists, unless they have only a few free pages left and no active
mappings.
There is no reason to zap vmap blocks unconditionally, simply because
reclaiming all lazy areas drains at least
32MB * fls(num_online_cpus())
per invocation which is plenty, no?
#2 You might want to try #1 first and if that does not help then reclaim
all vmap blocks on the per cpu lists which have no active mappings
unconditionally.
#3 _unmap_aliases() requires to touch everything because the caller has
no clue which vmap_area used a particular page and the vmap_area lost
that information too.
Except for the vfree + VM_FLUSH_RESET_PERMS case, which removes the
vmap area first and then cares about the flush. That in turn requires
a full walk of _all_ vmap areas including the one which was just
added to the purge list.
I can see the point that this is used to combine outstanding TLB
flushes and do the housekeeping of purging freed areas, but like #1
there is no real good reason to zap usable vmap blocks
unconditionally.
I'm all for consolidating functionality, but that swiss army knife
approach of one fits it all does not make sense to me.
>> The other issue I pointed out:
>>
>> Assume the block has (for simplicity) 255 allocations size of 4 pages,
>> again free space of 4 pages.
>>
>> 254 allocations are freed, which means there is one remaining
>> mapped. All 254 freed are flushed via __purge_vmap_area_lazy() over
>> time.
>>
>> Now the last allocation is freed and the block is moved to the
>> purge_vmap_area_list, which then does a full flush of the complete area,
>> i.e. 4MB in that case, while in fact it only needs to flush 2 pages.
>
> It's easy to fix. For vmap_block, I have marked it in va->flag with
> VMAP_RAM|VMAP_BLOCK. When flushing va in purge list, we can skip
> vmap_block va.
How can you skip that? The last 2 pages in that VA still need to be
flushed, no?
But VA has no information about already done partial flushes via the
vmap_block, so this requires flushing the full VA range unconditionally.
That made me think about the following scenario:
vm_map_ram()
vb_alloc()
// Uses page P
vb->free -= 1UL << order;
if (vb->free == 0)
list_del_rcu(&vb->free_list);
vm_unmap_ram()
vb_free()
Does not free the last mapping of the vmap_block
Clears the page table entry for page P, but does not flush TLB
__free_page(P)
page P gets reused
vm_unmap_aliases()
Does not find the VA which used page P because neither the VB is in
free_list nor the VA is in the purge_list
Unless _vm_unmap_aliases() finds a large enough range to cover
page P or ends up with a flush_tlb_all() the stale TLB(s) persists.
No?
Same problem in this scenario:
addr = vm_map_ram()
vb_alloc()
vb->free -= 1UL << order;
if (vb->free == 0)
list_del_rcu(&vb->free_list);
set_memory_*(addr)
vm_unmap_aliases()
Does not find the VA which contains @addr because neither the VB is in
free_list nor the VA is in the purge_list
Unless _vm_unmap_aliases() finds a large enough range to cover
@addr or ends up with a flush_tlb_all() the stale TLB(s) persists.
No?
> I don't know how you will tackle the per va flush Nadav pointed out,
> so I will not give a dtaft code.
It's not that hard.
An array of ranges which can be memcpy()'ed into that existing x86
percpu flush info thing, which avoids the cache line issue Nadav is
concerned of.
That array would have obviously a limited number of entries as anything
with multiple ranges is most likely going to end up in a full flush
anyway. The drain_vmap_area_work() case definitely does so as
vmap_lazy_nr is guaranteed to be at least covering 32MB/PAGE_SIZE worth
of pages.
I just got distracted and did not come around to finish it for various
reasons. One of them is to make sense of this code :)
>> Also these intermediate flushes are inconsistent versus how fully
>> utilized blocks are handled:
>>
>> vb_alloc()
>> if (vb->free == 0)
>> list_del(vb->free_list);
>>
>> So all allocations which are freed after that point stay unflushed until
>> the last allocation is freed which moves the block to the
>> purge_vmap_area_list, where it gets a full VA range flush.
>
> That may be risky if stay unflushed until the last allocation is freed.
> We use vm_map_ram() interface to map passed in pages into vmalloc area.
> If vb_free() is called, the sub-region has been unmapped and user maybe
> have released the pages. user of vm_unmap_aliases() may be impacted if
> we don't flush those area freed with vb_free(). In reality, those areas
> have been unmapped, while there's still TLB existing. Not very sure
> about that.
>
> If we can hold the vmap_block flush until purging it w/o risk, it will
> save us many troubles.
The point is that the TLBs _must_ be flushed
1) Before the virtual address space occupied by a vmap_area is
released and can be reused.
So one might argue that flushing TLBs early is good to detect use
after free.
The DEBUG_PAGE_ALLOC case flushes right after
vunmap_range_noflush(), which is the proper thing to do for
debugging as any UAF will be immediately detected after the
function returns.
The production case flushes lazy and fully utilized blocks are not
on the free list and therefore not flushed until they are
purged.
The partial flush via _vm_unmap_aliases() solves a completely
different problem. See #2/#3 below
2) Before a page, which might have been reused before the lazy flush
happened, can be used again. Also when the mapping attributes of a
page are changed via set_memory_*()
That's what vm_unmap_aliases() -> _vm_unmap_aliases() is for.
3) When an executable mapping is torn down
That's the vfree() + VM_FLUSH_RESET_PERMS case which also ends up
in _vm_unmap_aliases()
I completely understand that there needs to be a balance between the
high utilization use case and the occasional one which made us look into
this. That's fine, but there needs to be a way to make it least overhead
for both.
/me tries to find some spare cycles ...
Thanks,
tglx
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC PATCH 3/3] mm/vmalloc.c: change _vm_unmap_aliases() to do purge firstly
2023-05-21 23:10 ` Thomas Gleixner
@ 2023-05-22 11:21 ` Baoquan He
2023-05-22 12:02 ` Thomas Gleixner
0 siblings, 1 reply; 75+ messages in thread
From: Baoquan He @ 2023-05-22 11:21 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Russell King (Oracle), Andrew Morton, linux-mm, Christoph Hellwig,
Uladzislau Rezki, Lorenzo Stoakes, Peter Zijlstra, John Ogness,
linux-arm-kernel, Mark Rutland, Marc Zyngier, x86, Nadav Amit
On 05/22/23 at 01:10am, Thomas Gleixner wrote:
> On Sat, May 20 2023 at 07:46, Baoquan He wrote:
> > On 05/19/23 at 08:38pm, Thomas Gleixner wrote:
> >> That block would have hundreds of pages left and is the most recently
> >> allocated. So the next vb_alloc() has to reallocate one instead of using
> >> the one which was allocated just before.
> >>
> >> This clearly lacks a free space check so that blocks which have more
> >> free space than a certain threshold are not thrown away prematurely.
> >> Maybe it wants an age check too, so that blocks which are unused for a
> >> long time can be recycled, but that's an orthogonal issue.
> >
> > You are right, the vmap_block alloc/free does have the issue you pointed
> > out here. What I can defend is that it should be fine if
> > VM_FLUSH_RESET_PERMS memory doesn't upset the situation. As we see, the
> > lazy flush will only be triggered when lazy_max_pages() is met, or
> > alloc_vmap_area() can't find an available range. If these two happens,
> > means we really need to flush and reclaim the unmapped area into free
> > list/tree since the vmalloc address space has run out. Even though the
> > vmap_block has mach free space left, still need be purged to cope with
> > an emergency.
> >
> > So, if we pick VM_FLUSH_RESET_PERMS memory out and flush it alone, and
>
> That might be counterproductive depending on the usage scenario
> especially as that BPF filtering is gaining traction.
>
> > set a threshold for vmap_block purging, is it better?
>
> The point is that there is no differentiation between:
>
> 1) Purging freed areas, i.e. when the vmap_layz_nr hits the threshold,
> from drain_vmap_area_work()
>
> 2) Reclaiming vmalloc address space from pcpu_get_vm_areas()
>
> 3) _unmap_aliases()
>
> #1 You really don't want to reclaim vmap blocks from the per cpu free
> lists, unless they have only a few free pages left and no active
> mappings.
>
> There is no reason to zap vmap blocks unconditionally, simply because
> reclaiming all lazy areas drains at least
>
> 32MB * fls(num_online_cpus())
>
> per invocation which is plenty, no?
>
> #2 You might want to try #1 first and if that does not help then reclaim
> all vmap blocks on the per cpu lists which have no active mappings
> unconditionally.
>
> #3 _unmap_aliases() requires to touch everything because the caller has
> no clue which vmap_area used a particular page and the vmap_area lost
> that information too.
>
> Except for the vfree + VM_FLUSH_RESET_PERMS case, which removes the
> vmap area first and then cares about the flush. That in turn requires
> a full walk of _all_ vmap areas including the one which was just
> added to the purge list.
>
> I can see the point that this is used to combine outstanding TLB
> flushes and do the housekeeping of purging freed areas, but like #1
> there is no real good reason to zap usable vmap blocks
> unconditionally.
>
> I'm all for consolidating functionality, but that swiss army knife
> approach of one fits it all does not make sense to me.
OK, got it.
>
> >> The other issue I pointed out:
> >>
> >> Assume the block has (for simplicity) 255 allocations size of 4 pages,
> >> again free space of 4 pages.
> >>
> >> 254 allocations are freed, which means there is one remaining
> >> mapped. All 254 freed are flushed via __purge_vmap_area_lazy() over
> >> time.
> >>
> >> Now the last allocation is freed and the block is moved to the
> >> purge_vmap_area_list, which then does a full flush of the complete area,
> >> i.e. 4MB in that case, while in fact it only needs to flush 2 pages.
Agree, it sounds problematic. With your cure code, we can check that
when we flush per va of purge list. If we can keep vmap_block() not
freed until it's checked and freed in purge list, we can flush the last
two pages of range in __purge_vmap_area_lazy(), please see next comment
about how we could do to fix it.
> >
> > It's easy to fix. For vmap_block, I have marked it in va->flag with
> > VMAP_RAM|VMAP_BLOCK. When flushing va in purge list, we can skip
> > vmap_block va.
>
> How can you skip that? The last 2 pages in that VA still need to be
> flushed, no?
>
> But VA has no information about already done partial flushes via the
> vmap_block, so this requires flushing the full VA range unconditionally.
Right, I see the problem you spotted with the illutrations. For the
last 2 pages in that VA, we can keep vmap_block until we iterate the
purge list in __purge_vmap_area_lazy(). Since you will iterate the
purge list to add each va range to an array, we can do like below
draft code to fix it. Surely this is on top of your cure code.
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 5ca55b357148..4b11a32df49d 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1728,6 +1728,7 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
unsigned int num_purged_areas = 0;
struct list_head local_purge_list;
struct vmap_area *va, *n_va;
+ struct vmap_block vb;
lockdep_assert_held(&vmap_purge_lock);
@@ -1736,6 +1737,14 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
list_replace_init(&purge_vmap_area_list, &local_purge_list);
spin_unlock(&purge_vmap_area_lock);
+ vb = container_of(va, struct vmap_block, va);
+ if ((va->flags & VMAP_FLAGS_MASK == VMAP_RAM|VMAP_BLOCK) &&
+ (vb->dirty_max)) {
+ s = vb->dirty_min << PAGE_SHIFT;
+ e = vb->dirty_max << PAGE_SHIFT;
+ kfree(vb);
+ }
+
if (unlikely(list_empty(&local_purge_list)))
goto out;
@@ -2083,7 +2092,6 @@ static void free_vmap_block(struct vmap_block *vb)
spin_unlock(&vmap_area_lock);
free_vmap_area_noflush(vb->va);
- kfree_rcu(vb, rcu_head);
}
static void purge_fragmented_blocks(int cpu)
@@ -2163,11 +2171,6 @@ static void *vb_alloc(unsigned long size, gfp_t gfp_mask)
vaddr = vmap_block_vaddr(vb->va->va_start, pages_off);
vb->free -= 1UL << order;
bitmap_set(vb->used_map, pages_off, (1UL << order));
- if (vb->free == 0) {
- spin_lock(&vbq->lock);
- list_del_rcu(&vb->free_list);
- spin_unlock(&vbq->lock);
- }
spin_unlock(&vb->lock);
break;
>
> That made me think about the following scenario:
>
> vm_map_ram()
> vb_alloc()
> // Uses page P
> vb->free -= 1UL << order;
> if (vb->free == 0)
> list_del_rcu(&vb->free_list);
>
> vm_unmap_ram()
> vb_free()
> Does not free the last mapping of the vmap_block
> Clears the page table entry for page P, but does not flush TLB
>
> __free_page(P)
>
> page P gets reused
>
> vm_unmap_aliases()
>
> Does not find the VA which used page P because neither the VB is in
> free_list nor the VA is in the purge_list
>
> Unless _vm_unmap_aliases() finds a large enough range to cover
> page P or ends up with a flush_tlb_all() the stale TLB(s) persists.
>
> No?
>
> Same problem in this scenario:
>
> addr = vm_map_ram()
> vb_alloc()
> vb->free -= 1UL << order;
> if (vb->free == 0)
> list_del_rcu(&vb->free_list);
>
> set_memory_*(addr)
> vm_unmap_aliases()
>
> Does not find the VA which contains @addr because neither the VB is in
> free_list nor the VA is in the purge_list
>
> Unless _vm_unmap_aliases() finds a large enough range to cover
> @addr or ends up with a flush_tlb_all() the stale TLB(s) persists.
>
> No?
>
> > I don't know how you will tackle the per va flush Nadav pointed out,
> > so I will not give a dtaft code.
>
> It's not that hard.
>
> An array of ranges which can be memcpy()'ed into that existing x86
> percpu flush info thing, which avoids the cache line issue Nadav is
> concerned of.
>
> That array would have obviously a limited number of entries as anything
> with multiple ranges is most likely going to end up in a full flush
> anyway. The drain_vmap_area_work() case definitely does so as
> vmap_lazy_nr is guaranteed to be at least covering 32MB/PAGE_SIZE worth
> of pages.
>
> I just got distracted and did not come around to finish it for various
> reasons. One of them is to make sense of this code :)
OK, looking forward to seeing it done when it's convenient to you.
>
> >> Also these intermediate flushes are inconsistent versus how fully
> >> utilized blocks are handled:
> >>
> >> vb_alloc()
> >> if (vb->free == 0)
> >> list_del(vb->free_list);
> >>
> >> So all allocations which are freed after that point stay unflushed until
> >> the last allocation is freed which moves the block to the
> >> purge_vmap_area_list, where it gets a full VA range flush.
> >
> > That may be risky if stay unflushed until the last allocation is freed.
> > We use vm_map_ram() interface to map passed in pages into vmalloc area.
> > If vb_free() is called, the sub-region has been unmapped and user maybe
> > have released the pages. user of vm_unmap_aliases() may be impacted if
> > we don't flush those area freed with vb_free(). In reality, those areas
> > have been unmapped, while there's still TLB existing. Not very sure
> > about that.
> >
> > If we can hold the vmap_block flush until purging it w/o risk, it will
> > save us many troubles.
>
> The point is that the TLBs _must_ be flushed
>
> 1) Before the virtual address space occupied by a vmap_area is
> released and can be reused.
>
> So one might argue that flushing TLBs early is good to detect use
> after free.
>
> The DEBUG_PAGE_ALLOC case flushes right after
> vunmap_range_noflush(), which is the proper thing to do for
> debugging as any UAF will be immediately detected after the
> function returns.
>
> The production case flushes lazy and fully utilized blocks are not
> on the free list and therefore not flushed until they are
> purged.
>
> The partial flush via _vm_unmap_aliases() solves a completely
> different problem. See #2/#3 below
>
> 2) Before a page, which might have been reused before the lazy flush
> happened, can be used again. Also when the mapping attributes of a
> page are changed via set_memory_*()
>
> That's what vm_unmap_aliases() -> _vm_unmap_aliases() is for.
>
> 3) When an executable mapping is torn down
>
> That's the vfree() + VM_FLUSH_RESET_PERMS case which also ends up
> in _vm_unmap_aliases()
I see it now, thanks.
>
> I completely understand that there needs to be a balance between the
> high utilization use case and the occasional one which made us look into
> this. That's fine, but there needs to be a way to make it least overhead
> for both.
Agree. Based on all discussions, it should be much better if below three
things are done. I haven't considered if adding dirty bitmap in
vmap_block is necessary and acceptable.
1)A threshold for vmap_block purging, e.g 3/4 or 2/3 of VMAP_BBMAP_BITS;
2)your cure page + above draft code I just pasted;
3)per va range adding to an array and passed to flush_tlb_kernel_xx();
>
> /me tries to find some spare cycles ...
>
> Thanks,
>
> tglx
>
>
>
>
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply related [flat|nested] 75+ messages in thread
* Re: [RFC PATCH 3/3] mm/vmalloc.c: change _vm_unmap_aliases() to do purge firstly
2023-05-22 11:21 ` Baoquan He
@ 2023-05-22 12:02 ` Thomas Gleixner
2023-05-22 14:34 ` Baoquan He
0 siblings, 1 reply; 75+ messages in thread
From: Thomas Gleixner @ 2023-05-22 12:02 UTC (permalink / raw)
To: Baoquan He
Cc: Russell King (Oracle), Andrew Morton, linux-mm, Christoph Hellwig,
Uladzislau Rezki, Lorenzo Stoakes, Peter Zijlstra, John Ogness,
linux-arm-kernel, Mark Rutland, Marc Zyngier, x86, Nadav Amit
On Mon, May 22 2023 at 19:21, Baoquan He wrote:
> On 05/22/23 at 01:10am, Thomas Gleixner wrote:
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 5ca55b357148..4b11a32df49d 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -1728,6 +1728,7 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
> unsigned int num_purged_areas = 0;
> struct list_head local_purge_list;
> struct vmap_area *va, *n_va;
> + struct vmap_block vb;
>
> lockdep_assert_held(&vmap_purge_lock);
>
> @@ -1736,6 +1737,14 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
> list_replace_init(&purge_vmap_area_list, &local_purge_list);
> spin_unlock(&purge_vmap_area_lock);
>
> + vb = container_of(va, struct vmap_block, va);
This cannot work vmap_area is not embedded in vmap_block. vmap_block::va
is a pointer. vmap_area does not link back to vmap_block, so there is no
way to find it based on a vmap_area.
Aside of that va is not initialized here :)
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC PATCH 3/3] mm/vmalloc.c: change _vm_unmap_aliases() to do purge firstly
2023-05-22 12:02 ` Thomas Gleixner
@ 2023-05-22 14:34 ` Baoquan He
2023-05-22 20:21 ` Thomas Gleixner
0 siblings, 1 reply; 75+ messages in thread
From: Baoquan He @ 2023-05-22 14:34 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Russell King (Oracle), Andrew Morton, linux-mm, Christoph Hellwig,
Uladzislau Rezki, Lorenzo Stoakes, Peter Zijlstra, John Ogness,
linux-arm-kernel, Mark Rutland, Marc Zyngier, x86, Nadav Amit
On 05/22/23 at 02:02pm, Thomas Gleixner wrote:
> On Mon, May 22 2023 at 19:21, Baoquan He wrote:
> > On 05/22/23 at 01:10am, Thomas Gleixner wrote:
> >
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index 5ca55b357148..4b11a32df49d 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -1728,6 +1728,7 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
> > unsigned int num_purged_areas = 0;
> > struct list_head local_purge_list;
> > struct vmap_area *va, *n_va;
> > + struct vmap_block vb;
> >
> > lockdep_assert_held(&vmap_purge_lock);
> >
> > @@ -1736,6 +1737,14 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
> > list_replace_init(&purge_vmap_area_list, &local_purge_list);
> > spin_unlock(&purge_vmap_area_lock);
> >
> > + vb = container_of(va, struct vmap_block, va);
>
> This cannot work vmap_area is not embedded in vmap_block. vmap_block::va
> is a pointer. vmap_area does not link back to vmap_block, so there is no
> way to find it based on a vmap_area.
Oh, the code is buggy. va->flags can tell if it's vmap_block, then we
can deduce the vb pointer.
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 5ca55b357148..73d6ce441351 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1728,6 +1728,7 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
unsigned int num_purged_areas = 0;
struct list_head local_purge_list;
struct vmap_area *va, *n_va;
+ struct vmap_block vb;
lockdep_assert_held(&vmap_purge_lock);
@@ -1736,6 +1737,15 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
list_replace_init(&purge_vmap_area_list, &local_purge_list);
spin_unlock(&purge_vmap_area_lock);
+ if (va->flags & VMAP_FLAGS_MASK == VMAP_RAM|VMAP_BLOCK) {
+ vb = container_of(va, struct vmap_block, va);
+ if (vb->dirty_max) {
/*This is pseudo code for illustration*/
+ s = vb->dirty_min << PAGE_SHIFT;
+ e = vb->dirty_max << PAGE_SHIFT;
+ }
+ kfree(vb);
+ }
+
if (unlikely(list_empty(&local_purge_list)))
goto out;
@@ -2083,7 +2093,6 @@ static void free_vmap_block(struct vmap_block *vb)
spin_unlock(&vmap_area_lock);
free_vmap_area_noflush(vb->va);
- kfree_rcu(vb, rcu_head);
}
>
> Aside of that va is not initialized here :)
Oh, this is not real code, just to illustrate how it can calculate and
flush the last two pages of vmap_block. If you have the per va flushing
via array patch, I can work out formal code change based on that.
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply related [flat|nested] 75+ messages in thread
* Re: [RFC PATCH 3/3] mm/vmalloc.c: change _vm_unmap_aliases() to do purge firstly
2023-05-22 14:34 ` Baoquan He
@ 2023-05-22 20:21 ` Thomas Gleixner
2023-05-22 20:44 ` Thomas Gleixner
2023-05-23 9:35 ` Baoquan He
0 siblings, 2 replies; 75+ messages in thread
From: Thomas Gleixner @ 2023-05-22 20:21 UTC (permalink / raw)
To: Baoquan He
Cc: Russell King (Oracle), Andrew Morton, linux-mm, Christoph Hellwig,
Uladzislau Rezki, Lorenzo Stoakes, Peter Zijlstra, John Ogness,
linux-arm-kernel, Mark Rutland, Marc Zyngier, x86, Nadav Amit
On Mon, May 22 2023 at 22:34, Baoquan He wrote:
> On 05/22/23 at 02:02pm, Thomas Gleixner wrote:
>> > @@ -1736,6 +1737,14 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
>> > list_replace_init(&purge_vmap_area_list, &local_purge_list);
>> > spin_unlock(&purge_vmap_area_lock);
>> >
>> > + vb = container_of(va, struct vmap_block, va);
>>
>> This cannot work vmap_area is not embedded in vmap_block. vmap_block::va
>> is a pointer. vmap_area does not link back to vmap_block, so there is no
>> way to find it based on a vmap_area.
>
> Oh, the code is buggy. va->flags can tell if it's vmap_block, then we
> can deduce the vb pointer.
No. It _CANNOT_ work whether you check the flags or not.
struct foo {
.....
struct bar bar;
};
container_of(ptr_to_bar, struct foo, bar) returns the pointer to the
struct foo which has struct bar embedded.
But
struct foo {
.....
struct bar *bar;
};
cannot do that because ptr_to_bar points to some object which is
completely disconnected from struct foo.
Care to look at the implementation of container_of()?
Here is what it boils down to:
void *member_pointer = bar;
p = (struct foo *)(member_pointer - offsetof(struct foo, bar);
So it uses the pointer to bar and subtracts the offset of bar in struct
foo. This obviously can only work when struct bar is embedded in struct
foo.
Lets assume that *bar is the first member of foo, i.e. offset of *bar in
struct foo is 0
p = (struct foo *)(member_pointer - 0);
So you end up with
p == member_pointer == bar
But you won't get there because the static_assert() in container_of()
will catch that and the compiler will tell you in colourful ways.
Once the vmap area is handed over for cleaning up the vmap block is gone
and even if you let it stay around then the vmap area does not have any
information where to find the block.
You'd need to have a pointer to the vmap block in vmap area or embed
vmap area into vmap block.
See?
Thanks,
tglx
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC PATCH 3/3] mm/vmalloc.c: change _vm_unmap_aliases() to do purge firstly
2023-05-22 20:21 ` Thomas Gleixner
@ 2023-05-22 20:44 ` Thomas Gleixner
2023-05-23 9:35 ` Baoquan He
1 sibling, 0 replies; 75+ messages in thread
From: Thomas Gleixner @ 2023-05-22 20:44 UTC (permalink / raw)
To: Baoquan He
Cc: Russell King (Oracle), Andrew Morton, linux-mm, Christoph Hellwig,
Uladzislau Rezki, Lorenzo Stoakes, Peter Zijlstra, John Ogness,
linux-arm-kernel, Mark Rutland, Marc Zyngier, x86, Nadav Amit
On Mon, May 22 2023 at 22:21, Thomas Gleixner wrote:
> Lets assume that *bar is the first member of foo, i.e. offset of *bar in
> struct foo is 0
>
> p = (struct foo *)(member_pointer - 0);
>
> So you end up with
>
> p == member_pointer == bar
>
> But you won't get there because the static_assert() in container_of()
> will catch that and the compiler will tell you in colourful ways.
>
> Once the vmap area is handed over for cleaning up the vmap block is gone
> and even if you let it stay around then the vmap area does not have any
> information where to find the block.
>
> You'd need to have a pointer to the vmap block in vmap area or embed
> vmap area into vmap block.
The latter would require to:
- split alloc_vmap_area() apart
- sprinkle 'if (vmap_area_is_vmap_block(va))' all over the place
- do a lot of other nasty things
Not sure if that's worth it. There are some other options to pursue.
Thanks,
tglx
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC PATCH 3/3] mm/vmalloc.c: change _vm_unmap_aliases() to do purge firstly
2023-05-22 20:21 ` Thomas Gleixner
2023-05-22 20:44 ` Thomas Gleixner
@ 2023-05-23 9:35 ` Baoquan He
1 sibling, 0 replies; 75+ messages in thread
From: Baoquan He @ 2023-05-23 9:35 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Russell King (Oracle), Andrew Morton, linux-mm, Christoph Hellwig,
Uladzislau Rezki, Lorenzo Stoakes, Peter Zijlstra, John Ogness,
linux-arm-kernel, Mark Rutland, Marc Zyngier, x86, Nadav Amit
On 05/22/23 at 10:21pm, Thomas Gleixner wrote:
> On Mon, May 22 2023 at 22:34, Baoquan He wrote:
> > On 05/22/23 at 02:02pm, Thomas Gleixner wrote:
> >> > @@ -1736,6 +1737,14 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
> >> > list_replace_init(&purge_vmap_area_list, &local_purge_list);
> >> > spin_unlock(&purge_vmap_area_lock);
> >> >
> >> > + vb = container_of(va, struct vmap_block, va);
> >>
> >> This cannot work vmap_area is not embedded in vmap_block. vmap_block::va
> >> is a pointer. vmap_area does not link back to vmap_block, so there is no
> >> way to find it based on a vmap_area.
> >
> > Oh, the code is buggy. va->flags can tell if it's vmap_block, then we
> > can deduce the vb pointer.
>
> No. It _CANNOT_ work whether you check the flags or not.
>
> struct foo {
> .....
> struct bar bar;
> };
>
> container_of(ptr_to_bar, struct foo, bar) returns the pointer to the
> struct foo which has struct bar embedded.
>
> But
>
> struct foo {
> .....
> struct bar *bar;
> };
>
> cannot do that because ptr_to_bar points to some object which is
> completely disconnected from struct foo.
>
> Care to look at the implementation of container_of()?
>
> Here is what it boils down to:
>
> void *member_pointer = bar;
>
> p = (struct foo *)(member_pointer - offsetof(struct foo, bar);
>
> So it uses the pointer to bar and subtracts the offset of bar in struct
> foo. This obviously can only work when struct bar is embedded in struct
> foo.
>
> Lets assume that *bar is the first member of foo, i.e. offset of *bar in
> struct foo is 0
>
> p = (struct foo *)(member_pointer - 0);
>
> So you end up with
>
> p == member_pointer == bar
>
> But you won't get there because the static_assert() in container_of()
> will catch that and the compiler will tell you in colourful ways.
Thanks a lot, learn it now. I never noticed container_of() is not
suitable for pointer member of struct case.
>
> Once the vmap area is handed over for cleaning up the vmap block is gone
> and even if you let it stay around then the vmap area does not have any
> information where to find the block.
>
> You'd need to have a pointer to the vmap block in vmap area or embed
> vmap area into vmap block.
Got it now. Embedding vmap_area into vmap_block seems not feasible
because va need be reused when inserting into free_vmap_area_root/list.
Adding a pointer to vmap_block looks do-able.
Since vm_map_ram area doesn't have vm_struct associated with it, we can
reuse the space of '->vm' to add vb pointer like below. Since in the
existing code there are places where we use 'if(!va->vm)' to check if
it's a normal va, we need be careful to find all of them out and replace
with new and tighter checking. Will give a draft code change after all
is done and testing is passed.
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index c720be70c8dd..e2ba6d59d679 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -15,6 +15,7 @@
struct vm_area_struct; /* vma defining user mapping in mm_types.h */
struct notifier_block; /* in notifier.h */
struct iov_iter; /* in uio.h */
+struct vmap_block; /* in mm/vmalloc.c */
/* bits in flags of vmalloc's vm_struct below */
#define VM_IOREMAP 0x00000001 /* ioremap() and friends */
@@ -76,6 +77,7 @@ struct vmap_area {
union {
unsigned long subtree_max_size; /* in "free" tree */
struct vm_struct *vm; /* in "busy" tree */
+ struct vmap_block *vb; /* in "busy and purge" tree */
};
unsigned long flags; /* mark type of vm_map_ram area */
};
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index c0f80982eb06..d97343271e27 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2061,6 +2061,10 @@ static void *new_vmap_block(unsigned int order, gfp_t gfp_mask)
return ERR_PTR(err);
}
+ spin_lock(&vmap_area_lock);
+ va->vb = vb;
+ spin_unlock(&vmap_area_lock);
+
vbq = raw_cpu_ptr(&vmap_block_queue);
spin_lock(&vbq->lock);
list_add_tail_rcu(&vb->free_list, &vbq->free);
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply related [flat|nested] 75+ messages in thread
end of thread, other threads:[~2023-05-23 9:36 UTC | newest]
Thread overview: 75+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-05-15 16:43 Excessive TLB flush ranges Thomas Gleixner
2023-05-15 16:59 ` Russell King (Oracle)
2023-05-15 19:46 ` Thomas Gleixner
2023-05-15 21:11 ` Thomas Gleixner
2023-05-15 21:31 ` Russell King (Oracle)
2023-05-16 6:37 ` Thomas Gleixner
2023-05-16 6:46 ` Thomas Gleixner
2023-05-16 8:18 ` Thomas Gleixner
2023-05-16 8:20 ` Thomas Gleixner
2023-05-16 8:27 ` Russell King (Oracle)
2023-05-16 9:03 ` Thomas Gleixner
2023-05-16 10:05 ` Baoquan He
2023-05-16 14:21 ` Thomas Gleixner
2023-05-16 19:03 ` Thomas Gleixner
2023-05-17 9:38 ` Thomas Gleixner
2023-05-17 10:52 ` Baoquan He
2023-05-19 11:22 ` Thomas Gleixner
2023-05-19 11:49 ` Baoquan He
2023-05-19 14:13 ` Thomas Gleixner
2023-05-19 12:01 ` [RFC PATCH 1/3] mm/vmalloc.c: try to flush vmap_area one by one Baoquan He
2023-05-19 14:16 ` Thomas Gleixner
2023-05-19 12:02 ` [RFC PATCH 2/3] mm/vmalloc.c: Only flush VM_FLUSH_RESET_PERMS area immediately Baoquan He
2023-05-19 12:03 ` [RFC PATCH 3/3] mm/vmalloc.c: change _vm_unmap_aliases() to do purge firstly Baoquan He
2023-05-19 14:17 ` Thomas Gleixner
2023-05-19 18:38 ` Thomas Gleixner
2023-05-19 23:46 ` Baoquan He
2023-05-21 23:10 ` Thomas Gleixner
2023-05-22 11:21 ` Baoquan He
2023-05-22 12:02 ` Thomas Gleixner
2023-05-22 14:34 ` Baoquan He
2023-05-22 20:21 ` Thomas Gleixner
2023-05-22 20:44 ` Thomas Gleixner
2023-05-23 9:35 ` Baoquan He
2023-05-19 13:49 ` Excessive TLB flush ranges Thomas Gleixner
2023-05-16 8:21 ` Russell King (Oracle)
2023-05-16 8:19 ` Russell King (Oracle)
2023-05-16 8:44 ` Thomas Gleixner
2023-05-16 8:48 ` Russell King (Oracle)
2023-05-16 12:09 ` Thomas Gleixner
2023-05-16 13:42 ` Uladzislau Rezki
2023-05-16 14:38 ` Thomas Gleixner
2023-05-16 15:01 ` Uladzislau Rezki
2023-05-16 17:04 ` Thomas Gleixner
2023-05-17 11:26 ` Uladzislau Rezki
2023-05-17 11:58 ` Thomas Gleixner
2023-05-17 12:15 ` Uladzislau Rezki
2023-05-17 16:32 ` Thomas Gleixner
2023-05-19 10:01 ` Uladzislau Rezki
2023-05-19 14:56 ` Thomas Gleixner
2023-05-19 15:14 ` Uladzislau Rezki
2023-05-19 16:32 ` Thomas Gleixner
2023-05-19 17:02 ` Uladzislau Rezki
2023-05-16 17:56 ` Nadav Amit
2023-05-16 19:32 ` Thomas Gleixner
2023-05-17 0:23 ` Thomas Gleixner
2023-05-17 1:23 ` Nadav Amit
2023-05-17 10:31 ` Thomas Gleixner
2023-05-17 11:47 ` Thomas Gleixner
2023-05-17 22:41 ` Nadav Amit
2023-05-17 14:43 ` Mark Rutland
2023-05-17 16:41 ` Thomas Gleixner
2023-05-17 22:57 ` Nadav Amit
2023-05-19 11:49 ` Thomas Gleixner
2023-05-17 12:12 ` Russell King (Oracle)
2023-05-17 23:14 ` Nadav Amit
2023-05-15 18:17 ` Uladzislau Rezki
2023-05-16 2:26 ` Baoquan He
2023-05-16 6:40 ` Thomas Gleixner
2023-05-16 8:07 ` Baoquan He
2023-05-16 8:10 ` Baoquan He
2023-05-16 8:45 ` Russell King (Oracle)
2023-05-16 9:13 ` Thomas Gleixner
2023-05-16 8:54 ` Thomas Gleixner
2023-05-16 9:48 ` Baoquan He
2023-05-15 20:02 ` Nadav Amit
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).