linux-mm.kvack.org archive mirror
* KASAN-related VMAP allocation errors in debug kernels with many logical CPUs
@ 2022-10-06 13:46 David Hildenbrand
  2022-10-06 15:35 ` Uladzislau Rezki
  0 siblings, 1 reply; 10+ messages in thread
From: David Hildenbrand @ 2022-10-06 13:46 UTC (permalink / raw)
  To: Alexander Potapenko, Uladzislau Rezki (Sony), Andrey Konovalov
  Cc: linux-mm@kvack.org, Andrey Ryabinin, Dmitry Vyukov,
	Vincenzo Frascino, kasan-dev

Hi,

we're currently hitting a weird vmap issue in debug kernels with KASAN enabled
on fairly large VMs. I reproduced it on v5.19 (did not get the chance to
try 6.0 yet because I don't have access to the machine right now, but
I suspect it persists).

It seems to trigger when udev probes a massive amount of devices in parallel
while the system is booting up. Once the system booted, I no longer see any
such issues.


[  165.818200] vmap allocation for size 2498560 failed: use vmalloc=<size> to increase size
[  165.836622] vmap allocation for size 315392 failed: use vmalloc=<size> to increase size
[  165.837461] vmap allocation for size 315392 failed: use vmalloc=<size> to increase size
[  165.840573] vmap allocation for size 2498560 failed: use vmalloc=<size> to increase size
[  165.841059] vmap allocation for size 2498560 failed: use vmalloc=<size> to increase size
[  165.841428] vmap allocation for size 2498560 failed: use vmalloc=<size> to increase size
[  165.841819] vmap allocation for size 2498560 failed: use vmalloc=<size> to increase size
[  165.842123] vmap allocation for size 2498560 failed: use vmalloc=<size> to increase size
[  165.843359] vmap allocation for size 2498560 failed: use vmalloc=<size> to increase size
[  165.844894] vmap allocation for size 2498560 failed: use vmalloc=<size> to increase size
[  165.847028] CPU: 253 PID: 4995 Comm: systemd-udevd Not tainted 5.19.0 #2
[  165.935689] Hardware name: Lenovo ThinkSystem SR950 -[7X12ABC1WW]-/-[7X12ABC1WW]-, BIOS -[PSE130O-1.81]- 05/20/2020
[  165.947343] Call Trace:
[  165.950075]  <TASK>
[  165.952425]  dump_stack_lvl+0x57/0x81
[  165.956532]  warn_alloc.cold+0x95/0x18a
[  165.960836]  ? zone_watermark_ok_safe+0x240/0x240
[  165.966100]  ? slab_free_freelist_hook+0x11d/0x1d0
[  165.971461]  ? __get_vm_area_node+0x2af/0x360
[  165.976341]  ? __get_vm_area_node+0x2af/0x360
[  165.981219]  __vmalloc_node_range+0x291/0x560
[  165.986087]  ? __mutex_unlock_slowpath+0x161/0x5e0
[  165.991447]  ? move_module+0x4c/0x630
[  165.995547]  ? vfree_atomic+0xa0/0xa0
[  165.999647]  ? move_module+0x4c/0x630
[  166.003741]  module_alloc+0xe7/0x170
[  166.007747]  ? move_module+0x4c/0x630
[  166.011840]  move_module+0x4c/0x630
[  166.015751]  layout_and_allocate+0x32c/0x560
[  166.020519]  load_module+0x8e0/0x25c0
[  166.024623]  ? layout_and_allocate+0x560/0x560
[  166.029586]  ? kernel_read_file+0x286/0x6b0
[  166.034269]  ? __x64_sys_fspick+0x290/0x290
[  166.038946]  ? userfaultfd_unmap_prep+0x430/0x430
[  166.044203]  ? lock_downgrade+0x130/0x130
[  166.048698]  ? __do_sys_finit_module+0x11a/0x1c0
[  166.053854]  __do_sys_finit_module+0x11a/0x1c0
[  166.058818]  ? __ia32_sys_init_module+0xa0/0xa0
[  166.063882]  ? __seccomp_filter+0x92/0x930
[  166.068494]  do_syscall_64+0x59/0x90
[  166.072492]  ? do_syscall_64+0x69/0x90
[  166.076679]  ? do_syscall_64+0x69/0x90
[  166.080864]  ? do_syscall_64+0x69/0x90
[  166.085047]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
[  166.090984]  ? lockdep_hardirqs_on+0x79/0x100
[  166.095855]  entry_SYSCALL_64_after_hwframe+0x63/0xcd



Some facts:

1. The number of CPUs seems to matter more than the amount of memory

Initially we thought the memory size would be the relevant trigger,
because we had only seen it on 8TiB machines. But I was also able to
reproduce it on a "small" machine with ~450GiB.

We've seen this issue only on machines with a lot of (~448) logical CPUs.

On such systems, I have so far not been able to reproduce the issue when
booting the kernel with "nosmt", which could indicate some kind of
concurrency problem.


2. CONFIG_KASAN_INLINE seems to be relevant

This issue only seems to trigger with KASAN enabled and, from what I can
tell, only with CONFIG_KASAN_INLINE=y.

The help text for CONFIG_KASAN_INLINE warns that it "makes the kernel's
.text size much bigger", and that should equally apply to the kernel
modules to be loaded.
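
Conceptually (a hand-waving illustration, not the exact code the compiler
emits), inline instrumentation open-codes the KASAN shadow check at every
single memory access, whereas outline mode emits a single function call:

<snip>
	/* KASAN_INLINE: roughly this is emitted before every load/store
	 * of addr, instead of one call to __asan_loadN(addr) /
	 * __asan_storeN(addr) as in outline mode -- hence the much
	 * bigger .text: */
	s8 *shadow = (s8 *)((addr >> KASAN_SHADOW_SCALE_SHIFT) +
			    KASAN_SHADOW_OFFSET);
	if (unlikely(*shadow))
		kasan_report(addr, size, write, _RET_IP_); /* slow path */
<snip>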


3. All systems have 8 equally sized NUMA nodes

... which implies that at least one node is practically completely filled
with KASAN data. I remember adjusting the system size with "mem=" such
that some nodes were memory-less but node 0 would still have some free
memory, and the issue still triggered.



My current best guess is that this is a combination of large VMAP demands
(e.g., kernel modules that are quite large due to CONFIG_KASAN_INLINE) and
possibly a concurrency issue with a large number of CPUs. But I might be
wrong and this might be something zone/node related.

Does any of that ring a bell -- especially why it would fail with 448 logical
CPUs but succeed with 224 logical CPUs (nosmt)?

My suspicion is that the purge_vmap_area_lazy() logic in alloc_vmap_area()
might not be sufficient when there is a lot of concurrency: simply purging
once and then failing might be problematic in corner cases with a lot of
concurrent vmap activity going on. But that's just a guess.
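
For reference, the path I mean (v5.19 mm/vmalloc.c, heavily abbreviated):
after a single purge_vmap_area_lazy() the allocation fails for good,
unless the vmap notifier chain managed to free something:

<snip>
overflow:
	if (!purged) {
		purge_vmap_area_lazy();	/* drain lazily-freed areas, once */
		purged = 1;
		goto retry;
	}

	freed = 0;
	blocking_notifier_call_chain(&vmap_notify_list, 0, &freed);

	if (freed > 0) {
		purged = 0;
		goto retry;
	}

	if (!(gfp_mask & __GFP_NOWARN) && printk_ratelimit())
		pr_warn("vmap allocation for size %lu failed: use vmalloc=<size> to increase size\n",
			size);

	kmem_cache_free(vmap_area_cachep, va);
	return ERR_PTR(-EBUSY);
<snip>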


Cheers!

-- 
Thanks,

David / dhildenb




* Re: KASAN-related VMAP allocation errors in debug kernels with many logical CPUs
  2022-10-06 13:46 KASAN-related VMAP allocation errors in debug kernels with many logical CPUs David Hildenbrand
@ 2022-10-06 15:35 ` Uladzislau Rezki
  2022-10-06 16:12   ` David Hildenbrand
  0 siblings, 1 reply; 10+ messages in thread
From: Uladzislau Rezki @ 2022-10-06 15:35 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Alexander Potapenko, Uladzislau Rezki (Sony), Andrey Konovalov,
	linux-mm@kvack.org, Andrey Ryabinin, Dmitry Vyukov,
	Vincenzo Frascino, kasan-dev

> Hi,
> 
> we're currently hitting a weird vmap issue in debug kernels with KASAN enabled
> on fairly large VMs. I reproduced it on v5.19 (did not get the chance to
> try 6.0 yet because I don't have access to the machine right now, but
> I suspect it persists).
> 
> It seems to trigger when udev probes a massive amount of devices in parallel
> while the system is booting up. Once the system booted, I no longer see any
> such issues.
> 
> 
> [  165.818200] vmap allocation for size 2498560 failed: use vmalloc=<size> to increase size
> [  165.836622] vmap allocation for size 315392 failed: use vmalloc=<size> to increase size
> [  165.837461] vmap allocation for size 315392 failed: use vmalloc=<size> to increase size
> [  165.840573] vmap allocation for size 2498560 failed: use vmalloc=<size> to increase size
> [  165.841059] vmap allocation for size 2498560 failed: use vmalloc=<size> to increase size
> [  165.841428] vmap allocation for size 2498560 failed: use vmalloc=<size> to increase size
> [  165.841819] vmap allocation for size 2498560 failed: use vmalloc=<size> to increase size
> [  165.842123] vmap allocation for size 2498560 failed: use vmalloc=<size> to increase size
> [  165.843359] vmap allocation for size 2498560 failed: use vmalloc=<size> to increase size
> [  165.844894] vmap allocation for size 2498560 failed: use vmalloc=<size> to increase size
> [  165.847028] CPU: 253 PID: 4995 Comm: systemd-udevd Not tainted 5.19.0 #2
> [  165.935689] Hardware name: Lenovo ThinkSystem SR950 -[7X12ABC1WW]-/-[7X12ABC1WW]-, BIOS -[PSE130O-1.81]- 05/20/2020
> [  165.947343] Call Trace:
> [  165.950075]  <TASK>
> [  165.952425]  dump_stack_lvl+0x57/0x81
> [  165.956532]  warn_alloc.cold+0x95/0x18a
> [  165.960836]  ? zone_watermark_ok_safe+0x240/0x240
> [  165.966100]  ? slab_free_freelist_hook+0x11d/0x1d0
> [  165.971461]  ? __get_vm_area_node+0x2af/0x360
> [  165.976341]  ? __get_vm_area_node+0x2af/0x360
> [  165.981219]  __vmalloc_node_range+0x291/0x560
> [  165.986087]  ? __mutex_unlock_slowpath+0x161/0x5e0
> [  165.991447]  ? move_module+0x4c/0x630
> [  165.995547]  ? vfree_atomic+0xa0/0xa0
> [  165.999647]  ? move_module+0x4c/0x630
> [  166.003741]  module_alloc+0xe7/0x170
> [  166.007747]  ? move_module+0x4c/0x630
> [  166.011840]  move_module+0x4c/0x630
> [  166.015751]  layout_and_allocate+0x32c/0x560
> [  166.020519]  load_module+0x8e0/0x25c0
>
Could it be that we do not have a large enough "module section"? I mean
the section size, MODULES_END - MODULES_VADDR, is rather small, so some
modules are not loaded due to lack of space.

CONFIG_RANDOMIZE_BASE also creates some offset overhead if enabled on
your box. But it looks like it is rather negligible.

Maybe try to increase the module-section size to see if it solves the
problem.
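
For reference, on x86-64 the module section is fixed at slightly under
1 GiB (from arch/x86/include/asm/pgtable_64_types.h; the exact values
depend on kernel version and config, so treat this as a sketch):

<snip>
#define MODULES_VADDR	(__START_KERNEL_map + KERNEL_IMAGE_SIZE)
#define MODULES_END	_AC(0xffffffffff000000, UL)
#define MODULES_LEN	(MODULES_END - MODULES_VADDR)
<snip>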

--
Uladzislau Rezki



* Re: KASAN-related VMAP allocation errors in debug kernels with many logical CPUs
  2022-10-06 15:35 ` Uladzislau Rezki
@ 2022-10-06 16:12   ` David Hildenbrand
  2022-10-07 15:34     ` Uladzislau Rezki
  0 siblings, 1 reply; 10+ messages in thread
From: David Hildenbrand @ 2022-10-06 16:12 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Alexander Potapenko, Andrey Konovalov, linux-mm@kvack.org,
	Andrey Ryabinin, Dmitry Vyukov, Vincenzo Frascino, kasan-dev

On 06.10.22 17:35, Uladzislau Rezki wrote:
>> [...]
>>
> Could it be that we do not have a large enough "module section"? I mean
> the section size, MODULES_END - MODULES_VADDR, is rather small, so some
> modules are not loaded due to lack of space.
> 
> CONFIG_RANDOMIZE_BASE also creates some offset overhead if enabled on
> your box. But it looks like it is rather negligible.

Right, I suspected both points -- but was fairly confused why the number
of CPUs would matter.

What would make sense: if we're tight on module vmap space, the race that
I think can happen with purging only once and then failing could become
relevant.

> 
> Maybe try to increase the module-section size to see if it solves the
> problem.

What would be the easiest way to do that?

Thanks!

-- 
Thanks,

David / dhildenb




* Re: KASAN-related VMAP allocation errors in debug kernels with many logical CPUs
  2022-10-06 16:12   ` David Hildenbrand
@ 2022-10-07 15:34     ` Uladzislau Rezki
  2022-10-10  6:56       ` David Hildenbrand
  0 siblings, 1 reply; 10+ messages in thread
From: Uladzislau Rezki @ 2022-10-07 15:34 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Uladzislau Rezki, Alexander Potapenko, Andrey Konovalov,
	linux-mm@kvack.org, Andrey Ryabinin, Dmitry Vyukov,
	Vincenzo Frascino, kasan-dev

> On 06.10.22 17:35, Uladzislau Rezki wrote:
> > > [...]
> > > 
> > Could it be that we do not have a large enough "module section"? I mean
> > the section size, MODULES_END - MODULES_VADDR, is rather small, so some
> > modules are not loaded due to lack of space.
> > 
> > CONFIG_RANDOMIZE_BASE also creates some offset overhead if enabled on
> > your box. But it looks like it is rather negligible.
> 
> Right, I suspected both points -- but was fairly confused why the number
> of CPUs would matter.
> 
> What would make sense: if we're tight on module vmap space, the race that
> I think can happen with purging only once and then failing could become
> relevant.
> 
> > 
> > Maybe try to increase the module-section size to see if it solves the
> > problem.
> 
> What would be the easiest way to do that?
> 
Sorry for the late answer. I was trying to reproduce it on my box. What I
did was try to load all modules on my system with the KASAN_INLINE option:

<snip>
#!/bin/bash

# Exclude test_vmalloc.ko
MODULES_LIST=(`find /lib/modules/$(uname -r) -type f \
	\( -iname "*.ko" -not -iname "test_vmalloc*" \) | awk -F"/" '{print $NF}' | sed 's/.ko//'`)

function moduleExist(){
	MODULE="$1"
	if lsmod | grep "$MODULE" &> /dev/null ; then
		return 0
	else
		return 1
	fi
}

i=0

for module_name in ${MODULES_LIST[@]}; do
	sudo modprobe $module_name

	if moduleExist ${module_name}; then
		((i=i+1))
		echo "Successfully loaded $module_name counter $i"
	fi
done
<snip>

As you wrote, it looks like it is not easy to reproduce, so I do not see
any vmap-related errors.

Returning to the question: I think you could increase the MODULES_END
address and shift FIXADDR_START a little forward; see dump_pagetables.c.
But it might be that they are pretty compact and located at the end, so
I am not sure there is room there.

Second, it would be good to understand whether vmap only fails when
allocating for a module:

<snip>
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index dd6cdb201195..53026fdda224 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1614,6 +1614,8 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
        va->va_end = addr + size;
        va->vm = NULL;
 
+       trace_printk("-> alloc %lu size, align: %lu, vstart: %lu, vend: %lu\n", size, align, vstart, vend);
+
        spin_lock(&vmap_area_lock);
<snip>

--
Uladzislau Rezki



* Re: KASAN-related VMAP allocation errors in debug kernels with many logical CPUs
  2022-10-07 15:34     ` Uladzislau Rezki
@ 2022-10-10  6:56       ` David Hildenbrand
  2022-10-10 12:19         ` Uladzislau Rezki
  0 siblings, 1 reply; 10+ messages in thread
From: David Hildenbrand @ 2022-10-10  6:56 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Alexander Potapenko, Andrey Konovalov, linux-mm@kvack.org,
	Andrey Ryabinin, Dmitry Vyukov, Vincenzo Frascino, kasan-dev

>>> Maybe try to increase the module-section size to see if it solves the
>>> problem.
>>
>> What would be the easiest way to do that?
>>
> Sorry for the late answer. I was trying to reproduce it on my box. What I
> did was try to load all modules on my system with the KASAN_INLINE option:
> 

Thanks!

> [...]
> 
> As you wrote, it looks like it is not easy to reproduce, so I do not see
> any vmap-related errors.

Yeah, it's quite a mystery and it only seems to trigger on these systems
with a lot of CPUs.

> 
> Returning to the question: I think you could increase the MODULES_END
> address and shift FIXADDR_START a little forward; see dump_pagetables.c.
> But it might be that they are pretty compact and located at the end, so
> I am not sure there is room there.

That's what I was afraid of :)

> 
> Second, it would be good to understand whether vmap only fails when
> allocating for a module:
> 
> [...]

I'll try grabbing a suitable system again and add some more debugging 
output. Might take a while, unfortunately.

-- 
Thanks,

David / dhildenb




* Re: KASAN-related VMAP allocation errors in debug kernels with many logical CPUs
  2022-10-10  6:56       ` David Hildenbrand
@ 2022-10-10 12:19         ` Uladzislau Rezki
  2022-10-11 19:52           ` David Hildenbrand
  0 siblings, 1 reply; 10+ messages in thread
From: Uladzislau Rezki @ 2022-10-10 12:19 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Uladzislau Rezki, Alexander Potapenko, Andrey Konovalov,
	linux-mm@kvack.org, Andrey Ryabinin, Dmitry Vyukov,
	Vincenzo Frascino, kasan-dev

On Mon, Oct 10, 2022 at 08:56:55AM +0200, David Hildenbrand wrote:
> > > > Maybe try to increase the module-section size to see if it solves the
> > > > problem.
> > > 
> > > What would be the easiest way to do that?
> > > 
> > Sorry for the late answer. I was trying to reproduce it on my box. What I
> > did was try to load all modules on my system with the KASAN_INLINE option:
> > 
> 
> Thanks!
> 
> > [...]
> > 
> > As you wrote, it looks like it is not easy to reproduce, so I do not see
> > any vmap-related errors.
> 
> Yeah, it's quite a mystery and it only seems to trigger on these systems
> with a lot of CPUs.
> 
> > 
> > Returning to the question: I think you could increase the MODULES_END
> > address and shift FIXADDR_START a little forward; see dump_pagetables.c.
> > But it might be that they are pretty compact and located at the end, so
> > I am not sure there is room there.
> 
> That's what I was afraid of :)
> 
> > 
> > Second, it would be good to understand whether vmap only fails when
> > allocating for a module:
> > 
> > [...]
> 
> I'll try grabbing a suitable system again and add some more debugging
> output. Might take a while, unfortunately.
> 
Yes, that makes sense, especially to understand whether it fails in the
MODULES_VADDR - MODULES_END range or somewhere else. According to your
trace output it looks like that, but it would be good to confirm it by
adding some traces.

BTW, the vmap code lacks good trace events. It is probably worth adding
some basic ones.

--
Uladzislau Rezki



* Re: KASAN-related VMAP allocation errors in debug kernels with many logical CPUs
  2022-10-10 12:19         ` Uladzislau Rezki
@ 2022-10-11 19:52           ` David Hildenbrand
  2022-10-12 16:36             ` Uladzislau Rezki
  0 siblings, 1 reply; 10+ messages in thread
From: David Hildenbrand @ 2022-10-11 19:52 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Alexander Potapenko, Andrey Konovalov, linux-mm@kvack.org,
	Andrey Ryabinin, Dmitry Vyukov, Vincenzo Frascino, kasan-dev

On 10.10.22 14:19, Uladzislau Rezki wrote:
> On Mon, Oct 10, 2022 at 08:56:55AM +0200, David Hildenbrand wrote:
>>>>> Maybe try to increase the module-section size to see if it solves the
>>>>> problem.
>>>>
>>>> What would be the easiest way to do that?
>>>>
>>> Sorry for the late answer. I was trying to reproduce it on my box. What I
>>> did was try to load all modules on my system with the KASAN_INLINE option:
>>>
>>
>> Thanks!
>>
>>> [...]
>>>
>>> As you wrote, it looks like it is not easy to reproduce, so I do not see
>>> any vmap-related errors.
>>
>> Yeah, it's quite a mystery and it only seems to trigger on these systems
>> with a lot of CPUs.
>>
>>>
>>> Returning to the question: I think you could increase the MODULES_END
>>> address and shift FIXADDR_START a little forward; see dump_pagetables.c.
>>> But it might be that they are pretty compact and located at the end, so
>>> I am not sure there is room there.
>>
>> That's what I was afraid of :)
>>
>>>
>>> Second, it would be good to understand whether vmap only fails when
>>> allocating for a module:
>>>
>>> [...]
>>
>> I'll try grabbing a suitable system again and add some more debugging
>> output. Might take a while, unfortunately.
>>
> Yes, that makes sense, especially to understand whether it fails in the
> MODULES_VADDR - MODULES_END range or somewhere else. According to your
> trace output it looks like that, but it would be good to confirm it by
> adding some traces.
> 
> BTW, the vmap code lacks good trace events. It is probably worth adding
> some basic ones.

I was lucky to grab that system again and compiled a custom 6.0 kernel in which I printk all vmap allocation errors, including the range, similar to what you suggested above (but printk only on the failure path).

So these are the failing allocations:

# dmesg | grep " -> alloc"
[  168.862511] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[  168.863020] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[  168.863841] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[  168.864562] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[  168.864646] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[  168.865688] -> alloc 319488 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[  168.865718] -> alloc 319488 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[  168.866098] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[  168.866551] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[  168.866752] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[  168.867147] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[  168.867210] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[  168.867312] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[  168.867650] -> alloc 319488 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[  168.867767] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[  168.867815] -> alloc 319488 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[  168.867815] -> alloc 319488 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[  168.868059] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[  168.868463] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[  168.868822] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[  168.868919] -> alloc 319488 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[  168.869843] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[  168.869854] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[  168.870174] -> alloc 319488 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[  168.870611] -> alloc 319488 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[  168.870806] -> alloc 319488 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[  168.870982] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[  168.879000] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[  169.449101] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[  169.449834] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[  169.450667] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[  169.451539] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[  169.452326] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[  169.453239] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[  169.454052] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[  169.454697] -> alloc 319488 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[  169.454811] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[  169.455575] -> alloc 319488 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[  169.455754] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[  169.461450] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[  169.805223] -> alloc 319488 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[  169.805507] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[  169.929577] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[  169.930389] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[  169.931244] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[  169.932035] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[  169.932796] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[  169.933592] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[  169.934470] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[  169.935344] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[  169.970641] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[  170.191600] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[  170.191875] -> alloc 40960 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[  170.241901] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[  170.242708] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[  170.243465] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[  170.244211] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[  170.245060] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[  170.245868] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[  170.246433] -> alloc 40960 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[  170.246657] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[  170.247451] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[  170.248226] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[  170.248902] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[  170.249704] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[  170.250497] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[  170.251244] -> alloc 319488 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[  170.252076] -> alloc 319488 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[  170.587168] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[  170.598995] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[  171.865721] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[  172.138557] -> alloc 917504 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400


Really looks like only module vmap space. ~ 1 GiB of vmap module space ...
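
To double-check, a quick userspace sketch that converts the decimal
vstart/vend from the trace above to hex (nothing kernel-specific here):

<snip>
#include <stdio.h>

int main(void)
{
	/* decimal vstart/vend values from the failing allocations above */
	unsigned long vstart = 18446744072639352832UL;
	unsigned long vend   = 18446744073692774400UL;

	printf("vstart = 0x%lx\n", vstart); /* 0xffffffffc0380000 */
	printf("vend   = 0x%lx\n", vend);   /* 0xffffffffff000000, i.e. MODULES_END on x86-64 */
	printf("range  = %lu MiB\n", (vend - vstart) >> 20); /* 1004 MiB */
	return 0;
}
<snip>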

I did try:

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index dd6cdb201195..199154a2228a 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -72,6 +72,8 @@ early_param("nohugevmalloc", set_nohugevmalloc);
  static const bool vmap_allow_huge = false;
  #endif /* CONFIG_HAVE_ARCH_HUGE_VMALLOC */
  
+static atomic_long_t vmap_lazy_nr = ATOMIC_LONG_INIT(0);
+
  bool is_vmalloc_addr(const void *x)
  {
         unsigned long addr = (unsigned long)kasan_reset_tag(x);
@@ -1574,7 +1576,6 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
         struct vmap_area *va;
         unsigned long freed;
         unsigned long addr;
-       int purged = 0;
         int ret;
  
         BUG_ON(!size);
@@ -1631,23 +1632,22 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
         return va;
  
  overflow:
-       if (!purged) {
+       if (atomic_long_read(&vmap_lazy_nr)) {
                 purge_vmap_area_lazy();
-               purged = 1;
                 goto retry;
         }
  
         freed = 0;
         blocking_notifier_call_chain(&vmap_notify_list, 0, &freed);
  
-       if (freed > 0) {
-               purged = 0;
+       if (freed > 0)
                 goto retry;
-       }
  
-       if (!(gfp_mask & __GFP_NOWARN) && printk_ratelimit())
+       if (!(gfp_mask & __GFP_NOWARN)) {
                 pr_warn("vmap allocation for size %lu failed: use vmalloc=<size> to increase size\n",
                         size);
+               printk("-> alloc %lu size, align: %lu, vstart: %lu, vend: %lu\n", size, align, vstart, vend);
+       }
  
         kmem_cache_free(vmap_area_cachep, va);
         return ERR_PTR(-EBUSY);
@@ -1690,8 +1690,6 @@ static unsigned long lazy_max_pages(void)
         return log * (32UL * 1024 * 1024 / PAGE_SIZE);
  }
  
-static atomic_long_t vmap_lazy_nr = ATOMIC_LONG_INIT(0);
-


But that didn't help at all. That system is crazy:

# lspci | wc -l
1117


What I find interesting is that we have these recurring allocations of similar sizes failing.
I wonder if user space is capable of loading the same kernel module concurrently to
trigger a massive amount of allocations, and module loading code only figures out
later that it has already been loaded and backs off.

My best guess would be that module loading is serialized completely, but for some reason,
something seems to go wrong with a lot of concurrency ...

-- 
Thanks,

David / dhildenb




* Re: KASAN-related VMAP allocation errors in debug kernels with many logical CPUs
  2022-10-11 19:52           ` David Hildenbrand
@ 2022-10-12 16:36             ` Uladzislau Rezki
  2022-10-13 16:21               ` David Hildenbrand
  0 siblings, 1 reply; 10+ messages in thread
From: Uladzislau Rezki @ 2022-10-12 16:36 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Uladzislau Rezki, Alexander Potapenko, Andrey Konovalov,
	linux-mm@kvack.org, Andrey Ryabinin, Dmitry Vyukov,
	Vincenzo Frascino, kasan-dev

> 
> I was lucky to grab that system again and compiled a custom 6.0 kernel in which I printk all vmap allocation errors, including the range, similar to what you suggested above (but printk only on the failure path).
> 
> So these are the failing allocations:
> 
> # dmesg | grep " -> alloc"
> [  168.862511] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
> [...]
> [  172.138557] -> alloc 917504 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
> 
OK. It is related to a module vmap space allocation when a module is
inserted. I wonder why it requires 2.5MB for a module? It seems a lot
to me.

> 
> Really looks like only module vmap space. ~ 1 GiB of vmap module space ...
> 
If an allocation request for a module is 2.5MB, we can load ~400 modules
into a 1GB address space.

"lsmod | wc -l"? How many modules does your system have?

> I did try:
> 
> [...]
> 
> 
> But that didn't help at all. That system is crazy:
> 
If an allocation fails, the next step is to drain outstanding vmap
areas. So a caller does that from its own context, retries one more
time, and only after that is a failure message printed.

>
> # lspci | wc -l
> 1117
> 
So you probably need a lot of modules in order to make your HW fully functional :)

> 
> What I find interesting is that we have these recurring allocations of similar sizes failing.
> I wonder if user space is capable of loading the same kernel module concurrently to
> trigger a massive amount of allocations, and module loading code only figures out
> later that it has already been loaded and backs off.
> 
If there is a request to allocate memory, it has to succeed unless there
is an error like no space or no memory.

>
> My best guess would be that module loading is serialized completely, but for some reason,
> something seems to go wrong with a lot of concurrency ...
> 
lazy_max_pages() depends on the number of online CPUs. Probably something
related ...
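
For reference (v6.0 mm/vmalloc.c; the return line is also visible in your
diff above):

<snip>
static unsigned long lazy_max_pages(void)
{
	unsigned int log;

	log = fls(num_online_cpus());

	return log * (32UL * 1024 * 1024 / PAGE_SIZE);
}
<snip>

With 448 CPUs, fls() gives 9, so up to 9 * 32MB = 288MB worth of lazily
freed vmap space may stay unpurged; with 224 CPUs (nosmt) it is 8 * 32MB
= 256MB. Not a big difference, but against a ~1GB module section neither
is negligible.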

I wrote a small patch to dump the modules address space when a failure occurs:

<snip v6.0>
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 83b54beb12fa..88d323310df5 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1580,6 +1580,37 @@ preload_this_cpu_lock(spinlock_t *lock, gfp_t gfp_mask, int node)
 		kmem_cache_free(vmap_area_cachep, va);
 }
 
+static void
+dump_modules_free_space(unsigned long vstart, unsigned long vend)
+{
+	unsigned long va_start, va_end;
+	unsigned int total = 0;
+	struct vmap_area *va;
+
+	if (vend != MODULES_END)
+		return;
+
+	trace_printk("--- Dump the modules address space: 0x%lx - 0x%lx\n", vstart, vend);
+
+	spin_lock(&free_vmap_area_lock);
+	list_for_each_entry(va, &free_vmap_area_list, list) {
+		va_start = (va->va_start > vstart) ? va->va_start:vstart;
+		va_end = (va->va_end < vend) ? va->va_end:vend;
+
+		if (va_start >= va_end)
+			continue;
+
+		if (va_start >= vstart && va_end <= vend) {
+			trace_printk(" va_free: 0x%lx - 0x%lx size=%lu\n",
+				va_start, va_end, va_end - va_start);
+			total += (va_end - va_start);
+		}
+	}
+
+	spin_unlock(&free_vmap_area_lock);
+	trace_printk("--- Total free: %u ---\n", total);
+}
+
 /*
  * Allocate a region of KVA of the specified size and alignment, within the
  * vstart and vend.
@@ -1663,10 +1694,13 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
 		goto retry;
 	}
 
-	if (!(gfp_mask & __GFP_NOWARN) && printk_ratelimit())
+	if (!(gfp_mask & __GFP_NOWARN) && printk_ratelimit()) {
 		pr_warn("vmap allocation for size %lu failed: use vmalloc=<size> to increase size\n",
 			size);
 
+		dump_modules_free_space(vstart, vend);
+	}
+
 	kmem_cache_free(vmap_area_cachep, va);
 	return ERR_PTR(-EBUSY);
 }
<snip>

It would be good to understand whether we really ran out of space.
Adding a print of lazy_max_pages() and vmap_lazy_nr would also be good.
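
For example, something like this (a sketch) next to the failure printk in
alloc_vmap_area(); vmap_lazy_nr is the existing counter in mm/vmalloc.c:

<snip>
	pr_warn("lazy_max_pages: %lu, vmap_lazy_nr: %ld\n",
		lazy_max_pages(), atomic_long_read(&vmap_lazy_nr));
<snip>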

--
Uladzislau Rezki



* Re: KASAN-related VMAP allocation errors in debug kernels with many logical CPUs
  2022-10-12 16:36             ` Uladzislau Rezki
@ 2022-10-13 16:21               ` David Hildenbrand
  2022-10-15  9:23                 ` Uladzislau Rezki
  0 siblings, 1 reply; 10+ messages in thread
From: David Hildenbrand @ 2022-10-13 16:21 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Alexander Potapenko, Andrey Konovalov, linux-mm@kvack.org,
	Andrey Ryabinin, Dmitry Vyukov, Vincenzo Frascino, kasan-dev

>>
> OK. It is related to a module vmap space allocation when a module is
> inserted. I wonder why it requires 2.5MB for a module? It seems a lot
> to me.
> 

Indeed. I assume KASAN can go wild when it instruments each and every 
memory access.

>>
>> Really looks like only module vmap space. ~ 1 GiB of vmap module space ...
>>
> If an allocation request for a module is 2.5MB, we can load ~400 modules
> into a 1GB address space.
> 
> "lsmod | wc -l"? How many modules does your system have?
> 

~71, so not even close to 400.

>> What I find interesting is that we have these recurring allocations of similar sizes failing.
>> I wonder if user space is capable of loading the same kernel module concurrently to
>> trigger a massive amount of allocations, and module loading code only figures out
>> later that it has already been loaded and backs off.
>>
> If there is a request to allocate memory, it has to succeed unless there
> is an error like no space or no memory.

Yes. But as I found out, we're really out of space, because the module
loading code allocates module VMAP space first, before verifying whether
the module was already loaded or is concurrently being loaded.

See below.

[...]

> I wrote a small patch to dump the modules address space when a failure occurs:
> 
> [...]

Thanks!

I can spot the same module getting loaded over and over again
concurrently from user space, only to fail with -EEXIST after all the
allocations, once add_unformed_module() realizes that the module is in
fact already loaded.

That looks quite inefficient. Here is how often user space tried to load
the same module on that system. Note that I print *after* allocating
module VMAP space.

# dmesg | grep Loading | cut -d" " -f5 | sort | uniq -c
     896 acpi_cpufreq
       1 acpi_pad
       1 acpi_power_meter
       2 ahci
       1 cdrom
       2 compiled-in
       1 coretemp
      15 crc32c_intel
     307 crc32_pclmul
       1 crc64
       1 crc64_rocksoft
       1 crc64_rocksoft_generic
      12 crct10dif_pclmul
      16 dca
       1 dm_log
       1 dm_mirror
       1 dm_mod
       1 dm_region_hash
       1 drm
       1 drm_kms_helper
       1 drm_shmem_helper
       1 fat
       1 fb_sys_fops
      14 fjes
       1 fuse
     205 ghash_clmulni_intel
       1 i2c_algo_bit
       1 i2c_i801
       1 i2c_smbus
       4 i40e
       4 ib_core
       1 ib_uverbs
       4 ice
     403 intel_cstate
       1 intel_pch_thermal
       1 intel_powerclamp
       1 intel_rapl_common
       1 intel_rapl_msr
     399 intel_uncore
       1 intel_uncore_frequency
       1 intel_uncore_frequency_common
      64 ioatdma
       1 ipmi_devintf
       1 ipmi_msghandler
       1 ipmi_si
       1 ipmi_ssif
       4 irdma
     406 irqbypass
       1 isst_if_common
     165 isst_if_mbox_msr
     300 kvm
     408 kvm_intel
       1 libahci
       2 libata
       1 libcrc32c
     409 libnvdimm
       8 Loading
       1 lpc_ich
       1 megaraid_sas
       1 mei
       1 mei_me
       1 mgag200
       1 nfit
       1 pcspkr
       1 qrtr
     405 rapl
       1 rfkill
       1 sd_mod
       2 sg
     409 skx_edac
       1 sr_mod
       1 syscopyarea
       1 sysfillrect
       1 sysimgblt
       1 t10_pi
       1 uas
       1 usb_storage
       1 vfat
       1 wmi
       1 x86_pkg_temp_thermal
       1 xfs
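
The print behind these numbers sits roughly at this spot in 
load_module() -- a sketch only, since the exact print I used is not 
shown in this thread:

	mod = layout_and_allocate(info, flags);
	if (IS_ERR(mod)) {
		err = PTR_ERR(mod);
		goto free_copy;
	}

	/* Module VMAP space is already reserved once we get here. */
	pr_info("Loading %s\n", mod->name);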


For each of these loading requests, we'll reserve module VMAP space, and 
free it only once we realize later that the module was already loaded.

So with a lot of CPUs, we might end up with so many simultaneous 
attempts to load the same modules that we actually run out of module 
VMAP space.
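
Putting the numbers from this thread together:

    ~1 GiB of module space / ~2.5 MiB per allocation ~= 400 allocations

and the list above shows single modules being requested up to 896 
times, so even a fraction of those requests in flight at once is 
enough to exhaust the whole module area.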

I have a prototype patch that seems to fix this in module loading code.
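
The gist of it looks roughly like the sketch below -- with a made-up 
function name and simplified handling; the real fix also has to deal 
with a racing loader whose module is still in MODULE_STATE_UNFORMED 
state, by waiting for it instead of failing right away:

/*
 * Sketch: look for the module by name *before* layout_and_allocate()
 * reserves module VMAP space, instead of letting add_unformed_module()
 * catch the duplicate only after the allocation.
 */
static int early_mod_check(struct load_info *info)
{
	struct module *mod;
	int err = 0;

	mutex_lock(&module_mutex);
	mod = find_module_all(info->name, strlen(info->name), true);
	if (mod)
		err = -EEXIST;
	mutex_unlock(&module_mutex);

	return err;
}

load_module() would call this right at the start, so duplicate 
requests bail out before any module VMAP space is reserved.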

Thanks!

-- 
Thanks,

David / dhildenb




* Re: KASAN-related VMAP allocation errors in debug kernels with many logical CPUS
  2022-10-13 16:21               ` David Hildenbrand
@ 2022-10-15  9:23                 ` Uladzislau Rezki
  0 siblings, 0 replies; 10+ messages in thread
From: Uladzislau Rezki @ 2022-10-15  9:23 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Uladzislau Rezki, Alexander Potapenko, Andrey Konovalov,
	linux-mm@kvack.org, Andrey Ryabinin, Dmitry Vyukov,
	Vincenzo Frascino, kasan-dev

> > > 
> > OK. It is related to a module vmap space allocation when a module is
> > inserted. I wonder why it requires 2.5MB for a module? That seems like
> > a lot to me.
> > 
> 
> Indeed. I assume KASAN can go wild when it instruments each and every memory
> access.
> 
> > > 
> > > Really looks like only module vmap space. ~ 1 GiB of vmap module space ...
> > > 
> > If each allocation request for a module is 2.5MB, we can load ~400
> > modules into 1GB of address space.
> > 
> > "lsmod | wc -l"? How many modules your system has?
> > 
> 
> ~71, so not even close to 400.
> 
> > > What I find interesting is that we have these recurring allocations of similar sizes failing.
> > > I wonder if user space is capable of loading the same kernel module concurrently to
> > > trigger a massive amount of allocations, and module loading code only figures out
> > > later that it has already been loaded and backs off.
> > > 
> > A request to allocate memory should succeed unless there is an error,
> > like running out of space or memory.
> 
> Yes. But as I found out we're really out of space because module loading
> code allocates module VMAP space first, before verifying if the module was
> already loaded or is concurrently getting loaded.
> 
> See below.
> 
> [...]
> 
> > I wrote a small patch to dump the module address space when a failure occurs:
> > 
> > <snip v6.0>
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index 83b54beb12fa..88d323310df5 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -1580,6 +1580,37 @@ preload_this_cpu_lock(spinlock_t *lock, gfp_t gfp_mask, int node)
> >   		kmem_cache_free(vmap_area_cachep, va);
> >   }
> > +static void
> > +dump_modules_free_space(unsigned long vstart, unsigned long vend)
> > +{
> > +	unsigned long va_start, va_end;
> > +	unsigned int total = 0;
> > +	struct vmap_area *va;
> > +
> > +	if (vend != MODULES_END)
> > +		return;
> > +
> > +	trace_printk("--- Dump a modules address space: 0x%lx - 0x%lx\n", vstart, vend);
> > +
> > +	spin_lock(&free_vmap_area_lock);
> > +	list_for_each_entry(va, &free_vmap_area_list, list) {
> > +		va_start = (va->va_start > vstart) ? va->va_start : vstart;
> > +		va_end = (va->va_end < vend) ? va->va_end : vend;
> > +
> > +		if (va_start >= va_end)
> > +			continue;
> > +
> > +		if (va_start >= vstart && va_end <= vend) {
> > +			trace_printk(" va_free: 0x%lx - 0x%lx size=%lu\n",
> > +				va_start, va_end, va_end - va_start);
> > +			total += (va_end - va_start);
> > +		}
> > +	}
> > +
> > +	spin_unlock(&free_vmap_area_lock);
> > +	trace_printk("--- Total free: %u ---\n", total);
> > +}
> > +
> >   /*
> >    * Allocate a region of KVA of the specified size and alignment, within the
> >    * vstart and vend.
> > @@ -1663,10 +1694,13 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
> >   		goto retry;
> >   	}
> > -	if (!(gfp_mask & __GFP_NOWARN) && printk_ratelimit())
> > +	if (!(gfp_mask & __GFP_NOWARN) && printk_ratelimit()) {
> >   		pr_warn("vmap allocation for size %lu failed: use vmalloc=<size> to increase size\n",
> >   			size);
> > +		dump_modules_free_space(vstart, vend);
> > +	}
> > +
> >   	kmem_cache_free(vmap_area_cachep, va);
> >   	return ERR_PTR(-EBUSY);
> >   }
> 
> Thanks!
> 
> I can spot the same module getting loaded over and over again concurrently
> from user space; each attempt only fails after all the allocations, once
> add_unformed_module() notices that the module is in fact already loaded and
> returns -EEXIST.
> 
> That looks quite inefficient. Here is how often user space tries to load the
> same module on that system. Note that I print *after* allocating module VMAP
> space.
> 
OK. That explains the problem :) Indeed it is inefficient. Allocating first
and only later figuring out that the module is already there looks weird.
Furthermore, an attack from user space could be mounted this way.
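
Something like this untested sketch -- run as root with a path to any 
.ko that matches the running kernel -- would trigger one module VMAP 
reservation per attempt:

#include <fcntl.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	int i;

	if (argc != 2) {
		fprintf(stderr, "usage: %s /path/to/module.ko\n", argv[0]);
		return 1;
	}

	/* Race many loaders; all but one fail with -EEXIST, yet each
	 * attempt first reserves module VMAP space in the kernel. */
	for (i = 0; i < 400; i++) {
		if (fork() == 0) {
			int fd = open(argv[1], O_RDONLY);

			if (fd >= 0)
				syscall(SYS_finit_module, fd, "", 0);
			_exit(0);
		}
	}

	while (wait(NULL) > 0)
		;
	return 0;
}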


> # dmesg | grep Loading | cut -d" " -f5 | sort | uniq -c
>     896 acpi_cpufreq
>       1 acpi_pad
>       1 acpi_power_meter
>       2 ahci
>       1 cdrom
>       2 compiled-in
>       1 coretemp
>      15 crc32c_intel
>     307 crc32_pclmul
>       1 crc64
>       1 crc64_rocksoft
>       1 crc64_rocksoft_generic
>      12 crct10dif_pclmul
>      16 dca
>       1 dm_log
>       1 dm_mirror
>       1 dm_mod
>       1 dm_region_hash
>       1 drm
>       1 drm_kms_helper
>       1 drm_shmem_helper
>       1 fat
>       1 fb_sys_fops
>      14 fjes
>       1 fuse
>     205 ghash_clmulni_intel
>       1 i2c_algo_bit
>       1 i2c_i801
>       1 i2c_smbus
>       4 i40e
>       4 ib_core
>       1 ib_uverbs
>       4 ice
>     403 intel_cstate
>       1 intel_pch_thermal
>       1 intel_powerclamp
>       1 intel_rapl_common
>       1 intel_rapl_msr
>     399 intel_uncore
>       1 intel_uncore_frequency
>       1 intel_uncore_frequency_common
>      64 ioatdma
>       1 ipmi_devintf
>       1 ipmi_msghandler
>       1 ipmi_si
>       1 ipmi_ssif
>       4 irdma
>     406 irqbypass
>       1 isst_if_common
>     165 isst_if_mbox_msr
>     300 kvm
>     408 kvm_intel
>       1 libahci
>       2 libata
>       1 libcrc32c
>     409 libnvdimm
>       8 Loading
>       1 lpc_ich
>       1 megaraid_sas
>       1 mei
>       1 mei_me
>       1 mgag200
>       1 nfit
>       1 pcspkr
>       1 qrtr
>     405 rapl
>       1 rfkill
>       1 sd_mod
>       2 sg
>     409 skx_edac
>       1 sr_mod
>       1 syscopyarea
>       1 sysfillrect
>       1 sysimgblt
>       1 t10_pi
>       1 uas
>       1 usb_storage
>       1 vfat
>       1 wmi
>       1 x86_pkg_temp_thermal
>       1 xfs
> 
> 
> For each of these loading requests, we'll reserve module VMAP space, and
> free it only once we realize later that the module was already loaded.
> 
> So with a lot of CPUs, we might end up with so many simultaneous attempts
> to load the same modules that we actually run out of module VMAP space.
> 
> I have a prototype patch that seems to fix this in module loading code.
> 
Good! I am glad the problem can be solved :)

--
Uladzislau Rezki



end of thread, other threads:[~2022-10-15  9:23 UTC | newest]

Thread overview: 10+ messages
2022-10-06 13:46 KASAN-related VMAP allocation errors in debug kernels with many logical CPUS David Hildenbrand
2022-10-06 15:35 ` Uladzislau Rezki
2022-10-06 16:12   ` David Hildenbrand
2022-10-07 15:34     ` Uladzislau Rezki
2022-10-10  6:56       ` David Hildenbrand
2022-10-10 12:19         ` Uladzislau Rezki
2022-10-11 19:52           ` David Hildenbrand
2022-10-12 16:36             ` Uladzislau Rezki
2022-10-13 16:21               ` David Hildenbrand
2022-10-15  9:23                 ` Uladzislau Rezki
