* [linux-next:master] [mm] 0fa2857d23: WARNING:at_mm/page_alloc.c:#__alloc_pages_noprof
@ 2024-06-24 8:49 kernel test robot
2024-06-24 12:05 ` Yosry Ahmed
0 siblings, 1 reply; 24+ messages in thread
From: kernel test robot @ 2024-06-24 8:49 UTC (permalink / raw)
To: Usama Arif
Cc: oe-lkp, lkp, Linux Memory Management List, Andrew Morton,
Chengming Zhou, Yosry Ahmed, Nhat Pham, David Hildenbrand,
Huang, Ying, Hugh Dickins, Johannes Weiner, Matthew Wilcox,
Shakeel Butt, Andi Kleen, linux-kernel, oliver.sang
Hello,
kernel test robot noticed "WARNING:at_mm/page_alloc.c:#__alloc_pages_noprof" on:
commit: 0fa2857d23aa170e5e28d13c467b303b0065aad8 ("mm: store zero pages to be swapped out in a bitmap")
https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
[test failed on linux-next/master f76698bd9a8ca01d3581236082d786e9a6b72bb7]
in testcase: vm-scalability
version: vm-scalability-x86_64-6f4ef16-0_20240303
with following parameters:
runtime: 300
thp_enabled: always
thp_defrag: always
nr_task: 32
nr_ssd: 1
priority: 1
test: swap-w-rand-mt
cpufreq_governor: performance
compiler: gcc-13
test machine: 128 threads 2 sockets Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz (Ice Lake) with 128G memory
(please refer to attached dmesg/kmsg for entire log/backtrace)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202406241651.963e3e78-oliver.sang@intel.com
[ 34.776816][ T2413] ------------[ cut here ]------------
[ 34.782497][ T2413] WARNING: CPU: 11 PID: 2413 at mm/page_alloc.c:4685 __alloc_pages_noprof (mm/page_alloc.c:4685 (discriminator 11))
[ 34.792245][ T2413] Modules linked in: btrfs blake2b_generic xor raid6_pq libcrc32c sd_mod t10_pi intel_rapl_msr intel_rapl_common crc64_rocksoft_generic crc64_rocksoft x86_pkg_temp_thermal crc64 intel_powerclamp sg coretemp binfmt_misc kvm_intel ipmi_ssif kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel sha512_ssse3 ahci ast libahci rapl drm_shmem_helper intel_cstate mei_me intel_th_gth ioatdma acpi_power_meter i2c_i801 intel_th_pci libata intel_uncore drm_kms_helper ipmi_si acpi_ipmi dax_hmem mei i2c_smbus intel_th intel_pch_thermal dca wmi ipmi_devintf ipmi_msghandler acpi_pad joydev drm fuse loop dm_mod ip_tables
[ 34.849370][ T2413] CPU: 11 PID: 2413 Comm: swapon Not tainted 6.10.0-rc4-00263-g0fa2857d23aa #1
[ 34.858458][ T2413] RIP: 0010:__alloc_pages_noprof (mm/page_alloc.c:4685 (discriminator 11))
[ 34.864602][ T2413] Code: 00 00 00 48 89 54 24 08 e9 83 fe ff ff 83 fd 0a 0f 86 f6 fd ff ff 80 3d 8a f4 d6 01 00 0f 85 7f fe ff ff c6 05 7d f4 d6 01 01 <0f> 0b e9 71 fe ff ff f7 c1 00 00 80 00 75 61 f7 c1 00 00 08 00 74
All code
========
0: 00 00 add %al,(%rax)
2: 00 48 89 add %cl,-0x77(%rax)
5: 54 push %rsp
6: 24 08 and $0x8,%al
8: e9 83 fe ff ff jmpq 0xfffffffffffffe90
d: 83 fd 0a cmp $0xa,%ebp
10: 0f 86 f6 fd ff ff jbe 0xfffffffffffffe0c
16: 80 3d 8a f4 d6 01 00 cmpb $0x0,0x1d6f48a(%rip) # 0x1d6f4a7
1d: 0f 85 7f fe ff ff jne 0xfffffffffffffea2
23: c6 05 7d f4 d6 01 01 movb $0x1,0x1d6f47d(%rip) # 0x1d6f4a7
2a:* 0f 0b ud2 <-- trapping instruction
2c: e9 71 fe ff ff jmpq 0xfffffffffffffea2
31: f7 c1 00 00 80 00 test $0x800000,%ecx
37: 75 61 jne 0x9a
39: f7 c1 00 00 08 00 test $0x80000,%ecx
3f: 74 .byte 0x74
Code starting with the faulting instruction
===========================================
0: 0f 0b ud2
2: e9 71 fe ff ff jmpq 0xfffffffffffffe78
7: f7 c1 00 00 80 00 test $0x800000,%ecx
d: 75 61 jne 0x70
f: f7 c1 00 00 08 00 test $0x80000,%ecx
15: 74 .byte 0x74
[ 34.884371][ T2413] RSP: 0018:ffa000000ce8fda8 EFLAGS: 00010246
[ 34.890619][ T2413] RAX: 0000000000000000 RBX: 0000000000040dc0 RCX: 0000000000000000
[ 34.898766][ T2413] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000040dc0
[ 34.906910][ T2413] RBP: 000000000000000b R08: ffa000000ce8fd44 R09: ff11000104e13bc0
[ 34.915074][ T2413] R10: ffa000000ce8feb0 R11: ffa0000023201000 R12: 0000000000000000
[ 34.923264][ T2413] R13: 0000000000000001 R14: 0000000000000dc0 R15: 0000000003200000
[ 34.931414][ T2413] FS: 00007f8ac1a03840(0000) GS:ff1100103e780000(0000) knlGS:0000000000000000
[ 34.940527][ T2413] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 34.947348][ T2413] CR2: 000056306078b000 CR3: 00000001307f4001 CR4: 0000000000771ef0
[ 34.955505][ T2413] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 34.963661][ T2413] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 34.971815][ T2413] PKRU: 55555554
[ 34.975551][ T2413] Call Trace:
[ 34.979030][ T2413] <TASK>
[ 34.982179][ T2413] ? __warn (kernel/panic.c:693)
[ 34.986427][ T2413] ? __alloc_pages_noprof (mm/page_alloc.c:4685 (discriminator 11))
[ 34.991965][ T2413] ? report_bug (lib/bug.c:180 lib/bug.c:219)
[ 34.996643][ T2413] ? handle_bug (arch/x86/kernel/traps.c:239)
[ 35.001163][ T2413] ? exc_invalid_op (arch/x86/kernel/traps.c:260 (discriminator 1))
[ 35.006011][ T2413] ? asm_exc_invalid_op (arch/x86/include/asm/idtentry.h:621)
[ 35.011233][ T2413] ? __alloc_pages_noprof (mm/page_alloc.c:4685 (discriminator 11))
[ 35.016765][ T2413] __kmalloc_large_node (mm/slub.c:4069)
[ 35.022043][ T2413] __kmalloc_noprof (arch/x86/include/asm/bitops.h:417 include/asm-generic/getorder.h:46 mm/slub.c:4113 mm/slub.c:4136)
[ 35.027066][ T2413] ? __do_sys_swapon (mm/swapfile.c:3173)
[ 35.032196][ T2413] ? __do_sys_swapon (mm/swapfile.c:3173)
[ 35.037290][ T2413] ? __do_sys_swapon (mm/swapfile.c:3167)
[ 35.042379][ T2413] __do_sys_swapon (mm/swapfile.c:3173)
[ 35.047300][ T2413] do_syscall_64 (arch/x86/entry/common.c:52 (discriminator 1) arch/x86/entry/common.c:83 (discriminator 1))
[ 35.051955][ T2413] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
[ 35.058002][ T2413] RIP: 0033:0x7f8ac1bcef97
[ 35.062571][ T2413] Code: 73 01 c3 48 8b 0d 69 2e 0d 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 b8 a7 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 39 2e 0d 00 f7 d8 64 89 01 48
All code
========
0: 73 01 jae 0x3
2: c3 retq
3: 48 8b 0d 69 2e 0d 00 mov 0xd2e69(%rip),%rcx # 0xd2e73
a: f7 d8 neg %eax
c: 64 89 01 mov %eax,%fs:(%rcx)
f: 48 83 c8 ff or $0xffffffffffffffff,%rax
13: c3 retq
14: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
1b: 00 00 00
1e: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
23: b8 a7 00 00 00 mov $0xa7,%eax
28: 0f 05 syscall
2a:* 48 3d 01 f0 ff ff cmp $0xfffffffffffff001,%rax <-- trapping instruction
30: 73 01 jae 0x33
32: c3 retq
33: 48 8b 0d 39 2e 0d 00 mov 0xd2e39(%rip),%rcx # 0xd2e73
3a: f7 d8 neg %eax
3c: 64 89 01 mov %eax,%fs:(%rcx)
3f: 48 rex.W
Code starting with the faulting instruction
===========================================
0: 48 3d 01 f0 ff ff cmp $0xfffffffffffff001,%rax
6: 73 01 jae 0x9
8: c3 retq
9: 48 8b 0d 39 2e 0d 00 mov 0xd2e39(%rip),%rcx # 0xd2e49
10: f7 d8 neg %eax
12: 64 89 01 mov %eax,%fs:(%rcx)
15: 48 rex.W
[ 35.063745][ T1492] is_virt=false
[ 35.082007][ T2413] RSP: 002b:00007fffa761ac08 EFLAGS: 00000246 ORIG_RAX: 00000000000000a7
[ 35.082010][ T2413] RAX: ffffffffffffffda RBX: 000056306077c190 RCX: 00007f8ac1bcef97
[ 35.082010][ T2413] RDX: 0000000000008001 RSI: 0000000000008001 RDI: 000056306077c190
[ 35.082011][ T2413] RBP: 0000000000008001 R08: 0000000000000ff6 R09: 0000000000001000
[ 35.082012][ T2413] R10: 4e45505355533253 R11: 0000000000000246 R12: 00007fffa761ae3c
[ 35.082012][ T2413] R13: 0000000000000001 R14: 0000003200000000 R15: 000056306077cfe0
[ 35.082014][ T2413] </TASK>
[ 35.082015][ T2413] ---[ end trace 0000000000000000 ]---
The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20240624/202406241651.963e3e78-oliver.sang@intel.com
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [linux-next:master] [mm] 0fa2857d23: WARNING:at_mm/page_alloc.c:#__alloc_pages_noprof
2024-06-24 8:49 [linux-next:master] [mm] 0fa2857d23: WARNING:at_mm/page_alloc.c:#__alloc_pages_noprof kernel test robot
@ 2024-06-24 12:05 ` Yosry Ahmed
2024-06-24 13:06 ` Usama Arif
2024-06-24 18:33 ` Matthew Wilcox
0 siblings, 2 replies; 24+ messages in thread
From: Yosry Ahmed @ 2024-06-24 12:05 UTC (permalink / raw)
To: kernel test robot
Cc: Usama Arif, oe-lkp, lkp, Linux Memory Management List,
Andrew Morton, Chengming Zhou, Nhat Pham, David Hildenbrand,
Huang, Ying, Hugh Dickins, Johannes Weiner, Matthew Wilcox,
Shakeel Butt, Andi Kleen, linux-kernel
On Mon, Jun 24, 2024 at 1:49 AM kernel test robot <oliver.sang@intel.com> wrote:
>
>
>
> Hello,
>
> kernel test robot noticed "WARNING:at_mm/page_alloc.c:#__alloc_pages_noprof" on:
>
> commit: 0fa2857d23aa170e5e28d13c467b303b0065aad8 ("mm: store zero pages to be swapped out in a bitmap")
> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
This is coming from WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp), and
is triggered by the new bitmap_zalloc() call in the swapon path. For a
sufficiently large swapfile, bitmap_zalloc() (which uses kmalloc()
under the hood) cannot be used to allocate the bitmap.
Usama, I think we want to use a variant of kvmalloc() here.
* Re: [linux-next:master] [mm] 0fa2857d23: WARNING:at_mm/page_alloc.c:#__alloc_pages_noprof
2024-06-24 12:05 ` Yosry Ahmed
@ 2024-06-24 13:06 ` Usama Arif
2024-06-24 15:26 ` Hugh Dickins
2024-06-24 18:33 ` Matthew Wilcox
1 sibling, 1 reply; 24+ messages in thread
From: Usama Arif @ 2024-06-24 13:06 UTC (permalink / raw)
To: Yosry Ahmed, kernel test robot
Cc: oe-lkp, lkp, Linux Memory Management List, Andrew Morton,
Chengming Zhou, Nhat Pham, David Hildenbrand, Huang, Ying,
Hugh Dickins, Johannes Weiner, Matthew Wilcox, Shakeel Butt,
Andi Kleen, linux-kernel
On 24/06/2024 15:05, Yosry Ahmed wrote:
> On Mon, Jun 24, 2024 at 1:49 AM kernel test robot <oliver.sang@intel.com> wrote:
>>
>>
>> Hello,
>>
>> kernel test robot noticed "WARNING:at_mm/page_alloc.c:#__alloc_pages_noprof" on:
>>
>> commit: 0fa2857d23aa170e5e28d13c467b303b0065aad8 ("mm: store zero pages to be swapped out in a bitmap")
>> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
> This is coming from WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp), and
> is triggered by the new bitmap_zalloc() call in the swapon path. For a
> sufficiently large swapfile, bitmap_zalloc() (which uses kmalloc()
> under the hood) cannot be used to allocate the bitmap.
>
> Usama, I think we want to use a variant of kvmalloc() here.
Yes, I am just testing with the diff below now. The patch is not in mm-stable
yet, so I will just send another revision with this diff included. Thanks!
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 0b8270359bcf..2263f71baa31 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2643,7 +2643,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *,
specialfile)
free_percpu(p->cluster_next_cpu);
p->cluster_next_cpu = NULL;
vfree(swap_map);
- bitmap_free(p->zeromap);
+ kvfree(p->zeromap);
kvfree(cluster_info);
/* Destroy swap account information */
swap_cgroup_swapoff(p->type);
@@ -3170,7 +3170,7 @@ SYSCALL_DEFINE2(swapon, const char __user *,
specialfile, int, swap_flags)
goto bad_swap_unlock_inode;
}
- p->zeromap = bitmap_zalloc(maxpages, GFP_KERNEL);
+ p->zeromap = kvzalloc(DIV_ROUND_UP(maxpages, 8), GFP_KERNEL);
if (!p->zeromap) {
error = -ENOMEM;
goto bad_swap_unlock_inode;
* Re: [linux-next:master] [mm] 0fa2857d23: WARNING:at_mm/page_alloc.c:#__alloc_pages_noprof
2024-06-24 13:06 ` Usama Arif
@ 2024-06-24 15:26 ` Hugh Dickins
2024-06-24 15:39 ` Usama Arif
0 siblings, 1 reply; 24+ messages in thread
From: Hugh Dickins @ 2024-06-24 15:26 UTC (permalink / raw)
To: Usama Arif
Cc: Yosry Ahmed, kernel test robot, oe-lkp, lkp,
Linux Memory Management List, Andrew Morton, Chengming Zhou,
Nhat Pham, David Hildenbrand, Huang, Ying, Hugh Dickins,
Johannes Weiner, Matthew Wilcox, Shakeel Butt, Andi Kleen,
linux-kernel
[-- Attachment #1: Type: text/plain, Size: 2392 bytes --]
On Mon, 24 Jun 2024, Usama Arif wrote:
> On 24/06/2024 15:05, Yosry Ahmed wrote:
> > On Mon, Jun 24, 2024 at 1:49 AM kernel test robot <oliver.sang@intel.com>
> > wrote:
> >>
> >>
> >> Hello,
> >>
> >> kernel test robot noticed
> >> "WARNING:at_mm/page_alloc.c:#__alloc_pages_noprof" on:
> >>
> >> commit: 0fa2857d23aa170e5e28d13c467b303b0065aad8 ("mm: store zero pages to
> >> be swapped out in a bitmap")
> >> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
> > This is coming from WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp), and
> > is triggered by the new bitmap_zalloc() call in the swapon path. For a
> > sufficiently large swapfile, bitmap_zalloc() (which uses kmalloc()
> > under the hood) cannot be used to allocate the bitmap.
> >
> > Usama, I think we want to use a variant of kvmalloc() here.
Yes, I hit the same problem with swapon in my testing,
and had been intending to send a patch.
>
> Yes, just testing with below diff now. The patch is not in mm-stable yet, so
> will just send another revision with below diff included. Thanks!
>
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 0b8270359bcf..2263f71baa31 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -2643,7 +2643,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *,
> specialfile)
> free_percpu(p->cluster_next_cpu);
> p->cluster_next_cpu = NULL;
> vfree(swap_map);
> - bitmap_free(p->zeromap);
> + kvfree(p->zeromap);
Yes.
> kvfree(cluster_info);
> /* Destroy swap account information */
> swap_cgroup_swapoff(p->type);
> @@ -3170,7 +3170,7 @@ SYSCALL_DEFINE2(swapon, const char __user *,
> specialfile, int, swap_flags)
> goto bad_swap_unlock_inode;
> }
>
> - p->zeromap = bitmap_zalloc(maxpages, GFP_KERNEL);
> + p->zeromap = kvzalloc(DIV_ROUND_UP(maxpages, 8), GFP_KERNEL);
No, 8 is not right for 32-bit kernels. I think you want
p->zeromap = kvzalloc(BITS_TO_LONGS(maxpages), GFP_KERNEL);
but please check it carefully, I'm easily confused by such conversions.
Hugh
> if (!p->zeromap) {
> error = -ENOMEM;
> goto bad_swap_unlock_inode;
* Re: [linux-next:master] [mm] 0fa2857d23: WARNING:at_mm/page_alloc.c:#__alloc_pages_noprof
2024-06-24 15:26 ` Hugh Dickins
@ 2024-06-24 15:39 ` Usama Arif
2024-06-24 15:55 ` Hugh Dickins
2024-06-24 16:56 ` Yosry Ahmed
0 siblings, 2 replies; 24+ messages in thread
From: Usama Arif @ 2024-06-24 15:39 UTC (permalink / raw)
To: Hugh Dickins
Cc: Yosry Ahmed, kernel test robot, oe-lkp, lkp,
Linux Memory Management List, Andrew Morton, Chengming Zhou,
Nhat Pham, David Hildenbrand, Huang, Ying, Johannes Weiner,
Matthew Wilcox, Shakeel Butt, Andi Kleen, linux-kernel
On 24/06/2024 18:26, Hugh Dickins wrote:
> On Mon, 24 Jun 2024, Usama Arif wrote:
>> On 24/06/2024 15:05, Yosry Ahmed wrote:
>>> On Mon, Jun 24, 2024 at 1:49 AM kernel test robot <oliver.sang@intel.com>
>>> wrote:
>>>>
>>>> Hello,
>>>>
>>>> kernel test robot noticed
>>>> "WARNING:at_mm/page_alloc.c:#__alloc_pages_noprof" on:
>>>>
>>>> commit: 0fa2857d23aa170e5e28d13c467b303b0065aad8 ("mm: store zero pages to
>>>> be swapped out in a bitmap")
>>>> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
>>> This is coming from WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp), and
>>> is triggered by the new bitmap_zalloc() call in the swapon path. For a
>>> sufficiently large swapfile, bitmap_zalloc() (which uses kmalloc()
>>> under the hood) cannot be used to allocate the bitmap.
>>>
>>> Usama, I think we want to use a variant of kvmalloc() here.
> Yes, I hit the same problem with swapon in my testing,
> and had been intending to send a patch.
>
>> Yes, just testing with below diff now. The patch is not in mm-stable yet, so
>> will just send another revision with below diff included. Thanks!
>>
>>
>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index 0b8270359bcf..2263f71baa31 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -2643,7 +2643,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *,
>> specialfile)
>> free_percpu(p->cluster_next_cpu);
>> p->cluster_next_cpu = NULL;
>> vfree(swap_map);
>> - bitmap_free(p->zeromap);
>> + kvfree(p->zeromap);
> Yes.
>
>> kvfree(cluster_info);
>> /* Destroy swap account information */
>> swap_cgroup_swapoff(p->type);
>> @@ -3170,7 +3170,7 @@ SYSCALL_DEFINE2(swapon, const char __user *,
>> specialfile, int, swap_flags)
>> goto bad_swap_unlock_inode;
>> }
>>
>> - p->zeromap = bitmap_zalloc(maxpages, GFP_KERNEL);
>> + p->zeromap = kvzalloc(DIV_ROUND_UP(maxpages, 8), GFP_KERNEL);
> No, 8 is not right for 32-bit kernels. I think you want
> p->zeromap = kvzalloc(BITS_TO_LONGS(maxpages), GFP_KERNEL);
> but please check it carefully, I'm easily confused by such conversions.
>
> Hugh
Ah yes, I didn't take 32-bit kernels into account. I think it's supposed to be
p->zeromap = kvzalloc(BITS_TO_LONGS(maxpages) * sizeof(unsigned long),
GFP_KERNEL);
if using BITS_TO_LONGS.
I will wait some time in case there are more comments, and will then send
out another version.
Thanks!
>> if (!p->zeromap) {
>> error = -ENOMEM;
>> goto bad_swap_unlock_inode;
* Re: [linux-next:master] [mm] 0fa2857d23: WARNING:at_mm/page_alloc.c:#__alloc_pages_noprof
2024-06-24 15:39 ` Usama Arif
@ 2024-06-24 15:55 ` Hugh Dickins
2024-06-24 16:56 ` Yosry Ahmed
1 sibling, 0 replies; 24+ messages in thread
From: Hugh Dickins @ 2024-06-24 15:55 UTC (permalink / raw)
To: Usama Arif
Cc: Hugh Dickins, Yosry Ahmed, kernel test robot, oe-lkp, lkp,
Linux Memory Management List, Andrew Morton, Chengming Zhou,
Nhat Pham, David Hildenbrand, Huang, Ying, Johannes Weiner,
Matthew Wilcox, Shakeel Butt, Andi Kleen, linux-kernel
[-- Attachment #1: Type: text/plain, Size: 3113 bytes --]
On Mon, 24 Jun 2024, Usama Arif wrote:
> On 24/06/2024 18:26, Hugh Dickins wrote:
> > On Mon, 24 Jun 2024, Usama Arif wrote:
> >> On 24/06/2024 15:05, Yosry Ahmed wrote:
> >>> On Mon, Jun 24, 2024 at 1:49 AM kernel test robot <oliver.sang@intel.com>
> >>> wrote:
> >>>>
> >>>> Hello,
> >>>>
> >>>> kernel test robot noticed
> >>>> "WARNING:at_mm/page_alloc.c:#__alloc_pages_noprof" on:
> >>>>
> >>>> commit: 0fa2857d23aa170e5e28d13c467b303b0065aad8 ("mm: store zero pages
> >>>> to
> >>>> be swapped out in a bitmap")
> >>>> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
> >>> This is coming from WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp), and
> >>> is triggered by the new bitmap_zalloc() call in the swapon path. For a
> >>> sufficiently large swapfile, bitmap_zalloc() (which uses kmalloc()
> >>> under the hood) cannot be used to allocate the bitmap.
> >>>
> >>> Usama, I think we want to use a variant of kvmalloc() here.
> > Yes, I hit the same problem with swapon in my testing,
> > and had been intending to send a patch.
> >
> >> Yes, just testing with below diff now. The patch is not in mm-stable yet,
> >> so
> >> will just send another revision with below diff included. Thanks!
> >>
> >>
> >> diff --git a/mm/swapfile.c b/mm/swapfile.c
> >> index 0b8270359bcf..2263f71baa31 100644
> >> --- a/mm/swapfile.c
> >> +++ b/mm/swapfile.c
> >> @@ -2643,7 +2643,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *,
> >> specialfile)
> >> free_percpu(p->cluster_next_cpu);
> >> p->cluster_next_cpu = NULL;
> >> vfree(swap_map);
> >> - bitmap_free(p->zeromap);
> >> + kvfree(p->zeromap);
> > Yes.
> >
> >> kvfree(cluster_info);
> >> /* Destroy swap account information */
> >> swap_cgroup_swapoff(p->type);
> >> @@ -3170,7 +3170,7 @@ SYSCALL_DEFINE2(swapon, const char __user *,
> >> specialfile, int, swap_flags)
> >> goto bad_swap_unlock_inode;
> >> }
> >>
> >> - p->zeromap = bitmap_zalloc(maxpages, GFP_KERNEL);
> >> + p->zeromap = kvzalloc(DIV_ROUND_UP(maxpages, 8), GFP_KERNEL);
> > No, 8 is not right for 32-bit kernels. I think you want
> > p->zeromap = kvzalloc(BITS_TO_LONGS(maxpages), GFP_KERNEL);
> > but please check it carefully, I'm easily confused by such conversions.
> >
> > Hugh
>
> Ah yes, didnt take into account 32-bit kernel. I think its supposed to be
>
> p->zeromap = kvzalloc(BITS_TO_LONGS(maxpages) * sizeof(unsigned long),
> GFP_KERNEL);
>
> if using BITS_TO_LONGS.
No doubt you're right (I'm glad I already admitted to being confused).
Personally, I'd just say sizeof(long), but whatever you prefer.
Hugh
>
> Will wait sometime incase there are more comments and will send out another
> version.
>
> Thanks!
>
> >> if (!p->zeromap) {
> >> error = -ENOMEM;
> >> goto bad_swap_unlock_inode;
* Re: [linux-next:master] [mm] 0fa2857d23: WARNING:at_mm/page_alloc.c:#__alloc_pages_noprof
2024-06-24 15:39 ` Usama Arif
2024-06-24 15:55 ` Hugh Dickins
@ 2024-06-24 16:56 ` Yosry Ahmed
2024-06-24 17:26 ` Usama Arif
1 sibling, 1 reply; 24+ messages in thread
From: Yosry Ahmed @ 2024-06-24 16:56 UTC (permalink / raw)
To: Usama Arif
Cc: Hugh Dickins, kernel test robot, oe-lkp, lkp,
Linux Memory Management List, Andrew Morton, Chengming Zhou,
Nhat Pham, David Hildenbrand, Huang, Ying, Johannes Weiner,
Matthew Wilcox, Shakeel Butt, Andi Kleen, linux-kernel,
Vlastimil Babka (SUSE), Yury Norov, Rasmus Villemoes
[..]
> >>
> >> - p->zeromap = bitmap_zalloc(maxpages, GFP_KERNEL);
> >> + p->zeromap = kvzalloc(DIV_ROUND_UP(maxpages, 8), GFP_KERNEL);
> > No, 8 is not right for 32-bit kernels. I think you want
> > p->zeromap = kvzalloc(BITS_TO_LONGS(maxpages), GFP_KERNEL);
> > but please check it carefully, I'm easily confused by such conversions.
> >
> > Hugh
>
> Ah yes, didnt take into account 32-bit kernel. I think its supposed to be
>
> p->zeromap = kvzalloc(BITS_TO_LONGS(maxpages) * sizeof(unsigned long),
> GFP_KERNEL);
You can do something similar to bitmap_zalloc() and use:
kvmalloc_array(BITS_TO_LONGS(nbits), sizeof(unsigned long), GFP_KERNEL
| __GFP_ZERO)
I don't see a kvzalloc_array() variant to use directly, but it should
be trivial to add it. I can see other users of kvmalloc_array() that
pass in __GFP_ZERO (e.g. fs/ntfs3/bitmap.c).
Alternatively, you could take it a step further and add bitmap_kvzalloc(),
assuming the maintainers are open to that.
The main reason I want to avoid doing the multiplication inline is
because kvmalloc_array() has a check_mul_overflow() check, and I
assume it must have been added for a reason.
* Re: [linux-next:master] [mm] 0fa2857d23: WARNING:at_mm/page_alloc.c:#__alloc_pages_noprof
2024-06-24 16:56 ` Yosry Ahmed
@ 2024-06-24 17:26 ` Usama Arif
2024-06-24 17:31 ` Yosry Ahmed
0 siblings, 1 reply; 24+ messages in thread
From: Usama Arif @ 2024-06-24 17:26 UTC (permalink / raw)
To: Yosry Ahmed, Hugh Dickins
Cc: kernel test robot, oe-lkp, lkp, Linux Memory Management List,
Andrew Morton, Chengming Zhou, Nhat Pham, David Hildenbrand,
Huang, Ying, Johannes Weiner, Matthew Wilcox, Shakeel Butt,
Andi Kleen, linux-kernel, Vlastimil Babka (SUSE), Yury Norov,
Rasmus Villemoes
On 24/06/2024 19:56, Yosry Ahmed wrote:
> [..]
>>>> - p->zeromap = bitmap_zalloc(maxpages, GFP_KERNEL);
>>>> + p->zeromap = kvzalloc(DIV_ROUND_UP(maxpages, 8), GFP_KERNEL);
>>> No, 8 is not right for 32-bit kernels. I think you want
>>> p->zeromap = kvzalloc(BITS_TO_LONGS(maxpages), GFP_KERNEL);
>>> but please check it carefully, I'm easily confused by such conversions.
>>>
>>> Hugh
>> Ah yes, didnt take into account 32-bit kernel. I think its supposed to be
>>
>> p->zeromap = kvzalloc(BITS_TO_LONGS(maxpages) * sizeof(unsigned long),
>> GFP_KERNEL);
> You can do something similar to bitmap_zalloc() and use:
>
> kvmalloc_array(BITS_TO_LONGS(nbits), sizeof(unsigned long), GFP_KERNEL
> | __GFP_ZERO)
>
> I don't see a kvzalloc_array() variant to use directly, but it should
> be trivial to add it. I can see other users of kvmalloc_array() that
> pass in __GFP_ZERO (e.g. fs/ntfs3/bitmap.c).
>
> , or you could take it a step further and add bitmap_kvzalloc(),
> assuming the maintainers are open to that.
Thanks! bitmap_kvzalloc makes the most sense to me. It doesn't make sense
for bitmaps to be limited to MAX_PAGE_ORDER size. I can add the patch
below at the start of the series and use it in the patch for the
zeropage swap optimization.
bitmap: add support for virtually contiguous bitmap
The current bitmap_zalloc() API limits the allocation to MAX_PAGE_ORDER,
which prevents larger-order bitmap allocations. Introduce
bitmap_kvzalloc(), which allows larger bitmap allocations:
kvmalloc_array() still attempts to allocate physically contiguous memory,
but upon failure falls back to a non-contiguous (vmalloc) allocation.
Suggested-by: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
diff --git a/include/linux/bitmap.h b/include/linux/bitmap.h
index 8c4768c44a01..881c2ff2e834 100644
--- a/include/linux/bitmap.h
+++ b/include/linux/bitmap.h
@@ -131,9 +131,11 @@ struct device;
*/
unsigned long *bitmap_alloc(unsigned int nbits, gfp_t flags);
unsigned long *bitmap_zalloc(unsigned int nbits, gfp_t flags);
+unsigned long *bitmap_kvzalloc(unsigned int nbits, gfp_t flags);
unsigned long *bitmap_alloc_node(unsigned int nbits, gfp_t flags, int
node);
unsigned long *bitmap_zalloc_node(unsigned int nbits, gfp_t flags, int
node);
void bitmap_free(const unsigned long *bitmap);
+void bitmap_kvfree(const unsigned long *bitmap);
DEFINE_FREE(bitmap, unsigned long *, if (_T) bitmap_free(_T))
diff --git a/lib/bitmap.c b/lib/bitmap.c
index b97692854966..eabbfb85fb45 100644
--- a/lib/bitmap.c
+++ b/lib/bitmap.c
@@ -727,6 +727,13 @@ unsigned long *bitmap_zalloc(unsigned int nbits,
gfp_t flags)
}
EXPORT_SYMBOL(bitmap_zalloc);
+unsigned long *bitmap_kvzalloc(unsigned int nbits, gfp_t flags)
+{
+ return kvmalloc_array(BITS_TO_LONGS(nbits), sizeof(unsigned long),
+ flags | __GFP_ZERO);
+}
+EXPORT_SYMBOL(bitmap_zalloc);
+
unsigned long *bitmap_alloc_node(unsigned int nbits, gfp_t flags, int
node)
{
return kmalloc_array_node(BITS_TO_LONGS(nbits), sizeof(unsigned
long),
@@ -746,6 +753,12 @@ void bitmap_free(const unsigned long *bitmap)
}
EXPORT_SYMBOL(bitmap_free);
+void bitmap_kvfree(const unsigned long *bitmap)
+{
+ kvfree(bitmap);
+}
+EXPORT_SYMBOL(bitmap_kvfree);
+
static void devm_bitmap_free(void *data)
{
unsigned long *bitmap = data;
>
> The main reason I want to avoid doing the multiplication inline is
> because kvmalloc_array() has a check_mul_overflow() check, and I
> assume it must have been added for a reason.
* Re: [linux-next:master] [mm] 0fa2857d23: WARNING:at_mm/page_alloc.c:#__alloc_pages_noprof
2024-06-24 17:26 ` Usama Arif
@ 2024-06-24 17:31 ` Yosry Ahmed
2024-06-24 18:26 ` Usama Arif
0 siblings, 1 reply; 24+ messages in thread
From: Yosry Ahmed @ 2024-06-24 17:31 UTC (permalink / raw)
To: Usama Arif
Cc: Hugh Dickins, kernel test robot, oe-lkp, lkp,
Linux Memory Management List, Andrew Morton, Chengming Zhou,
Nhat Pham, David Hildenbrand, Huang, Ying, Johannes Weiner,
Matthew Wilcox, Shakeel Butt, Andi Kleen, linux-kernel,
Vlastimil Babka (SUSE), Yury Norov, Rasmus Villemoes
On Mon, Jun 24, 2024 at 10:26 AM Usama Arif <usamaarif642@gmail.com> wrote:
>
>
> On 24/06/2024 19:56, Yosry Ahmed wrote:
> > [..]
> >>>> - p->zeromap = bitmap_zalloc(maxpages, GFP_KERNEL);
> >>>> + p->zeromap = kvzalloc(DIV_ROUND_UP(maxpages, 8), GFP_KERNEL);
> >>> No, 8 is not right for 32-bit kernels. I think you want
> >>> p->zeromap = kvzalloc(BITS_TO_LONGS(maxpages), GFP_KERNEL);
> >>> but please check it carefully, I'm easily confused by such conversions.
> >>>
> >>> Hugh
> >> Ah yes, didnt take into account 32-bit kernel. I think its supposed to be
> >>
> >> p->zeromap = kvzalloc(BITS_TO_LONGS(maxpages) * sizeof(unsigned long),
> >> GFP_KERNEL);
> > You can do something similar to bitmap_zalloc() and use:
> >
> > kvmalloc_array(BITS_TO_LONGS(nbits), sizeof(unsigned long), GFP_KERNEL
> > | __GFP_ZERO)
> >
> > I don't see a kvzalloc_array() variant to use directly, but it should
> > be trivial to add it. I can see other users of kvmalloc_array() that
> > pass in __GFP_ZERO (e.g. fs/ntfs3/bitmap.c).
> >
> > , or you could take it a step further and add bitmap_kvzalloc(),
> > assuming the maintainers are open to that.
>
> Thanks! bitmap_kvzalloc makes most sense to me. It doesnt make sense
> that bitmap should only be limited to MAX_PAGE_ORDER size. I can add
> this patch below at the start of the series and use it in the patch for
> zeropage swap optimization.
>
>
> bitmap: add support for virtually contiguous bitmap
>
> The current bitmap_zalloc API limits the allocation to MAX_PAGE_ORDER,
> which prevents larger order bitmap allocations. Introduce
> bitmap_kvzalloc that will allow larger allocations of bitmap.
> kvmalloc_array still attempts to allocate physically contiguous memory,
> but upon failure, falls back to non-contiguous (vmalloc) allocation.
>
> Suggested-by: Yosry Ahmed <yosryahmed@google.com>
> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
>
LGTM with a small fix below.
> diff --git a/include/linux/bitmap.h b/include/linux/bitmap.h
> index 8c4768c44a01..881c2ff2e834 100644
> --- a/include/linux/bitmap.h
> +++ b/include/linux/bitmap.h
> @@ -131,9 +131,11 @@ struct device;
> */
> unsigned long *bitmap_alloc(unsigned int nbits, gfp_t flags);
> unsigned long *bitmap_zalloc(unsigned int nbits, gfp_t flags);
> +unsigned long *bitmap_kvzalloc(unsigned int nbits, gfp_t flags);
> unsigned long *bitmap_alloc_node(unsigned int nbits, gfp_t flags, int
> node);
> unsigned long *bitmap_zalloc_node(unsigned int nbits, gfp_t flags, int
> node);
> void bitmap_free(const unsigned long *bitmap);
> +void bitmap_kvfree(const unsigned long *bitmap);
>
> DEFINE_FREE(bitmap, unsigned long *, if (_T) bitmap_free(_T))
>
> diff --git a/lib/bitmap.c b/lib/bitmap.c
> index b97692854966..eabbfb85fb45 100644
> --- a/lib/bitmap.c
> +++ b/lib/bitmap.c
> @@ -727,6 +727,13 @@ unsigned long *bitmap_zalloc(unsigned int nbits,
> gfp_t flags)
> }
> EXPORT_SYMBOL(bitmap_zalloc);
>
> +unsigned long *bitmap_kvzalloc(unsigned int nbits, gfp_t flags)
> +{
> + return kvmalloc_array(BITS_TO_LONGS(nbits), sizeof(unsigned long),
> + flags | __GFP_ZERO);
> +}
> +EXPORT_SYMBOL(bitmap_zalloc);
EXPORT_SYMBOL(bitmap_kvzalloc)*
* Re: [linux-next:master] [mm] 0fa2857d23: WARNING:at_mm/page_alloc.c:#__alloc_pages_noprof
2024-06-24 17:31 ` Yosry Ahmed
@ 2024-06-24 18:26 ` Usama Arif
2024-06-27 11:05 ` Usama Arif
0 siblings, 1 reply; 24+ messages in thread
From: Usama Arif @ 2024-06-24 18:26 UTC (permalink / raw)
To: Yosry Ahmed, Hugh Dickins, Yury Norov, Rasmus Villemoes
Cc: kernel test robot, oe-lkp, lkp, Linux Memory Management List,
Andrew Morton, Chengming Zhou, Nhat Pham, David Hildenbrand,
Huang, Ying, Johannes Weiner, Matthew Wilcox, Shakeel Butt,
Andi Kleen, linux-kernel, Vlastimil Babka (SUSE), Yury Norov,
Rasmus Villemoes
On 24/06/2024 20:31, Yosry Ahmed wrote:
> On Mon, Jun 24, 2024 at 10:26 AM Usama Arif <usamaarif642@gmail.com> wrote:
>>
>> On 24/06/2024 19:56, Yosry Ahmed wrote:
>>> [..]
>>>>>> - p->zeromap = bitmap_zalloc(maxpages, GFP_KERNEL);
>>>>>> + p->zeromap = kvzalloc(DIV_ROUND_UP(maxpages, 8), GFP_KERNEL);
>>>>> No, 8 is not right for 32-bit kernels. I think you want
>>>>> p->zeromap = kvzalloc(BITS_TO_LONGS(maxpages), GFP_KERNEL);
>>>>> but please check it carefully, I'm easily confused by such conversions.
>>>>>
>>>>> Hugh
>>>> Ah yes, didnt take into account 32-bit kernel. I think its supposed to be
>>>>
>>>> p->zeromap = kvzalloc(BITS_TO_LONGS(maxpages) * sizeof(unsigned long),
>>>> GFP_KERNEL);
>>> You can do something similar to bitmap_zalloc() and use:
>>>
>>> kvmalloc_array(BITS_TO_LONGS(nbits), sizeof(unsigned long), GFP_KERNEL
>>> | __GFP_ZERO)
>>>
>>> I don't see a kvzalloc_array() variant to use directly, but it should
>>> be trivial to add it. I can see other users of kvmalloc_array() that
>>> pass in __GFP_ZERO (e.g. fs/ntfs3/bitmap.c).
>>>
>>> Or you could take it a step further and add bitmap_kvzalloc(),
>>> assuming the maintainers are open to that.
>> Thanks! bitmap_kvzalloc makes the most sense to me. It doesn't make
>> sense that bitmaps should be limited to MAX_PAGE_ORDER size. I can add
>> the patch below at the start of the series and use it in the patch for
>> the zeropage swap optimization.
>>
>>
>> bitmap: add support for virtually contiguous bitmap
>>
>> The current bitmap_zalloc() API limits the allocation to MAX_PAGE_ORDER,
>> which prevents larger-order bitmap allocations. Introduce
>> bitmap_kvzalloc(), which allows larger bitmap allocations.
>> kvmalloc_array() still attempts to allocate physically contiguous memory,
>> but upon failure falls back to a virtually contiguous (vmalloc) allocation.
>>
>> Suggested-by: Yosry Ahmed <yosryahmed@google.com>
>> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
>>
> LGTM with a small fix below.
>
>> diff --git a/include/linux/bitmap.h b/include/linux/bitmap.h
>> index 8c4768c44a01..881c2ff2e834 100644
>> --- a/include/linux/bitmap.h
>> +++ b/include/linux/bitmap.h
>> @@ -131,9 +131,11 @@ struct device;
>> */
>> unsigned long *bitmap_alloc(unsigned int nbits, gfp_t flags);
>> unsigned long *bitmap_zalloc(unsigned int nbits, gfp_t flags);
>> +unsigned long *bitmap_kvzalloc(unsigned int nbits, gfp_t flags);
>> unsigned long *bitmap_alloc_node(unsigned int nbits, gfp_t flags, int node);
>> unsigned long *bitmap_zalloc_node(unsigned int nbits, gfp_t flags, int node);
>> void bitmap_free(const unsigned long *bitmap);
>> +void bitmap_kvfree(const unsigned long *bitmap);
>>
>> DEFINE_FREE(bitmap, unsigned long *, if (_T) bitmap_free(_T))
>>
>> diff --git a/lib/bitmap.c b/lib/bitmap.c
>> index b97692854966..eabbfb85fb45 100644
>> --- a/lib/bitmap.c
>> +++ b/lib/bitmap.c
>> @@ -727,6 +727,13 @@ unsigned long *bitmap_zalloc(unsigned int nbits, gfp_t flags)
>> }
>> EXPORT_SYMBOL(bitmap_zalloc);
>>
>> +unsigned long *bitmap_kvzalloc(unsigned int nbits, gfp_t flags)
>> +{
>> + return kvmalloc_array(BITS_TO_LONGS(nbits), sizeof(unsigned long),
>> + flags | __GFP_ZERO);
>> +}
>> +EXPORT_SYMBOL(bitmap_zalloc);
> EXPORT_SYMBOL(bitmap_kvzalloc)*
Actually, does it make more sense to change the behaviour of the current
APIs as below, instead of the above patch? Or is there an expectation that
the current bitmap API is supposed to work only on physically contiguous
bits?
I believe that in the kernel, if an allocation/free function starts with
'k' it is physically contiguous, and if it starts with 'kv' it is
physically contiguous if possible and virtually contiguous otherwise. The
bitmap functions carry neither prefix, so we could change the current
implementation. I believe it would not impact current users of these
functions: the first attempt is still physically contiguous, which is how
they work today, and only upon failure would the allocation fall back to
virtually contiguous memory. It would also extend the current bitmap API
to allocations larger than MAX_PAGE_ORDER.
Yury Norov and Rasmus Villemoes, any views on this?
Thanks
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 7247e217e21b..ad771dc81afa 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -804,6 +804,7 @@ kvmalloc_array_node_noprof(size_t n, size_t size, gfp_t flags, int node)
#define kvcalloc_node_noprof(_n,_s,_f,_node) kvmalloc_array_node_noprof(_n,_s,(_f)|__GFP_ZERO,_node)
#define kvcalloc_noprof(...) kvcalloc_node_noprof(__VA_ARGS__, NUMA_NO_NODE)
+#define kvmalloc_array_node(...) alloc_hooks(kvmalloc_array_node_noprof(__VA_ARGS__))
#define kvmalloc_array(...) alloc_hooks(kvmalloc_array_noprof(__VA_ARGS__))
#define kvcalloc_node(...) alloc_hooks(kvcalloc_node_noprof(__VA_ARGS__))
#define kvcalloc(...) alloc_hooks(kvcalloc_noprof(__VA_ARGS__))
diff --git a/lib/bitmap.c b/lib/bitmap.c
index b97692854966..272164dcbef1 100644
--- a/lib/bitmap.c
+++ b/lib/bitmap.c
@@ -716,7 +716,7 @@ void bitmap_fold(unsigned long *dst, const unsigned long *orig,
unsigned long *bitmap_alloc(unsigned int nbits, gfp_t flags)
{
- return kmalloc_array(BITS_TO_LONGS(nbits), sizeof(unsigned long),
+ return kvmalloc_array(BITS_TO_LONGS(nbits), sizeof(unsigned long),
flags);
}
EXPORT_SYMBOL(bitmap_alloc);
@@ -729,7 +729,7 @@ EXPORT_SYMBOL(bitmap_zalloc);
unsigned long *bitmap_alloc_node(unsigned int nbits, gfp_t flags, int node)
{
- return kmalloc_array_node(BITS_TO_LONGS(nbits), sizeof(unsigned long),
+ return kvmalloc_array_node(BITS_TO_LONGS(nbits), sizeof(unsigned long),
flags, node);
}
EXPORT_SYMBOL(bitmap_alloc_node);
@@ -742,7 +742,7 @@ EXPORT_SYMBOL(bitmap_zalloc_node);
void bitmap_free(const unsigned long *bitmap)
{
- kfree(bitmap);
+ kvfree(bitmap);
}
EXPORT_SYMBOL(bitmap_free);
* Re: [linux-next:master] [mm] 0fa2857d23: WARNING:at_mm/page_alloc.c:#__alloc_pages_noprof
2024-06-24 12:05 ` Yosry Ahmed
2024-06-24 13:06 ` Usama Arif
@ 2024-06-24 18:33 ` Matthew Wilcox
2024-06-24 18:50 ` Usama Arif
2024-06-24 18:53 ` Yosry Ahmed
1 sibling, 2 replies; 24+ messages in thread
From: Matthew Wilcox @ 2024-06-24 18:33 UTC (permalink / raw)
To: Yosry Ahmed
Cc: kernel test robot, Usama Arif, oe-lkp, lkp,
Linux Memory Management List, Andrew Morton, Chengming Zhou,
Nhat Pham, David Hildenbrand, Huang, Ying, Hugh Dickins,
Johannes Weiner, Shakeel Butt, Andi Kleen, linux-kernel
On Mon, Jun 24, 2024 at 05:05:56AM -0700, Yosry Ahmed wrote:
> On Mon, Jun 24, 2024 at 1:49 AM kernel test robot <oliver.sang@intel.com> wrote:
> > kernel test robot noticed "WARNING:at_mm/page_alloc.c:#__alloc_pages_noprof" on:
> >
> > commit: 0fa2857d23aa170e5e28d13c467b303b0065aad8 ("mm: store zero pages to be swapped out in a bitmap")
> > https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
>
> This is coming from WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp), and
> is triggered by the new bitmap_zalloc() call in the swapon path. For a
> sufficiently large swapfile, bitmap_zalloc() (which uses kmalloc()
> under the hood) cannot be used to allocate the bitmap.
Do we need to use a bitmap?
We could place a special entry in the swapcache instead (there's
XA_ZERO_ENTRY already defined, and if we need a different entry that's
not XA_ZERO_ENTRY, there's room for a few hundred more special entries).
* Re: [linux-next:master] [mm] 0fa2857d23: WARNING:at_mm/page_alloc.c:#__alloc_pages_noprof
2024-06-24 18:33 ` Matthew Wilcox
@ 2024-06-24 18:50 ` Usama Arif
2024-06-24 18:53 ` Yosry Ahmed
2024-06-24 18:54 ` Matthew Wilcox
2024-06-24 18:53 ` Yosry Ahmed
1 sibling, 2 replies; 24+ messages in thread
From: Usama Arif @ 2024-06-24 18:50 UTC (permalink / raw)
To: Matthew Wilcox, Yosry Ahmed
Cc: kernel test robot, oe-lkp, lkp, Linux Memory Management List,
Andrew Morton, Chengming Zhou, Nhat Pham, David Hildenbrand,
Huang, Ying, Hugh Dickins, Johannes Weiner, Shakeel Butt,
Andi Kleen, linux-kernel
On 24/06/2024 21:33, Matthew Wilcox wrote:
> On Mon, Jun 24, 2024 at 05:05:56AM -0700, Yosry Ahmed wrote:
>> On Mon, Jun 24, 2024 at 1:49 AM kernel test robot <oliver.sang@intel.com> wrote:
>>> kernel test robot noticed "WARNING:at_mm/page_alloc.c:#__alloc_pages_noprof" on:
>>>
>>> commit: 0fa2857d23aa170e5e28d13c467b303b0065aad8 ("mm: store zero pages to be swapped out in a bitmap")
>>> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
>> This is coming from WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp), and
>> is triggered by the new bitmap_zalloc() call in the swapon path. For a
>> sufficiently large swapfile, bitmap_zalloc() (which uses kmalloc()
>> under the hood) cannot be used to allocate the bitmap.
> Do we need to use a bitmap?
>
> We could place a special entry in the swapcache instead (there's
> XA_ZERO_ENTRY already defined, and if we need a different entry that's
> not XA_ZERO_ENTRY, there's room for a few hundred more special entries).
I was going for the most space-efficient and simplest data structure,
which is a bitmap. I believe an xarray entry is either a pointer or an
integer between 0 and LONG_MAX? We could convert the individual bits into
integers and store those, and have another function to extract the stored
integer back into a bit, but I think that's basically a separate
bitmap_xarray API (which would probably take more space than a
traditional bitmap, and I don't want to make this series dependent on
something like that), so I would prefer to use a bitmap.
* Re: [linux-next:master] [mm] 0fa2857d23: WARNING:at_mm/page_alloc.c:#__alloc_pages_noprof
2024-06-24 18:33 ` Matthew Wilcox
2024-06-24 18:50 ` Usama Arif
@ 2024-06-24 18:53 ` Yosry Ahmed
2024-06-24 18:56 ` Matthew Wilcox
1 sibling, 1 reply; 24+ messages in thread
From: Yosry Ahmed @ 2024-06-24 18:53 UTC (permalink / raw)
To: Matthew Wilcox
Cc: kernel test robot, Usama Arif, oe-lkp, lkp,
Linux Memory Management List, Andrew Morton, Chengming Zhou,
Nhat Pham, David Hildenbrand, Huang, Ying, Hugh Dickins,
Johannes Weiner, Shakeel Butt, Andi Kleen, linux-kernel
On Mon, Jun 24, 2024 at 11:33 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Jun 24, 2024 at 05:05:56AM -0700, Yosry Ahmed wrote:
> > On Mon, Jun 24, 2024 at 1:49 AM kernel test robot <oliver.sang@intel.com> wrote:
> > > kernel test robot noticed "WARNING:at_mm/page_alloc.c:#__alloc_pages_noprof" on:
> > >
> > > commit: 0fa2857d23aa170e5e28d13c467b303b0065aad8 ("mm: store zero pages to be swapped out in a bitmap")
> > > https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
> >
> > This is coming from WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp), and
> > is triggered by the new bitmap_zalloc() call in the swapon path. For a
> > sufficiently large swapfile, bitmap_zalloc() (which uses kmalloc()
> > under the hood) cannot be used to allocate the bitmap.
>
> Do we need to use a bitmap?
>
> We could place a special entry in the swapcache instead (there's
> XA_ZERO_ENTRY already defined, and if we need a different entry that's
> not XA_ZERO_ENTRY, there's room for a few hundred more special entries).
After a page is swapped out during reclaim, __remove_mapping() will
call __delete_from_swap_cache() to replace the swap cache entry with a
shadow entry (which is an xa_value).
So I believe we cannot use a special xarray value without making
fundamental changes. We can, perhaps, try to pack an extra bit in the
shadow entry. In this case, we will need to change the swapin code to
check for this magic bit when looking up a folio in the swap cache,
and extra synchronization to make sure concurrent lookups do not
allocate and zero separate folios.
IOW, I think it's possible but probably with more complexity, and
perhaps not worth it.
* Re: [linux-next:master] [mm] 0fa2857d23: WARNING:at_mm/page_alloc.c:#__alloc_pages_noprof
2024-06-24 18:50 ` Usama Arif
@ 2024-06-24 18:53 ` Yosry Ahmed
2024-06-24 18:54 ` Matthew Wilcox
1 sibling, 0 replies; 24+ messages in thread
From: Yosry Ahmed @ 2024-06-24 18:53 UTC (permalink / raw)
To: Usama Arif
Cc: Matthew Wilcox, kernel test robot, oe-lkp, lkp,
Linux Memory Management List, Andrew Morton, Chengming Zhou,
Nhat Pham, David Hildenbrand, Huang, Ying, Hugh Dickins,
Johannes Weiner, Shakeel Butt, Andi Kleen, linux-kernel
On Mon, Jun 24, 2024 at 11:50 AM Usama Arif <usamaarif642@gmail.com> wrote:
>
>
> On 24/06/2024 21:33, Matthew Wilcox wrote:
> > On Mon, Jun 24, 2024 at 05:05:56AM -0700, Yosry Ahmed wrote:
> >> On Mon, Jun 24, 2024 at 1:49 AM kernel test robot <oliver.sang@intel.com> wrote:
> >>> kernel test robot noticed "WARNING:at_mm/page_alloc.c:#__alloc_pages_noprof" on:
> >>>
> >>> commit: 0fa2857d23aa170e5e28d13c467b303b0065aad8 ("mm: store zero pages to be swapped out in a bitmap")
> >>> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
> >> This is coming from WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp), and
> >> is triggered by the new bitmap_zalloc() call in the swapon path. For a
> >> sufficiently large swapfile, bitmap_zalloc() (which uses kmalloc()
> >> under the hood) cannot be used to allocate the bitmap.
> > Do we need to use a bitmap?
> >
> > We could place a special entry in the swapcache instead (there's
> > XA_ZERO_ENTRY already defined, and if we need a different entry that's
> > not XA_ZERO_ENTRY, there's room for a few hundred more special entries).
>
> I was going for the most space-efficient and simplest data structure,
> which is bitmap. I believe xarray is either pointer or integer between 0
> and LONG_MAX? We could convert the individual bits into integer and
> store them, and have another function to extract the integer stored in
> xarray to a bit, but I think thats basically a separate bitmap_xarray
> API (which would probably take more space than a traditional bitmap API,
> and I dont want to make this series dependent on something like that),
> so I would prefer to use bitmap.
I believe Matthew meant reusing the xarray used by the existing
swapcache, not adding a new one for this purpose. See my response.
* Re: [linux-next:master] [mm] 0fa2857d23: WARNING:at_mm/page_alloc.c:#__alloc_pages_noprof
2024-06-24 18:50 ` Usama Arif
2024-06-24 18:53 ` Yosry Ahmed
@ 2024-06-24 18:54 ` Matthew Wilcox
1 sibling, 0 replies; 24+ messages in thread
From: Matthew Wilcox @ 2024-06-24 18:54 UTC (permalink / raw)
To: Usama Arif
Cc: Yosry Ahmed, kernel test robot, oe-lkp, lkp,
Linux Memory Management List, Andrew Morton, Chengming Zhou,
Nhat Pham, David Hildenbrand, Huang, Ying, Hugh Dickins,
Johannes Weiner, Shakeel Butt, Andi Kleen, linux-kernel
On Mon, Jun 24, 2024 at 09:50:21PM +0300, Usama Arif wrote:
> On 24/06/2024 21:33, Matthew Wilcox wrote:
> > On Mon, Jun 24, 2024 at 05:05:56AM -0700, Yosry Ahmed wrote:
> > > On Mon, Jun 24, 2024 at 1:49 AM kernel test robot <oliver.sang@intel.com> wrote:
> > > > kernel test robot noticed "WARNING:at_mm/page_alloc.c:#__alloc_pages_noprof" on:
> > > >
> > > > commit: 0fa2857d23aa170e5e28d13c467b303b0065aad8 ("mm: store zero pages to be swapped out in a bitmap")
> > > > https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
> > > This is coming from WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp), and
> > > is triggered by the new bitmap_zalloc() call in the swapon path. For a
> > > sufficiently large swapfile, bitmap_zalloc() (which uses kmalloc()
> > > under the hood) cannot be used to allocate the bitmap.
> > Do we need to use a bitmap?
> >
> > We could place a special entry in the swapcache instead (there's
> > XA_ZERO_ENTRY already defined, and if we need a different entry that's
> > not XA_ZERO_ENTRY, there's room for a few hundred more special entries).
>
> I was going for the most space-efficient and simplest data structure, which
> is bitmap. I believe xarray is either pointer or integer between 0 and
> LONG_MAX? We could convert the individual bits into integer and store them,
> and have another function to extract the integer stored in xarray to a bit,
> but I think thats basically a separate bitmap_xarray API (which would
> probably take more space than a traditional bitmap API, and I dont want to
> make this series dependent on something like that), so I would prefer to use
> bitmap.
But we already _have_ an xarray. Instead of storing a swap entry in
it, we could store an XA_ZERO_ENTRY.
If there are long runs of zero pages, then this may not be the best
idea. But then a bitmap isn't the best data structure for long runs
either.
* Re: [linux-next:master] [mm] 0fa2857d23: WARNING:at_mm/page_alloc.c:#__alloc_pages_noprof
2024-06-24 18:53 ` Yosry Ahmed
@ 2024-06-24 18:56 ` Matthew Wilcox
2024-06-24 18:57 ` Yosry Ahmed
0 siblings, 1 reply; 24+ messages in thread
From: Matthew Wilcox @ 2024-06-24 18:56 UTC (permalink / raw)
To: Yosry Ahmed
Cc: kernel test robot, Usama Arif, oe-lkp, lkp,
Linux Memory Management List, Andrew Morton, Chengming Zhou,
Nhat Pham, David Hildenbrand, Huang, Ying, Hugh Dickins,
Johannes Weiner, Shakeel Butt, Andi Kleen, linux-kernel
On Mon, Jun 24, 2024 at 11:53:30AM -0700, Yosry Ahmed wrote:
> After a page is swapped out during reclaim, __remove_mapping() will
> call __delete_from_swap_cache() to replace the swap cache entry with a
> shadow entry (which is an xa_value).
Special entries are disjoint from shadow entries. Shadow entries have
the last two bits as 01 or 11 (are congruent to 1 or 3 modulo 4).
Special entries have values below 4096 which end in 10 (are congruent
to 2 modulo 4).
* Re: [linux-next:master] [mm] 0fa2857d23: WARNING:at_mm/page_alloc.c:#__alloc_pages_noprof
2024-06-24 18:56 ` Matthew Wilcox
@ 2024-06-24 18:57 ` Yosry Ahmed
2024-06-24 19:26 ` Matthew Wilcox
0 siblings, 1 reply; 24+ messages in thread
From: Yosry Ahmed @ 2024-06-24 18:57 UTC (permalink / raw)
To: Matthew Wilcox
Cc: kernel test robot, Usama Arif, oe-lkp, lkp,
Linux Memory Management List, Andrew Morton, Chengming Zhou,
Nhat Pham, David Hildenbrand, Huang, Ying, Hugh Dickins,
Johannes Weiner, Shakeel Butt, Andi Kleen, linux-kernel
On Mon, Jun 24, 2024 at 11:56 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Jun 24, 2024 at 11:53:30AM -0700, Yosry Ahmed wrote:
> > After a page is swapped out during reclaim, __remove_mapping() will
> > call __delete_from_swap_cache() to replace the swap cache entry with a
> > shadow entry (which is an xa_value).
>
> Special entries are disjoint from shadow entries. Shadow entries have
> the last two bits as 01 or 11 (are congruent to 1 or 3 modulo 4).
> Special entries have values below 4096 which end in 10 (are congruent
> to 2 modulo 4).
You are implying that we would no longer have a shadow entry for such
zero folios, because we will be storing a special entry instead.
Right?
This is the "fundamental" change I am talking about.
* Re: [linux-next:master] [mm] 0fa2857d23: WARNING:at_mm/page_alloc.c:#__alloc_pages_noprof
2024-06-24 18:57 ` Yosry Ahmed
@ 2024-06-24 19:26 ` Matthew Wilcox
2024-06-24 19:34 ` Yosry Ahmed
0 siblings, 1 reply; 24+ messages in thread
From: Matthew Wilcox @ 2024-06-24 19:26 UTC (permalink / raw)
To: Yosry Ahmed
Cc: kernel test robot, Usama Arif, oe-lkp, lkp,
Linux Memory Management List, Andrew Morton, Chengming Zhou,
Nhat Pham, David Hildenbrand, Huang, Ying, Hugh Dickins,
Johannes Weiner, Shakeel Butt, Andi Kleen, linux-kernel
On Mon, Jun 24, 2024 at 11:57:45AM -0700, Yosry Ahmed wrote:
> On Mon, Jun 24, 2024 at 11:56 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Mon, Jun 24, 2024 at 11:53:30AM -0700, Yosry Ahmed wrote:
> > > After a page is swapped out during reclaim, __remove_mapping() will
> > > call __delete_from_swap_cache() to replace the swap cache entry with a
> > > shadow entry (which is an xa_value).
> >
> > Special entries are disjoint from shadow entries. Shadow entries have
> > the last two bits as 01 or 11 (are congruent to 1 or 3 modulo 4).
> > Special entries have values below 4096 which end in 10 (are congruent
> > to 2 modulo 4).
>
> You are implying that we would no longer have a shadow entry for such
> zero folios, because we will be storing a special entry instead.
> Right?
umm ... maybe I have a misunderstanding here.
I'm saying that there wouldn't be a _swap_ entry here because the folio
wouldn't be stored anywhere on the swap device. But there could be a
_shadow_ entry. Although if the page is full of zeroes, it was probably
never referenced and doesn't really need a shadow entry.
* Re: [linux-next:master] [mm] 0fa2857d23: WARNING:at_mm/page_alloc.c:#__alloc_pages_noprof
2024-06-24 19:26 ` Matthew Wilcox
@ 2024-06-24 19:34 ` Yosry Ahmed
2024-06-24 19:50 ` Matthew Wilcox
0 siblings, 1 reply; 24+ messages in thread
From: Yosry Ahmed @ 2024-06-24 19:34 UTC (permalink / raw)
To: Matthew Wilcox
Cc: kernel test robot, Usama Arif, oe-lkp, lkp,
Linux Memory Management List, Andrew Morton, Chengming Zhou,
Nhat Pham, David Hildenbrand, Huang, Ying, Hugh Dickins,
Johannes Weiner, Shakeel Butt, Andi Kleen, linux-kernel
On Mon, Jun 24, 2024 at 12:26 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Jun 24, 2024 at 11:57:45AM -0700, Yosry Ahmed wrote:
> > On Mon, Jun 24, 2024 at 11:56 AM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Mon, Jun 24, 2024 at 11:53:30AM -0700, Yosry Ahmed wrote:
> > > > After a page is swapped out during reclaim, __remove_mapping() will
> > > > call __delete_from_swap_cache() to replace the swap cache entry with a
> > > > shadow entry (which is an xa_value).
> > >
> > > Special entries are disjoint from shadow entries. Shadow entries have
> > > the last two bits as 01 or 11 (are congruent to 1 or 3 modulo 4).
> > > Special entries have values below 4096 which end in 10 (are congruent
> > > to 2 modulo 4).
> >
> > You are implying that we would no longer have a shadow entry for such
> > zero folios, because we will be storing a special entry instead.
> > Right?
>
> umm ... maybe I have a misunderstanding here.
>
> I'm saying that there wouldn't be a _swap_ entry here because the folio
> wouldn't be stored anywhere on the swap device. But there could be a
> _shadow_ entry. Although if the page is full of zeroes, it was probably
> never referenced and doesn't really need a shadow entry.
Is it possible to have a shadow entry AND a special entry (e.g.
XA_ZERO_ENTRY) at the same index? This is what would be required to
maintain the current behavior (assuming we really need the shadow
entries for such zeroed folios).
* Re: [linux-next:master] [mm] 0fa2857d23: WARNING:at_mm/page_alloc.c:#__alloc_pages_noprof
2024-06-24 19:34 ` Yosry Ahmed
@ 2024-06-24 19:50 ` Matthew Wilcox
2024-06-24 20:39 ` Shakeel Butt
0 siblings, 1 reply; 24+ messages in thread
From: Matthew Wilcox @ 2024-06-24 19:50 UTC (permalink / raw)
To: Yosry Ahmed
Cc: kernel test robot, Usama Arif, oe-lkp, lkp,
Linux Memory Management List, Andrew Morton, Chengming Zhou,
Nhat Pham, David Hildenbrand, Huang, Ying, Hugh Dickins,
Johannes Weiner, Shakeel Butt, Andi Kleen, linux-kernel
On Mon, Jun 24, 2024 at 12:34:04PM -0700, Yosry Ahmed wrote:
> On Mon, Jun 24, 2024 at 12:26 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Mon, Jun 24, 2024 at 11:57:45AM -0700, Yosry Ahmed wrote:
> > > On Mon, Jun 24, 2024 at 11:56 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > >
> > > > On Mon, Jun 24, 2024 at 11:53:30AM -0700, Yosry Ahmed wrote:
> > > > > After a page is swapped out during reclaim, __remove_mapping() will
> > > > > call __delete_from_swap_cache() to replace the swap cache entry with a
> > > > > shadow entry (which is an xa_value).
> > > >
> > > > Special entries are disjoint from shadow entries. Shadow entries have
> > > > the last two bits as 01 or 11 (are congruent to 1 or 3 modulo 4).
> > > > Special entries have values below 4096 which end in 10 (are congruent
> > > > to 2 modulo 4).
> > >
> > > You are implying that we would no longer have a shadow entry for such
> > > zero folios, because we will be storing a special entry instead.
> > > Right?
> >
> > umm ... maybe I have a misunderstanding here.
> >
> > I'm saying that there wouldn't be a _swap_ entry here because the folio
> > wouldn't be stored anywhere on the swap device. But there could be a
> > _shadow_ entry. Although if the page is full of zeroes, it was probably
> > never referenced and doesn't really need a shadow entry.
>
> Is it possible to have a shadow entry AND a special entry (e.g.
> XA_ZERO_ENTRY) at the same index? This is what would be required to
> maintain the current behavior (assuming we really need the shadow
> entries for such zeroed folios).
No, just like it's not possible to have a swap entry and a shadow entry
at the same location. You have to choose. But the zero entry is an
alternative to the swap entry, not the shadow entry.
As I understand the swap cache, at the moment, you can have four
possible results from a lookup:
- NULL
- a swap entry
- a shadow entry
- a folio
Do I have that wrong?
* Re: [linux-next:master] [mm] 0fa2857d23: WARNING:at_mm/page_alloc.c:#__alloc_pages_noprof
2024-06-24 19:50 ` Matthew Wilcox
@ 2024-06-24 20:39 ` Shakeel Butt
2024-06-24 20:51 ` Matthew Wilcox
0 siblings, 1 reply; 24+ messages in thread
From: Shakeel Butt @ 2024-06-24 20:39 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Yosry Ahmed, kernel test robot, Usama Arif, oe-lkp, lkp,
Linux Memory Management List, Andrew Morton, Chengming Zhou,
Nhat Pham, David Hildenbrand, Huang, Ying, Hugh Dickins,
Johannes Weiner, Andi Kleen, linux-kernel
On Mon, Jun 24, 2024 at 08:50:45PM GMT, Matthew Wilcox wrote:
> On Mon, Jun 24, 2024 at 12:34:04PM -0700, Yosry Ahmed wrote:
> > On Mon, Jun 24, 2024 at 12:26 PM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Mon, Jun 24, 2024 at 11:57:45AM -0700, Yosry Ahmed wrote:
> > > > On Mon, Jun 24, 2024 at 11:56 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > > >
> > > > > On Mon, Jun 24, 2024 at 11:53:30AM -0700, Yosry Ahmed wrote:
> > > > > > After a page is swapped out during reclaim, __remove_mapping() will
> > > > > > call __delete_from_swap_cache() to replace the swap cache entry with a
> > > > > > shadow entry (which is an xa_value).
> > > > >
> > > > > Special entries are disjoint from shadow entries. Shadow entries have
> > > > > the last two bits as 01 or 11 (are congruent to 1 or 3 modulo 4).
> > > > > Special entries have values below 4096 which end in 10 (are congruent
> > > > > to 2 modulo 4).
> > > >
> > > > You are implying that we would no longer have a shadow entry for such
> > > > zero folios, because we will be storing a special entry instead.
> > > > Right?
> > >
> > > umm ... maybe I have a misunderstanding here.
> > >
> > > I'm saying that there wouldn't be a _swap_ entry here because the folio
> > > wouldn't be stored anywhere on the swap device. But there could be a
> > > _shadow_ entry. Although if the page is full of zeroes, it was probably
> > > never referenced and doesn't really need a shadow entry.
> >
> > Is it possible to have a shadow entry AND a special entry (e.g.
> > XA_ZERO_ENTRY) at the same index? This is what would be required to
> > maintain the current behavior (assuming we really need the shadow
> > entries for such zeroed folios).
>
> No, just like it's not possible to have a swap entry and a shadow entry
> at the same location. You have to choose. But the zero entry is an
> alternative to the swap entry, not the shadow entry.
>
> As I understand the swap cache, at the moment, you can have four
> possible results from a lookup:
>
> - NULL
> - a swap entry
> - a shadow entry
> - a folio
>
> Do I have that wrong?
I don't think we have a swap entry in the swapcache (the underlying
xarray). The swap entry is used as an index to find the folio or shadow
entry.
* Re: [linux-next:master] [mm] 0fa2857d23: WARNING:at_mm/page_alloc.c:#__alloc_pages_noprof
2024-06-24 20:39 ` Shakeel Butt
@ 2024-06-24 20:51 ` Matthew Wilcox
2024-06-24 21:02 ` Shakeel Butt
0 siblings, 1 reply; 24+ messages in thread
From: Matthew Wilcox @ 2024-06-24 20:51 UTC (permalink / raw)
To: Shakeel Butt
Cc: Yosry Ahmed, kernel test robot, Usama Arif, oe-lkp, lkp,
Linux Memory Management List, Andrew Morton, Chengming Zhou,
Nhat Pham, David Hildenbrand, Huang, Ying, Hugh Dickins,
Johannes Weiner, Andi Kleen, linux-kernel
On Mon, Jun 24, 2024 at 01:39:45PM -0700, Shakeel Butt wrote:
> On Mon, Jun 24, 2024 at 08:50:45PM GMT, Matthew Wilcox wrote:
> > On Mon, Jun 24, 2024 at 12:34:04PM -0700, Yosry Ahmed wrote:
> > > On Mon, Jun 24, 2024 at 12:26 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > >
> > > > On Mon, Jun 24, 2024 at 11:57:45AM -0700, Yosry Ahmed wrote:
> > > > > On Mon, Jun 24, 2024 at 11:56 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > >
> > > > > > On Mon, Jun 24, 2024 at 11:53:30AM -0700, Yosry Ahmed wrote:
> > > > > > > After a page is swapped out during reclaim, __remove_mapping() will
> > > > > > > call __delete_from_swap_cache() to replace the swap cache entry with a
> > > > > > > shadow entry (which is an xa_value).
> > > > > >
> > > > > > Special entries are disjoint from shadow entries. Shadow entries have
> > > > > > the last two bits as 01 or 11 (are congruent to 1 or 3 modulo 4).
> > > > > > Special entries have values below 4096 which end in 10 (are congruent
> > > > > > to 2 modulo 4).
> > > > >
> > > > > You are implying that we would no longer have a shadow entry for such
> > > > > zero folios, because we will be storing a special entry instead.
> > > > > Right?
> > > >
> > > > umm ... maybe I have a misunderstanding here.
> > > >
> > > > I'm saying that there wouldn't be a _swap_ entry here because the folio
> > > > wouldn't be stored anywhere on the swap device. But there could be a
> > > > _shadow_ entry. Although if the page is full of zeroes, it was probably
> > > > never referenced and doesn't really need a shadow entry.
> > >
> > > Is it possible to have a shadow entry AND a special entry (e.g.
> > > XA_ZERO_ENTRY) at the same index? This is what would be required to
> > > maintain the current behavior (assuming we really need the shadow
> > > entries for such zeroed folios).
> >
> > No, just like it's not possible to have a swap entry and a shadow entry
> > at the same location. You have to choose. But the zero entry is an
> > alternative to the swap entry, not the shadow entry.
> >
> > As I understand the swap cache, at the moment, you can have four
> > possible results from a lookup:
> >
> > - NULL
> > - a swap entry
> > - a shadow entry
> > - a folio
> >
> > Do I have that wrong?
>
> I don't think we have swap entry in the swapcache (underlying xarray).
> The swap entry is used as an index to find the folio or shadow entry.
Ah. I think I understand the procedure now.
We store a swap entry in the page table entry. That tells us both where
in the swap cache the folio might be found, and where in the swap device
the data can be found (because there is a very simple calculation for
both). If the folio is not present, then there's a shadow entry which
summarises the LRU information that would be stored in the folio had it
not been evicted from the swapcache.
We can't know at the point where we unmap the page whether it's full
of zeroes or not, because we can't afford to scan its contents. At the
point where we decide to swap out the folio, we can afford to make that
decision because the cost of doing the I/O is high enough.
So the question is whether we can afford to throw away the shadow
information and just store the information that this was a zero entry.
I think we can, but it is a more bold proposal than I realised I was
making.
* Re: [linux-next:master] [mm] 0fa2857d23: WARNING:at_mm/page_alloc.c:#__alloc_pages_noprof
2024-06-24 20:51 ` Matthew Wilcox
@ 2024-06-24 21:02 ` Shakeel Butt
0 siblings, 0 replies; 24+ messages in thread
From: Shakeel Butt @ 2024-06-24 21:02 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Yosry Ahmed, kernel test robot, Usama Arif, oe-lkp, lkp,
Linux Memory Management List, Andrew Morton, Chengming Zhou,
Nhat Pham, David Hildenbrand, Huang, Ying, Hugh Dickins,
Johannes Weiner, Andi Kleen, linux-kernel
On Mon, Jun 24, 2024 at 09:51:33PM GMT, Matthew Wilcox wrote:
> On Mon, Jun 24, 2024 at 01:39:45PM -0700, Shakeel Butt wrote:
> > On Mon, Jun 24, 2024 at 08:50:45PM GMT, Matthew Wilcox wrote:
> > > On Mon, Jun 24, 2024 at 12:34:04PM -0700, Yosry Ahmed wrote:
> > > > On Mon, Jun 24, 2024 at 12:26 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > >
> > > > > On Mon, Jun 24, 2024 at 11:57:45AM -0700, Yosry Ahmed wrote:
> > > > > > On Mon, Jun 24, 2024 at 11:56 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > > >
> > > > > > > On Mon, Jun 24, 2024 at 11:53:30AM -0700, Yosry Ahmed wrote:
> > > > > > > > After a page is swapped out during reclaim, __remove_mapping() will
> > > > > > > > call __delete_from_swap_cache() to replace the swap cache entry with a
> > > > > > > > shadow entry (which is an xa_value).
> > > > > > >
> > > > > > > Special entries are disjoint from shadow entries. Shadow entries have
> > > > > > > the last two bits as 01 or 11 (are congruent to 1 or 3 modulo 4).
> > > > > > > Special entries have values below 4096 which end in 10 (are congruent
> > > > > > > to 2 modulo 4).
> > > > > >
> > > > > > You are implying that we would no longer have a shadow entry for such
> > > > > > zero folios, because we will be storing a special entry instead.
> > > > > > Right?
> > > > >
> > > > > umm ... maybe I have a misunderstanding here.
> > > > >
> > > > > I'm saying that there wouldn't be a _swap_ entry here because the folio
> > > > > wouldn't be stored anywhere on the swap device. But there could be a
> > > > > _shadow_ entry. Although if the page is full of zeroes, it was probably
> > > > > never referenced and doesn't really need a shadow entry.
> > > >
> > > > Is it possible to have a shadow entry AND a special entry (e.g.
> > > > XA_ZERO_ENTRY) at the same index? This is what would be required to
> > > > maintain the current behavior (assuming we really need the shadow
> > > > entries for such zeroed folios).
> > >
> > > No, just like it's not possible to have a swap entry and a shadow entry
> > > at the same location. You have to choose. But the zero entry is an
> > > alternative to the swap entry, not the shadow entry.
> > >
> > > As I understand the swap cache, at the moment, you can have four
> > > possible results from a lookup:
> > >
> > > - NULL
> > > - a swap entry
> > > - a shadow entry
> > > - a folio
> > >
> > > Do I have that wrong?
> >
> > I don't think we have swap entry in the swapcache (underlying xarray).
> > The swap entry is used as an index to find the folio or shadow entry.
>
> Ah. I think I understand the procedure now.
>
> We store a swap entry in the page table entry. That tells us both where
> in the swap cache the folio might be found, and where in the swap device
> the data can be found (because there is a very simple calculation for
> both). If the folio is not present, then there's a shadow entry which
> summarises the LRU information that would be stored in the folio had it
> not been evicted from the swapcache.
>
> We can't know at the point where we unmap the page whether it's full
> of zeroes or not, because we can't afford to scan its contents. At the
> point where we decide to swap out the folio, we can afford to make that
> decision because the cost of doing the I/O is high enough.
>
> So the question is whether we can afford to throw away the shadow
> information and just store the information that this was a zero entry.
> I think we can, but it is a more bold proposal than I realised I was
> making.
I agree that we can throw away the shadow entry in favor of the zero
entry but, as you already noted, it requires changes at multiple places.
At the moment I can think of:
1. The zero entry is not reclaimable like a shadow entry.
2. We need to decide the right place to allocate the zero folio on swapin.
3. Should this be treated as a major fault for stats purposes?
I have definitely missed more points as well.
* Re: [linux-next:master] [mm] 0fa2857d23: WARNING:at_mm/page_alloc.c:#__alloc_pages_noprof
2024-06-24 18:26 ` Usama Arif
@ 2024-06-27 11:05 ` Usama Arif
0 siblings, 0 replies; 24+ messages in thread
From: Usama Arif @ 2024-06-27 11:05 UTC (permalink / raw)
To: Yosry Ahmed, Hugh Dickins, Yury Norov, Rasmus Villemoes
Cc: kernel test robot, oe-lkp, lkp, Linux Memory Management List,
Andrew Morton, Chengming Zhou, Nhat Pham, David Hildenbrand,
Huang, Ying, Johannes Weiner, Matthew Wilcox, Shakeel Butt,
Andi Kleen, linux-kernel, Vlastimil Babka (SUSE)
On 24/06/2024 21:26, Usama Arif wrote:
>
> On 24/06/2024 20:31, Yosry Ahmed wrote:
>> On Mon, Jun 24, 2024 at 10:26 AM Usama Arif <usamaarif642@gmail.com>
>> wrote:
>>>
>>> On 24/06/2024 19:56, Yosry Ahmed wrote:
>>>> [..]
>>>>>>> - p->zeromap = bitmap_zalloc(maxpages, GFP_KERNEL);
>>>>>>> + p->zeromap = kvzalloc(DIV_ROUND_UP(maxpages, 8), GFP_KERNEL);
>>>>>> No, 8 is not right for 32-bit kernels. I think you want
>>>>>> p->zeromap = kvzalloc(BITS_TO_LONGS(maxpages), GFP_KERNEL);
>>>>>> but please check it carefully, I'm easily confused by such
>>>>>> conversions.
>>>>>>
>>>>>> Hugh
>>>>> Ah yes, I didn't take the 32-bit kernel into account. I think it's
>>>>> supposed to be
>>>>>
>>>>> p->zeromap = kvzalloc(BITS_TO_LONGS(maxpages) * sizeof(unsigned long),
>>>>>                       GFP_KERNEL);
>>>> You can do something similar to bitmap_zalloc() and use:
>>>>
>>>> kvmalloc_array(BITS_TO_LONGS(nbits), sizeof(unsigned long),
>>>>                GFP_KERNEL | __GFP_ZERO)
>>>>
>>>> I don't see a kvzalloc_array() variant to use directly, but it should
>>>> be trivial to add it. I can see other users of kvmalloc_array() that
>>>> pass in __GFP_ZERO (e.g. fs/ntfs3/bitmap.c).
>>>>
>>>> Or you could take it a step further and add bitmap_kvzalloc(),
>>>> assuming the maintainers are open to that.
>>> Thanks! bitmap_kvzalloc makes the most sense to me. It doesn't make
>>> sense that bitmaps should be limited to MAX_PAGE_ORDER size. I can add
>>> the patch below at the start of the series and use it in the patch for
>>> the zeropage swap optimization.
>>>
>>>
>>> bitmap: add support for virtually contiguous bitmap
>>>
>>> The current bitmap_zalloc API limits the allocation to MAX_PAGE_ORDER,
>>> which prevents larger-order bitmap allocations. Introduce
>>> bitmap_kvzalloc, which allows larger bitmap allocations.
>>> kvmalloc_array still attempts to allocate physically contiguous memory,
>>> but upon failure falls back to non-contiguous (vmalloc) allocation.
>>>
>>> Suggested-by: Yosry Ahmed <yosryahmed@google.com>
>>> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
>>>
>> LGTM with a small fix below.
>>
>>> diff --git a/include/linux/bitmap.h b/include/linux/bitmap.h
>>> index 8c4768c44a01..881c2ff2e834 100644
>>> --- a/include/linux/bitmap.h
>>> +++ b/include/linux/bitmap.h
>>> @@ -131,9 +131,11 @@ struct device;
>>> */
>>> unsigned long *bitmap_alloc(unsigned int nbits, gfp_t flags);
>>> unsigned long *bitmap_zalloc(unsigned int nbits, gfp_t flags);
>>> +unsigned long *bitmap_kvzalloc(unsigned int nbits, gfp_t flags);
>>> unsigned long *bitmap_alloc_node(unsigned int nbits, gfp_t flags, int node);
>>> unsigned long *bitmap_zalloc_node(unsigned int nbits, gfp_t flags, int node);
>>> void bitmap_free(const unsigned long *bitmap);
>>> +void bitmap_kvfree(const unsigned long *bitmap);
>>>
>>> DEFINE_FREE(bitmap, unsigned long *, if (_T) bitmap_free(_T))
>>>
>>> diff --git a/lib/bitmap.c b/lib/bitmap.c
>>> index b97692854966..eabbfb85fb45 100644
>>> --- a/lib/bitmap.c
>>> +++ b/lib/bitmap.c
>>> @@ -727,6 +727,13 @@ unsigned long *bitmap_zalloc(unsigned int nbits, gfp_t flags)
>>> }
>>> EXPORT_SYMBOL(bitmap_zalloc);
>>>
>>> +unsigned long *bitmap_kvzalloc(unsigned int nbits, gfp_t flags)
>>> +{
>>> +	return kvmalloc_array(BITS_TO_LONGS(nbits), sizeof(unsigned long),
>>> +			      flags | __GFP_ZERO);
>>> +}
>>> +EXPORT_SYMBOL(bitmap_zalloc);
>> EXPORT_SYMBOL(bitmap_kvzalloc)*
>
>
> Actually, does it make more sense to change the behaviour of the
> current APIs as below, instead of the above patch? Or is there an
> expectation that the current bitmap API is supposed to work only on
> physically contiguous memory?
>
> I believe that in the kernel, if an allocation/free function starts
> with 'k' it's physically contiguous, and with 'kv' it's physically
> contiguous if possible, otherwise virtually contiguous. The bitmap
> functions have neither prefix, so we could change the current
> implementation. I believe it would not impact current users of the
> functions, as the first attempt is still physically contiguous (which
> is how it works today); only upon failure would it fall back to
> virtually contiguous, and it would extend the bitmap API to
> allocations larger than MAX_PAGE_ORDER.
>
> Yury Norov and Rasmus Villemoes, any views on this?
>
> Thanks
>
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index 7247e217e21b..ad771dc81afa 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -804,6 +804,7 @@ kvmalloc_array_node_noprof(size_t n, size_t size, gfp_t flags, int node)
> #define kvcalloc_node_noprof(_n,_s,_f,_node) kvmalloc_array_node_noprof(_n,_s,(_f)|__GFP_ZERO,_node)
> #define kvcalloc_noprof(...) kvcalloc_node_noprof(__VA_ARGS__, NUMA_NO_NODE)
>
> +#define kvmalloc_array_node(...) alloc_hooks(kvmalloc_array_node_noprof(__VA_ARGS__))
> #define kvmalloc_array(...) alloc_hooks(kvmalloc_array_noprof(__VA_ARGS__))
> #define kvcalloc_node(...) alloc_hooks(kvcalloc_node_noprof(__VA_ARGS__))
> #define kvcalloc(...) alloc_hooks(kvcalloc_noprof(__VA_ARGS__))
> diff --git a/lib/bitmap.c b/lib/bitmap.c
> index b97692854966..272164dcbef1 100644
> --- a/lib/bitmap.c
> +++ b/lib/bitmap.c
> @@ -716,7 +716,7 @@ void bitmap_fold(unsigned long *dst, const unsigned long *orig,
>
> unsigned long *bitmap_alloc(unsigned int nbits, gfp_t flags)
> {
> -	return kmalloc_array(BITS_TO_LONGS(nbits), sizeof(unsigned long),
> +	return kvmalloc_array(BITS_TO_LONGS(nbits), sizeof(unsigned long),
> 			     flags);
> }
> EXPORT_SYMBOL(bitmap_alloc);
> @@ -729,7 +729,7 @@ EXPORT_SYMBOL(bitmap_zalloc);
>
> unsigned long *bitmap_alloc_node(unsigned int nbits, gfp_t flags, int node)
> {
> -	return kmalloc_array_node(BITS_TO_LONGS(nbits), sizeof(unsigned long),
> +	return kvmalloc_array_node(BITS_TO_LONGS(nbits), sizeof(unsigned long),
> 				  flags, node);
> }
> EXPORT_SYMBOL(bitmap_alloc_node);
> @@ -742,7 +742,7 @@ EXPORT_SYMBOL(bitmap_zalloc_node);
>
> void bitmap_free(const unsigned long *bitmap)
> {
> - kfree(bitmap);
> + kvfree(bitmap);
> }
> EXPORT_SYMBOL(bitmap_free);
>
I decided to go with just using a simple kvmalloc_array with __GFP_ZERO
for v7 [1], instead of adding a new API to bitmap or changing the
existing API to kvmalloc/kvfree, as I didn't want to make this series
dependent on bitmap API changes, and there are other places (like ceph
and ntfs3) where it's done using kvmalloc_array. I am happy to send a
follow-up patch after this series that changes the existing API to be
kv* if that's something the bitmap maintainers think makes sense.
[1] https://lore.kernel.org/all/20240627105730.3110705-1-usamaarif642@gmail.com/
Thread overview: 24+ messages
2024-06-24 8:49 [linux-next:master] [mm] 0fa2857d23: WARNING:at_mm/page_alloc.c:#__alloc_pages_noprof kernel test robot
2024-06-24 12:05 ` Yosry Ahmed
2024-06-24 13:06 ` Usama Arif
2024-06-24 15:26 ` Hugh Dickins
2024-06-24 15:39 ` Usama Arif
2024-06-24 15:55 ` Hugh Dickins
2024-06-24 16:56 ` Yosry Ahmed
2024-06-24 17:26 ` Usama Arif
2024-06-24 17:31 ` Yosry Ahmed
2024-06-24 18:26 ` Usama Arif
2024-06-27 11:05 ` Usama Arif
2024-06-24 18:33 ` Matthew Wilcox
2024-06-24 18:50 ` Usama Arif
2024-06-24 18:53 ` Yosry Ahmed
2024-06-24 18:54 ` Matthew Wilcox
2024-06-24 18:53 ` Yosry Ahmed
2024-06-24 18:56 ` Matthew Wilcox
2024-06-24 18:57 ` Yosry Ahmed
2024-06-24 19:26 ` Matthew Wilcox
2024-06-24 19:34 ` Yosry Ahmed
2024-06-24 19:50 ` Matthew Wilcox
2024-06-24 20:39 ` Shakeel Butt
2024-06-24 20:51 ` Matthew Wilcox
2024-06-24 21:02 ` Shakeel Butt