* [linux-next:master] [mm]  0fa2857d23: WARNING:at_mm/page_alloc.c:#__alloc_pages_noprof
@ 2024-06-24  8:49 kernel test robot
  2024-06-24 12:05 ` Yosry Ahmed
  0 siblings, 1 reply; 24+ messages in thread
From: kernel test robot @ 2024-06-24  8:49 UTC (permalink / raw)
  To: Usama Arif
  Cc: oe-lkp, lkp, Linux Memory Management List, Andrew Morton,
	Chengming Zhou, Yosry Ahmed, Nhat Pham, David Hildenbrand,
	Huang, Ying, Hugh Dickins, Johannes Weiner, Matthew Wilcox,
	Shakeel Butt, Andi Kleen, linux-kernel, oliver.sang



Hello,

kernel test robot noticed "WARNING:at_mm/page_alloc.c:#__alloc_pages_noprof" on:

commit: 0fa2857d23aa170e5e28d13c467b303b0065aad8 ("mm: store zero pages to be swapped out in a bitmap")
https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master

[test failed on linux-next/master f76698bd9a8ca01d3581236082d786e9a6b72bb7]

in testcase: vm-scalability
version: vm-scalability-x86_64-6f4ef16-0_20240303
with following parameters:

	runtime: 300
	thp_enabled: always
	thp_defrag: always
	nr_task: 32
	nr_ssd: 1
	priority: 1
	test: swap-w-rand-mt
	cpufreq_governor: performance



compiler: gcc-13
test machine: 128 threads 2 sockets Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz (Ice Lake) with 128G memory

(please refer to attached dmesg/kmsg for entire log/backtrace)



If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202406241651.963e3e78-oliver.sang@intel.com


[   34.776816][ T2413] ------------[ cut here ]------------
[ 34.782497][ T2413] WARNING: CPU: 11 PID: 2413 at mm/page_alloc.c:4685 __alloc_pages_noprof (mm/page_alloc.c:4685 (discriminator 11)) 
[   34.792245][ T2413] Modules linked in: btrfs blake2b_generic xor raid6_pq libcrc32c sd_mod t10_pi intel_rapl_msr intel_rapl_common crc64_rocksoft_generic crc64_rocksoft x86_pkg_temp_thermal crc64 intel_powerclamp sg coretemp binfmt_misc kvm_intel ipmi_ssif kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel sha512_ssse3 ahci ast libahci rapl drm_shmem_helper intel_cstate mei_me intel_th_gth ioatdma acpi_power_meter i2c_i801 intel_th_pci libata intel_uncore drm_kms_helper ipmi_si acpi_ipmi dax_hmem mei i2c_smbus intel_th intel_pch_thermal dca wmi ipmi_devintf ipmi_msghandler acpi_pad joydev drm fuse loop dm_mod ip_tables
[   34.849370][ T2413] CPU: 11 PID: 2413 Comm: swapon Not tainted 6.10.0-rc4-00263-g0fa2857d23aa #1
[ 34.858458][ T2413] RIP: 0010:__alloc_pages_noprof (mm/page_alloc.c:4685 (discriminator 11)) 
[ 34.864602][ T2413] Code: 00 00 00 48 89 54 24 08 e9 83 fe ff ff 83 fd 0a 0f 86 f6 fd ff ff 80 3d 8a f4 d6 01 00 0f 85 7f fe ff ff c6 05 7d f4 d6 01 01 <0f> 0b e9 71 fe ff ff f7 c1 00 00 80 00 75 61 f7 c1 00 00 08 00 74
All code
========
   0:	00 00                	add    %al,(%rax)
   2:	00 48 89             	add    %cl,-0x77(%rax)
   5:	54                   	push   %rsp
   6:	24 08                	and    $0x8,%al
   8:	e9 83 fe ff ff       	jmpq   0xfffffffffffffe90
   d:	83 fd 0a             	cmp    $0xa,%ebp
  10:	0f 86 f6 fd ff ff    	jbe    0xfffffffffffffe0c
  16:	80 3d 8a f4 d6 01 00 	cmpb   $0x0,0x1d6f48a(%rip)        # 0x1d6f4a7
  1d:	0f 85 7f fe ff ff    	jne    0xfffffffffffffea2
  23:	c6 05 7d f4 d6 01 01 	movb   $0x1,0x1d6f47d(%rip)        # 0x1d6f4a7
  2a:*	0f 0b                	ud2    		<-- trapping instruction
  2c:	e9 71 fe ff ff       	jmpq   0xfffffffffffffea2
  31:	f7 c1 00 00 80 00    	test   $0x800000,%ecx
  37:	75 61                	jne    0x9a
  39:	f7 c1 00 00 08 00    	test   $0x80000,%ecx
  3f:	74                   	.byte 0x74

Code starting with the faulting instruction
===========================================
   0:	0f 0b                	ud2    
   2:	e9 71 fe ff ff       	jmpq   0xfffffffffffffe78
   7:	f7 c1 00 00 80 00    	test   $0x800000,%ecx
   d:	75 61                	jne    0x70
   f:	f7 c1 00 00 08 00    	test   $0x80000,%ecx
  15:	74                   	.byte 0x74
[   34.884371][ T2413] RSP: 0018:ffa000000ce8fda8 EFLAGS: 00010246
[   34.890619][ T2413] RAX: 0000000000000000 RBX: 0000000000040dc0 RCX: 0000000000000000
[   34.898766][ T2413] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000040dc0
[   34.906910][ T2413] RBP: 000000000000000b R08: ffa000000ce8fd44 R09: ff11000104e13bc0
[   34.915074][ T2413] R10: ffa000000ce8feb0 R11: ffa0000023201000 R12: 0000000000000000
[   34.923264][ T2413] R13: 0000000000000001 R14: 0000000000000dc0 R15: 0000000003200000
[   34.931414][ T2413] FS:  00007f8ac1a03840(0000) GS:ff1100103e780000(0000) knlGS:0000000000000000
[   34.940527][ T2413] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   34.947348][ T2413] CR2: 000056306078b000 CR3: 00000001307f4001 CR4: 0000000000771ef0
[   34.955505][ T2413] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   34.963661][ T2413] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   34.971815][ T2413] PKRU: 55555554
[   34.975551][ T2413] Call Trace:
[   34.979030][ T2413]  <TASK>
[ 34.982179][ T2413] ? __warn (kernel/panic.c:693) 
[ 34.986427][ T2413] ? __alloc_pages_noprof (mm/page_alloc.c:4685 (discriminator 11)) 
[ 34.991965][ T2413] ? report_bug (lib/bug.c:180 lib/bug.c:219) 
[ 34.996643][ T2413] ? handle_bug (arch/x86/kernel/traps.c:239) 
[ 35.001163][ T2413] ? exc_invalid_op (arch/x86/kernel/traps.c:260 (discriminator 1)) 
[ 35.006011][ T2413] ? asm_exc_invalid_op (arch/x86/include/asm/idtentry.h:621) 
[ 35.011233][ T2413] ? __alloc_pages_noprof (mm/page_alloc.c:4685 (discriminator 11)) 
[ 35.016765][ T2413] __kmalloc_large_node (mm/slub.c:4069) 
[ 35.022043][ T2413] __kmalloc_noprof (arch/x86/include/asm/bitops.h:417 include/asm-generic/getorder.h:46 mm/slub.c:4113 mm/slub.c:4136) 
[ 35.027066][ T2413] ? __do_sys_swapon (mm/swapfile.c:3173) 
[ 35.032196][ T2413] ? __do_sys_swapon (mm/swapfile.c:3173) 
[ 35.037290][ T2413] ? __do_sys_swapon (mm/swapfile.c:3167) 
[ 35.042379][ T2413] __do_sys_swapon (mm/swapfile.c:3173) 
[ 35.047300][ T2413] do_syscall_64 (arch/x86/entry/common.c:52 (discriminator 1) arch/x86/entry/common.c:83 (discriminator 1)) 
[ 35.051955][ T2413] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) 
[   35.058002][ T2413] RIP: 0033:0x7f8ac1bcef97
[ 35.062571][ T2413] Code: 73 01 c3 48 8b 0d 69 2e 0d 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 b8 a7 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 39 2e 0d 00 f7 d8 64 89 01 48
All code
========
   0:	73 01                	jae    0x3
   2:	c3                   	retq   
   3:	48 8b 0d 69 2e 0d 00 	mov    0xd2e69(%rip),%rcx        # 0xd2e73
   a:	f7 d8                	neg    %eax
   c:	64 89 01             	mov    %eax,%fs:(%rcx)
   f:	48 83 c8 ff          	or     $0xffffffffffffffff,%rax
  13:	c3                   	retq   
  14:	66 2e 0f 1f 84 00 00 	nopw   %cs:0x0(%rax,%rax,1)
  1b:	00 00 00 
  1e:	0f 1f 44 00 00       	nopl   0x0(%rax,%rax,1)
  23:	b8 a7 00 00 00       	mov    $0xa7,%eax
  28:	0f 05                	syscall 
  2a:*	48 3d 01 f0 ff ff    	cmp    $0xfffffffffffff001,%rax		<-- trapping instruction
  30:	73 01                	jae    0x33
  32:	c3                   	retq   
  33:	48 8b 0d 39 2e 0d 00 	mov    0xd2e39(%rip),%rcx        # 0xd2e73
  3a:	f7 d8                	neg    %eax
  3c:	64 89 01             	mov    %eax,%fs:(%rcx)
  3f:	48                   	rex.W

Code starting with the faulting instruction
===========================================
   0:	48 3d 01 f0 ff ff    	cmp    $0xfffffffffffff001,%rax
   6:	73 01                	jae    0x9
   8:	c3                   	retq   
   9:	48 8b 0d 39 2e 0d 00 	mov    0xd2e39(%rip),%rcx        # 0xd2e49
  10:	f7 d8                	neg    %eax
  12:	64 89 01             	mov    %eax,%fs:(%rcx)
  15:	48                   	rex.W
[   35.063745][ T1492] is_virt=false
[   35.082007][ T2413] RSP: 002b:00007fffa761ac08 EFLAGS: 00000246 ORIG_RAX: 00000000000000a7
[   35.082010][ T2413] RAX: ffffffffffffffda RBX: 000056306077c190 RCX: 00007f8ac1bcef97
[   35.082010][ T2413] RDX: 0000000000008001 RSI: 0000000000008001 RDI: 000056306077c190
[   35.082011][ T2413] RBP: 0000000000008001 R08: 0000000000000ff6 R09: 0000000000001000
[   35.082012][ T2413] R10: 4e45505355533253 R11: 0000000000000246 R12: 00007fffa761ae3c
[   35.082012][ T2413] R13: 0000000000000001 R14: 0000003200000000 R15: 000056306077cfe0
[   35.082014][ T2413]  </TASK>
[   35.082015][ T2413] ---[ end trace 0000000000000000 ]---



The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20240624/202406241651.963e3e78-oliver.sang@intel.com



-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki




* Re: [linux-next:master] [mm] 0fa2857d23: WARNING:at_mm/page_alloc.c:#__alloc_pages_noprof
  2024-06-24  8:49 [linux-next:master] [mm] 0fa2857d23: WARNING:at_mm/page_alloc.c:#__alloc_pages_noprof kernel test robot
@ 2024-06-24 12:05 ` Yosry Ahmed
  2024-06-24 13:06   ` Usama Arif
  2024-06-24 18:33   ` Matthew Wilcox
  0 siblings, 2 replies; 24+ messages in thread
From: Yosry Ahmed @ 2024-06-24 12:05 UTC (permalink / raw)
  To: kernel test robot
  Cc: Usama Arif, oe-lkp, lkp, Linux Memory Management List,
	Andrew Morton, Chengming Zhou, Nhat Pham, David Hildenbrand,
	Huang, Ying, Hugh Dickins, Johannes Weiner, Matthew Wilcox,
	Shakeel Butt, Andi Kleen, linux-kernel

On Mon, Jun 24, 2024 at 1:49 AM kernel test robot <oliver.sang@intel.com> wrote:
>
>
>
> Hello,
>
> kernel test robot noticed "WARNING:at_mm/page_alloc.c:#__alloc_pages_noprof" on:
>
> commit: 0fa2857d23aa170e5e28d13c467b303b0065aad8 ("mm: store zero pages to be swapped out in a bitmap")
> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master

This is coming from WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp), and
is triggered by the new bitmap_zalloc() call in the swapon path. For a
sufficiently large swapfile, bitmap_zalloc() (which uses kmalloc()
under the hood) cannot be used to allocate the bitmap.

Usama, I think we want to use a variant of kvmalloc() here.
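
A rough sketch of the arithmetic behind the warning (the numbers below are
an illustration, not taken from this report), assuming 4 KiB pages on a
64-bit kernel:

/*
 * bitmap_zalloc(maxpages, GFP_KERNEL) kmallocs one physically contiguous
 * chunk of BITS_TO_LONGS(maxpages) * sizeof(long) bytes.
 *
 * e.g. for a 256 GiB swap device:
 *   maxpages = 256 GiB / 4 KiB  = 67,108,864 bits
 *   bytes    = 67,108,864 / 8   = 8 MiB
 *   order    = get_order(8 MiB) = 11
 *
 * which is larger than MAX_PAGE_ORDER (10), so __alloc_pages_noprof()
 * hits WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp).
 */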



* Re: [linux-next:master] [mm] 0fa2857d23: WARNING:at_mm/page_alloc.c:#__alloc_pages_noprof
  2024-06-24 12:05 ` Yosry Ahmed
@ 2024-06-24 13:06   ` Usama Arif
  2024-06-24 15:26     ` Hugh Dickins
  2024-06-24 18:33   ` Matthew Wilcox
  1 sibling, 1 reply; 24+ messages in thread
From: Usama Arif @ 2024-06-24 13:06 UTC (permalink / raw)
  To: Yosry Ahmed, kernel test robot
  Cc: oe-lkp, lkp, Linux Memory Management List, Andrew Morton,
	Chengming Zhou, Nhat Pham, David Hildenbrand, Huang, Ying,
	Hugh Dickins, Johannes Weiner, Matthew Wilcox, Shakeel Butt,
	Andi Kleen, linux-kernel


On 24/06/2024 15:05, Yosry Ahmed wrote:
> On Mon, Jun 24, 2024 at 1:49 AM kernel test robot <oliver.sang@intel.com> wrote:
>>
>>
>> Hello,
>>
>> kernel test robot noticed "WARNING:at_mm/page_alloc.c:#__alloc_pages_noprof" on:
>>
>> commit: 0fa2857d23aa170e5e28d13c467b303b0065aad8 ("mm: store zero pages to be swapped out in a bitmap")
>> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
> This is coming from WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp), and
> is triggered by the new bitmap_zalloc() call in the swapon path. For a
> sufficiently large swapfile, bitmap_zalloc() (which uses kmalloc()
> under the hood) cannot be used to allocate the bitmap.
>
> Usama, I think we want to use a variant of kvmalloc() here.

Yes, just testing with the diff below now. The patch is not in mm-stable
yet, so I will just send another revision with the diff below included. Thanks!


diff --git a/mm/swapfile.c b/mm/swapfile.c
index 0b8270359bcf..2263f71baa31 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2643,7 +2643,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
         free_percpu(p->cluster_next_cpu);
         p->cluster_next_cpu = NULL;
         vfree(swap_map);
-       bitmap_free(p->zeromap);
+       kvfree(p->zeromap);
         kvfree(cluster_info);
         /* Destroy swap account information */
         swap_cgroup_swapoff(p->type);
@@ -3170,7 +3170,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
                 goto bad_swap_unlock_inode;
         }

-       p->zeromap = bitmap_zalloc(maxpages, GFP_KERNEL);
+       p->zeromap = kvzalloc(DIV_ROUND_UP(maxpages, 8), GFP_KERNEL);
         if (!p->zeromap) {
                 error = -ENOMEM;
                 goto bad_swap_unlock_inode;




* Re: [linux-next:master] [mm] 0fa2857d23: WARNING:at_mm/page_alloc.c:#__alloc_pages_noprof
  2024-06-24 13:06   ` Usama Arif
@ 2024-06-24 15:26     ` Hugh Dickins
  2024-06-24 15:39       ` Usama Arif
  0 siblings, 1 reply; 24+ messages in thread
From: Hugh Dickins @ 2024-06-24 15:26 UTC (permalink / raw)
  To: Usama Arif
  Cc: Yosry Ahmed, kernel test robot, oe-lkp, lkp,
	Linux Memory Management List, Andrew Morton, Chengming Zhou,
	Nhat Pham, David Hildenbrand, Huang, Ying, Hugh Dickins,
	Johannes Weiner, Matthew Wilcox, Shakeel Butt, Andi Kleen,
	linux-kernel


On Mon, 24 Jun 2024, Usama Arif wrote:
> On 24/06/2024 15:05, Yosry Ahmed wrote:
> > On Mon, Jun 24, 2024 at 1:49 AM kernel test robot <oliver.sang@intel.com>
> > wrote:
> >>
> >>
> >> Hello,
> >>
> >> kernel test robot noticed
> >> "WARNING:at_mm/page_alloc.c:#__alloc_pages_noprof" on:
> >>
> >> commit: 0fa2857d23aa170e5e28d13c467b303b0065aad8 ("mm: store zero pages to
> >> be swapped out in a bitmap")
> >> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
> > This is coming from WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp), and
> > is triggered by the new bitmap_zalloc() call in the swapon path. For a
> > sufficiently large swapfile, bitmap_zalloc() (which uses kmalloc()
> > under the hood) cannot be used to allocate the bitmap.
> >
> > Usama, I think we want to use a variant of kvmalloc() here.

Yes, I hit the same problem with swapon in my testing,
and had been intending to send a patch.

> 
> Yes, just testing with below diff now. The patch is not in mm-stable yet, so
> will just send another revision with below diff included. Thanks!
> 
> 
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 0b8270359bcf..2263f71baa31 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -2643,7 +2643,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *,
> specialfile)
>         free_percpu(p->cluster_next_cpu);
>         p->cluster_next_cpu = NULL;
>         vfree(swap_map);
> -       bitmap_free(p->zeromap);
> +       kvfree(p->zeromap);

Yes.

>         kvfree(cluster_info);
>         /* Destroy swap account information */
>         swap_cgroup_swapoff(p->type);
> @@ -3170,7 +3170,7 @@ SYSCALL_DEFINE2(swapon, const char __user *,
> specialfile, int, swap_flags)
>                 goto bad_swap_unlock_inode;
>         }
> 
> -       p->zeromap = bitmap_zalloc(maxpages, GFP_KERNEL);
> +       p->zeromap = kvzalloc(DIV_ROUND_UP(maxpages, 8), GFP_KERNEL);

No, 8 is not right for 32-bit kernels. I think you want
	p->zeromap = kvzalloc(BITS_TO_LONGS(maxpages), GFP_KERNEL);
but please check it carefully, I'm easily confused by such conversions.

Hugh

>         if (!p->zeromap) {
>                 error = -ENOMEM;
>                 goto bad_swap_unlock_inode;


* Re: [linux-next:master] [mm] 0fa2857d23: WARNING:at_mm/page_alloc.c:#__alloc_pages_noprof
  2024-06-24 15:26     ` Hugh Dickins
@ 2024-06-24 15:39       ` Usama Arif
  2024-06-24 15:55         ` Hugh Dickins
  2024-06-24 16:56         ` Yosry Ahmed
  0 siblings, 2 replies; 24+ messages in thread
From: Usama Arif @ 2024-06-24 15:39 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Yosry Ahmed, kernel test robot, oe-lkp, lkp,
	Linux Memory Management List, Andrew Morton, Chengming Zhou,
	Nhat Pham, David Hildenbrand, Huang, Ying, Johannes Weiner,
	Matthew Wilcox, Shakeel Butt, Andi Kleen, linux-kernel


On 24/06/2024 18:26, Hugh Dickins wrote:
> On Mon, 24 Jun 2024, Usama Arif wrote:
>> On 24/06/2024 15:05, Yosry Ahmed wrote:
>>> On Mon, Jun 24, 2024 at 1:49 AM kernel test robot <oliver.sang@intel.com>
>>> wrote:
>>>>
>>>> Hello,
>>>>
>>>> kernel test robot noticed
>>>> "WARNING:at_mm/page_alloc.c:#__alloc_pages_noprof" on:
>>>>
>>>> commit: 0fa2857d23aa170e5e28d13c467b303b0065aad8 ("mm: store zero pages to
>>>> be swapped out in a bitmap")
>>>> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
>>> This is coming from WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp), and
>>> is triggered by the new bitmap_zalloc() call in the swapon path. For a
>>> sufficiently large swapfile, bitmap_zalloc() (which uses kmalloc()
>>> under the hood) cannot be used to allocate the bitmap.
>>>
>>> Usama, I think we want to use a variant of kvmalloc() here.
> Yes, I hit the same problem with swapon in my testing,
> and had been intending to send a patch.
>
>> Yes, just testing with below diff now. The patch is not in mm-stable yet, so
>> will just send another revision with below diff included. Thanks!
>>
>>
>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index 0b8270359bcf..2263f71baa31 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -2643,7 +2643,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *,
>> specialfile)
>>          free_percpu(p->cluster_next_cpu);
>>          p->cluster_next_cpu = NULL;
>>          vfree(swap_map);
>> -       bitmap_free(p->zeromap);
>> +       kvfree(p->zeromap);
> Yes.
>
>>          kvfree(cluster_info);
>>          /* Destroy swap account information */
>>          swap_cgroup_swapoff(p->type);
>> @@ -3170,7 +3170,7 @@ SYSCALL_DEFINE2(swapon, const char __user *,
>> specialfile, int, swap_flags)
>>                  goto bad_swap_unlock_inode;
>>          }
>>
>> -       p->zeromap = bitmap_zalloc(maxpages, GFP_KERNEL);
>> +       p->zeromap = kvzalloc(DIV_ROUND_UP(maxpages, 8), GFP_KERNEL);
> No, 8 is not right for 32-bit kernels. I think you want
> 	p->zeromap = kvzalloc(BITS_TO_LONGS(maxpages), GFP_KERNEL);
> but please check it carefully, I'm easily confused by such conversions.
>
> Hugh

Ah yes, I didn't take into account 32-bit kernels. I think it's supposed to be

  p->zeromap = kvzalloc(BITS_TO_LONGS(maxpages) * sizeof(unsigned long), GFP_KERNEL);

if using BITS_TO_LONGS.
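
As a quick sanity check of the sizes (example numbers, not from this
report): with 4 KiB pages, an 8 GiB swap area has maxpages = 2,097,152,
so BITS_TO_LONGS(maxpages) is 32,768 on 64-bit and 65,536 on 32-bit, and
multiplying by sizeof(unsigned long) gives 256 KiB either way.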

Will wait some time in case there are more comments and then send out
another version.

Thanks!

>>          if (!p->zeromap) {
>>                  error = -ENOMEM;
>>                  goto bad_swap_unlock_inode;



* Re: [linux-next:master] [mm] 0fa2857d23: WARNING:at_mm/page_alloc.c:#__alloc_pages_noprof
  2024-06-24 15:39       ` Usama Arif
@ 2024-06-24 15:55         ` Hugh Dickins
  2024-06-24 16:56         ` Yosry Ahmed
  1 sibling, 0 replies; 24+ messages in thread
From: Hugh Dickins @ 2024-06-24 15:55 UTC (permalink / raw)
  To: Usama Arif
  Cc: Hugh Dickins, Yosry Ahmed, kernel test robot, oe-lkp, lkp,
	Linux Memory Management List, Andrew Morton, Chengming Zhou,
	Nhat Pham, David Hildenbrand, Huang, Ying, Johannes Weiner,
	Matthew Wilcox, Shakeel Butt, Andi Kleen, linux-kernel


On Mon, 24 Jun 2024, Usama Arif wrote:
> On 24/06/2024 18:26, Hugh Dickins wrote:
> > On Mon, 24 Jun 2024, Usama Arif wrote:
> >> On 24/06/2024 15:05, Yosry Ahmed wrote:
> >>> On Mon, Jun 24, 2024 at 1:49 AM kernel test robot <oliver.sang@intel.com>
> >>> wrote:
> >>>>
> >>>> Hello,
> >>>>
> >>>> kernel test robot noticed
> >>>> "WARNING:at_mm/page_alloc.c:#__alloc_pages_noprof" on:
> >>>>
> >>>> commit: 0fa2857d23aa170e5e28d13c467b303b0065aad8 ("mm: store zero pages
> >>>> to
> >>>> be swapped out in a bitmap")
> >>>> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
> >>> This is coming from WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp), and
> >>> is triggered by the new bitmap_zalloc() call in the swapon path. For a
> >>> sufficiently large swapfile, bitmap_zalloc() (which uses kmalloc()
> >>> under the hood) cannot be used to allocate the bitmap.
> >>>
> >>> Usama, I think we want to use a variant of kvmalloc() here.
> > Yes, I hit the same problem with swapon in my testing,
> > and had been intending to send a patch.
> >
> >> Yes, just testing with below diff now. The patch is not in mm-stable yet,
> >> so
> >> will just send another revision with below diff included. Thanks!
> >>
> >>
> >> diff --git a/mm/swapfile.c b/mm/swapfile.c
> >> index 0b8270359bcf..2263f71baa31 100644
> >> --- a/mm/swapfile.c
> >> +++ b/mm/swapfile.c
> >> @@ -2643,7 +2643,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *,
> >> specialfile)
> >>          free_percpu(p->cluster_next_cpu);
> >>          p->cluster_next_cpu = NULL;
> >>          vfree(swap_map);
> >> -       bitmap_free(p->zeromap);
> >> +       kvfree(p->zeromap);
> > Yes.
> >
> >>          kvfree(cluster_info);
> >>          /* Destroy swap account information */
> >>          swap_cgroup_swapoff(p->type);
> >> @@ -3170,7 +3170,7 @@ SYSCALL_DEFINE2(swapon, const char __user *,
> >> specialfile, int, swap_flags)
> >>                  goto bad_swap_unlock_inode;
> >>          }
> >>
> >> -       p->zeromap = bitmap_zalloc(maxpages, GFP_KERNEL);
> >> +       p->zeromap = kvzalloc(DIV_ROUND_UP(maxpages, 8), GFP_KERNEL);
> > No, 8 is not right for 32-bit kernels. I think you want
> > 	p->zeromap = kvzalloc(BITS_TO_LONGS(maxpages), GFP_KERNEL);
> > but please check it carefully, I'm easily confused by such conversions.
> >
> > Hugh
> 
> Ah yes, didnt take into account 32-bit kernel. I think its supposed to be
> 
>  p->zeromap = kvzalloc(BITS_TO_LONGS(maxpages) * sizeof(unsigned long),
> GFP_KERNEL);
> 
> if using BITS_TO_LONGS.

No doubt you're right (I'm glad I already admitted to being confused).
Personally, I'd just say sizeof(long), but whatever you prefer.

Hugh

> 
> Will wait sometime incase there are more comments and will send out another
> version.
> 
> Thanks!
> 
> >>          if (!p->zeromap) {
> >>                  error = -ENOMEM;
> >>                  goto bad_swap_unlock_inode;


* Re: [linux-next:master] [mm] 0fa2857d23: WARNING:at_mm/page_alloc.c:#__alloc_pages_noprof
  2024-06-24 15:39       ` Usama Arif
  2024-06-24 15:55         ` Hugh Dickins
@ 2024-06-24 16:56         ` Yosry Ahmed
  2024-06-24 17:26           ` Usama Arif
  1 sibling, 1 reply; 24+ messages in thread
From: Yosry Ahmed @ 2024-06-24 16:56 UTC (permalink / raw)
  To: Usama Arif
  Cc: Hugh Dickins, kernel test robot, oe-lkp, lkp,
	Linux Memory Management List, Andrew Morton, Chengming Zhou,
	Nhat Pham, David Hildenbrand, Huang, Ying, Johannes Weiner,
	Matthew Wilcox, Shakeel Butt, Andi Kleen, linux-kernel,
	Vlastimil Babka (SUSE), Yury Norov, Rasmus Villemoes

[..]
> >>
> >> -       p->zeromap = bitmap_zalloc(maxpages, GFP_KERNEL);
> >> +       p->zeromap = kvzalloc(DIV_ROUND_UP(maxpages, 8), GFP_KERNEL);
> > No, 8 is not right for 32-bit kernels. I think you want
> >       p->zeromap = kvzalloc(BITS_TO_LONGS(maxpages), GFP_KERNEL);
> > but please check it carefully, I'm easily confused by such conversions.
> >
> > Hugh
>
> Ah yes, didnt take into account 32-bit kernel. I think its supposed to be
>
>   p->zeromap = kvzalloc(BITS_TO_LONGS(maxpages) * sizeof(unsigned long),
> GFP_KERNEL);

You can do something similar to bitmap_zalloc() and use:

kvmalloc_array(BITS_TO_LONGS(nbits), sizeof(unsigned long), GFP_KERNEL | __GFP_ZERO)

I don't see a kvzalloc_array() variant to use directly, but it should
be trivial to add it. I can see other users of kvmalloc_array() that
pass in __GFP_ZERO (e.g. fs/ntfs3/bitmap.c).

Or you could take it a step further and add bitmap_kvzalloc(),
assuming the maintainers are open to that.

The main reason I want to avoid doing the multiplication inline is
because kvmalloc_array() has a check_mul_overflow() check, and I
assume it must have been added for a reason.
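
For illustration, a minimal sketch of such a variant (not an existing
API), assuming it simply mirrors how kvcalloc() wraps kvmalloc_array()
with __GFP_ZERO:

/* Sketch only: a zeroing kvmalloc_array() variant; the name is hypothetical. */
static inline void *kvzalloc_array(size_t n, size_t size, gfp_t flags)
{
	return kvmalloc_array(n, size, flags | __GFP_ZERO);
}

The swapon call site would then be something like
p->zeromap = kvzalloc_array(BITS_TO_LONGS(maxpages), sizeof(unsigned long), GFP_KERNEL);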



* Re: [linux-next:master] [mm] 0fa2857d23: WARNING:at_mm/page_alloc.c:#__alloc_pages_noprof
  2024-06-24 16:56         ` Yosry Ahmed
@ 2024-06-24 17:26           ` Usama Arif
  2024-06-24 17:31             ` Yosry Ahmed
  0 siblings, 1 reply; 24+ messages in thread
From: Usama Arif @ 2024-06-24 17:26 UTC (permalink / raw)
  To: Yosry Ahmed, Hugh Dickins
  Cc: kernel test robot, oe-lkp, lkp, Linux Memory Management List,
	Andrew Morton, Chengming Zhou, Nhat Pham, David Hildenbrand,
	Huang, Ying, Johannes Weiner, Matthew Wilcox, Shakeel Butt,
	Andi Kleen, linux-kernel, Vlastimil Babka (SUSE), Yury Norov,
	Rasmus Villemoes


On 24/06/2024 19:56, Yosry Ahmed wrote:
> [..]
>>>> -       p->zeromap = bitmap_zalloc(maxpages, GFP_KERNEL);
>>>> +       p->zeromap = kvzalloc(DIV_ROUND_UP(maxpages, 8), GFP_KERNEL);
>>> No, 8 is not right for 32-bit kernels. I think you want
>>>        p->zeromap = kvzalloc(BITS_TO_LONGS(maxpages), GFP_KERNEL);
>>> but please check it carefully, I'm easily confused by such conversions.
>>>
>>> Hugh
>> Ah yes, didnt take into account 32-bit kernel. I think its supposed to be
>>
>>    p->zeromap = kvzalloc(BITS_TO_LONGS(maxpages) * sizeof(unsigned long),
>> GFP_KERNEL);
> You can do something similar to bitmap_zalloc() and use:
>
> kvmalloc_array(BITS_TO_LONGS(nbits), sizeof(unsigned long), GFP_KERNEL
> | __GFP_ZERO)
>
> I don't see a kvzalloc_array() variant to use directly, but it should
> be trivial to add it. I can see other users of kvmalloc_array() that
> pass in __GFP_ZERO (e.g. fs/ntfs3/bitmap.c).
>
> , or you could take it a step further and add bitmap_kvzalloc(),
> assuming the maintainers are open to that.

Thanks! bitmap_kvzalloc makes the most sense to me. It doesn't make sense
for a bitmap to be limited to MAX_PAGE_ORDER in size. I can add the patch
below at the start of the series and use it in the patch for the zeropage
swap optimization.


     bitmap: add support for virtually contiguous bitmap

     The current bitmap_zalloc API limits the allocation to MAX_PAGE_ORDER,
     which prevents larger-order bitmap allocations. Introduce
     bitmap_kvzalloc, which allows larger bitmap allocations.
     kvmalloc_array still attempts to allocate physically contiguous memory,
     but upon failure falls back to a non-contiguous (vmalloc) allocation.

     Suggested-by: Yosry Ahmed <yosryahmed@google.com>
     Signed-off-by: Usama Arif <usamaarif642@gmail.com>

diff --git a/include/linux/bitmap.h b/include/linux/bitmap.h
index 8c4768c44a01..881c2ff2e834 100644
--- a/include/linux/bitmap.h
+++ b/include/linux/bitmap.h
@@ -131,9 +131,11 @@ struct device;
   */
  unsigned long *bitmap_alloc(unsigned int nbits, gfp_t flags);
  unsigned long *bitmap_zalloc(unsigned int nbits, gfp_t flags);
+unsigned long *bitmap_kvzalloc(unsigned int nbits, gfp_t flags);
  unsigned long *bitmap_alloc_node(unsigned int nbits, gfp_t flags, int node);
  unsigned long *bitmap_zalloc_node(unsigned int nbits, gfp_t flags, int node);
  void bitmap_free(const unsigned long *bitmap);
+void bitmap_kvfree(const unsigned long *bitmap);

  DEFINE_FREE(bitmap, unsigned long *, if (_T) bitmap_free(_T))

diff --git a/lib/bitmap.c b/lib/bitmap.c
index b97692854966..eabbfb85fb45 100644
--- a/lib/bitmap.c
+++ b/lib/bitmap.c
@@ -727,6 +727,13 @@ unsigned long *bitmap_zalloc(unsigned int nbits, gfp_t flags)
  }
  EXPORT_SYMBOL(bitmap_zalloc);

+unsigned long *bitmap_kvzalloc(unsigned int nbits, gfp_t flags)
+{
+       return kvmalloc_array(BITS_TO_LONGS(nbits), sizeof(unsigned long),
+                             flags | __GFP_ZERO);
+}
+EXPORT_SYMBOL(bitmap_zalloc);
+
  unsigned long *bitmap_alloc_node(unsigned int nbits, gfp_t flags, int node)
  {
         return kmalloc_array_node(BITS_TO_LONGS(nbits), sizeof(unsigned long),
@@ -746,6 +753,12 @@ void bitmap_free(const unsigned long *bitmap)
  }
  EXPORT_SYMBOL(bitmap_free);

+void bitmap_kvfree(const unsigned long *bitmap)
+{
+       kvfree(bitmap);
+}
+EXPORT_SYMBOL(bitmap_kvfree);
+
  static void devm_bitmap_free(void *data)
  {
         unsigned long *bitmap = data;
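
With these helpers, the swapon/swapoff call sites from the earlier diff
would become something like (sketch only, not part of the patch above):

	p->zeromap = bitmap_kvzalloc(maxpages, GFP_KERNEL);
	...
	bitmap_kvfree(p->zeromap);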


>
> The main reason I want to avoid doing the multiplication inline is
> because kvmalloc_array() has a check_mul_overflow() check, and I
> assume it must have been added for a reason.



* Re: [linux-next:master] [mm] 0fa2857d23: WARNING:at_mm/page_alloc.c:#__alloc_pages_noprof
  2024-06-24 17:26           ` Usama Arif
@ 2024-06-24 17:31             ` Yosry Ahmed
  2024-06-24 18:26               ` Usama Arif
  0 siblings, 1 reply; 24+ messages in thread
From: Yosry Ahmed @ 2024-06-24 17:31 UTC (permalink / raw)
  To: Usama Arif
  Cc: Hugh Dickins, kernel test robot, oe-lkp, lkp,
	Linux Memory Management List, Andrew Morton, Chengming Zhou,
	Nhat Pham, David Hildenbrand, Huang, Ying, Johannes Weiner,
	Matthew Wilcox, Shakeel Butt, Andi Kleen, linux-kernel,
	Vlastimil Babka (SUSE), Yury Norov, Rasmus Villemoes

On Mon, Jun 24, 2024 at 10:26 AM Usama Arif <usamaarif642@gmail.com> wrote:
>
>
> On 24/06/2024 19:56, Yosry Ahmed wrote:
> > [..]
> >>>> -       p->zeromap = bitmap_zalloc(maxpages, GFP_KERNEL);
> >>>> +       p->zeromap = kvzalloc(DIV_ROUND_UP(maxpages, 8), GFP_KERNEL);
> >>> No, 8 is not right for 32-bit kernels. I think you want
> >>>        p->zeromap = kvzalloc(BITS_TO_LONGS(maxpages), GFP_KERNEL);
> >>> but please check it carefully, I'm easily confused by such conversions.
> >>>
> >>> Hugh
> >> Ah yes, didnt take into account 32-bit kernel. I think its supposed to be
> >>
> >>    p->zeromap = kvzalloc(BITS_TO_LONGS(maxpages) * sizeof(unsigned long),
> >> GFP_KERNEL);
> > You can do something similar to bitmap_zalloc() and use:
> >
> > kvmalloc_array(BITS_TO_LONGS(nbits), sizeof(unsigned long), GFP_KERNEL
> > | __GFP_ZERO)
> >
> > I don't see a kvzalloc_array() variant to use directly, but it should
> > be trivial to add it. I can see other users of kvmalloc_array() that
> > pass in __GFP_ZERO (e.g. fs/ntfs3/bitmap.c).
> >
> > , or you could take it a step further and add bitmap_kvzalloc(),
> > assuming the maintainers are open to that.
>
> Thanks! bitmap_kvzalloc makes most sense to me. It doesnt make sense
> that bitmap should only be limited to MAX_PAGE_ORDER size. I can add
> this patch below at the start of the series and use it in the patch for
> zeropage swap optimization.
>
>
>      bitmap: add support for virtually contiguous bitmap
>
>      The current bitmap_zalloc API limits the allocation to MAX_PAGE_ORDER,
>      which prevents larger order bitmap allocations. Introduce
>      bitmap_kvzalloc that will allow larger allocations of bitmap.
>      kvmalloc_array still attempts to allocate physically contiguous memory,
>      but upon failure, falls back to non-contiguous (vmalloc) allocation.
>
>      Suggested-by: Yosry Ahmed <yosryahmed@google.com>
>      Signed-off-by: Usama Arif <usamaarif642@gmail.com>
>

LGTM with a small fix below.

> diff --git a/include/linux/bitmap.h b/include/linux/bitmap.h
> index 8c4768c44a01..881c2ff2e834 100644
> --- a/include/linux/bitmap.h
> +++ b/include/linux/bitmap.h
> @@ -131,9 +131,11 @@ struct device;
>    */
>   unsigned long *bitmap_alloc(unsigned int nbits, gfp_t flags);
>   unsigned long *bitmap_zalloc(unsigned int nbits, gfp_t flags);
> +unsigned long *bitmap_kvzalloc(unsigned int nbits, gfp_t flags);
>   unsigned long *bitmap_alloc_node(unsigned int nbits, gfp_t flags, int
> node);
>   unsigned long *bitmap_zalloc_node(unsigned int nbits, gfp_t flags, int
> node);
>   void bitmap_free(const unsigned long *bitmap);
> +void bitmap_kvfree(const unsigned long *bitmap);
>
>   DEFINE_FREE(bitmap, unsigned long *, if (_T) bitmap_free(_T))
>
> diff --git a/lib/bitmap.c b/lib/bitmap.c
> index b97692854966..eabbfb85fb45 100644
> --- a/lib/bitmap.c
> +++ b/lib/bitmap.c
> @@ -727,6 +727,13 @@ unsigned long *bitmap_zalloc(unsigned int nbits,
> gfp_t flags)
>   }
>   EXPORT_SYMBOL(bitmap_zalloc);
>
> +unsigned long *bitmap_kvzalloc(unsigned int nbits, gfp_t flags)
> +{
> +       return kvmalloc_array(BITS_TO_LONGS(nbits), sizeof(unsigned long),
> +                             flags | __GFP_ZERO);
> +}
> +EXPORT_SYMBOL(bitmap_zalloc);

EXPORT_SYMBOL(bitmap_kvzalloc)*



* Re: [linux-next:master] [mm] 0fa2857d23: WARNING:at_mm/page_alloc.c:#__alloc_pages_noprof
  2024-06-24 17:31             ` Yosry Ahmed
@ 2024-06-24 18:26               ` Usama Arif
  2024-06-27 11:05                 ` Usama Arif
  0 siblings, 1 reply; 24+ messages in thread
From: Usama Arif @ 2024-06-24 18:26 UTC (permalink / raw)
  To: Yosry Ahmed, Hugh Dickins, Yury Norov, Rasmus Villemoes
  Cc: kernel test robot, oe-lkp, lkp, Linux Memory Management List,
	Andrew Morton, Chengming Zhou, Nhat Pham, David Hildenbrand,
	Huang, Ying, Johannes Weiner, Matthew Wilcox, Shakeel Butt,
	Andi Kleen, linux-kernel, Vlastimil Babka (SUSE), Yury Norov,
	Rasmus Villemoes


On 24/06/2024 20:31, Yosry Ahmed wrote:
> On Mon, Jun 24, 2024 at 10:26 AM Usama Arif <usamaarif642@gmail.com> wrote:
>>
>> On 24/06/2024 19:56, Yosry Ahmed wrote:
>>> [..]
>>>>>> -       p->zeromap = bitmap_zalloc(maxpages, GFP_KERNEL);
>>>>>> +       p->zeromap = kvzalloc(DIV_ROUND_UP(maxpages, 8), GFP_KERNEL);
>>>>> No, 8 is not right for 32-bit kernels. I think you want
>>>>>         p->zeromap = kvzalloc(BITS_TO_LONGS(maxpages), GFP_KERNEL);
>>>>> but please check it carefully, I'm easily confused by such conversions.
>>>>>
>>>>> Hugh
>>>> Ah yes, didnt take into account 32-bit kernel. I think its supposed to be
>>>>
>>>>     p->zeromap = kvzalloc(BITS_TO_LONGS(maxpages) * sizeof(unsigned long),
>>>> GFP_KERNEL);
>>> You can do something similar to bitmap_zalloc() and use:
>>>
>>> kvmalloc_array(BITS_TO_LONGS(nbits), sizeof(unsigned long), GFP_KERNEL
>>> | __GFP_ZERO)
>>>
>>> I don't see a kvzalloc_array() variant to use directly, but it should
>>> be trivial to add it. I can see other users of kvmalloc_array() that
>>> pass in __GFP_ZERO (e.g. fs/ntfs3/bitmap.c).
>>>
>>> , or you could take it a step further and add bitmap_kvzalloc(),
>>> assuming the maintainers are open to that.
>> Thanks! bitmap_kvzalloc makes most sense to me. It doesnt make sense
>> that bitmap should only be limited to MAX_PAGE_ORDER size. I can add
>> this patch below at the start of the series and use it in the patch for
>> zeropage swap optimization.
>>
>>
>>       bitmap: add support for virtually contiguous bitmap
>>
>>       The current bitmap_zalloc API limits the allocation to MAX_PAGE_ORDER,
>>       which prevents larger order bitmap allocations. Introduce
>>       bitmap_kvzalloc that will allow larger allocations of bitmap.
>>       kvmalloc_array still attempts to allocate physically contiguous memory,
>>       but upon failure, falls back to non-contiguous (vmalloc) allocation.
>>
>>       Suggested-by: Yosry Ahmed <yosryahmed@google.com>
>>       Signed-off-by: Usama Arif <usamaarif642@gmail.com>
>>
> LGTM with a small fix below.
>
>> diff --git a/include/linux/bitmap.h b/include/linux/bitmap.h
>> index 8c4768c44a01..881c2ff2e834 100644
>> --- a/include/linux/bitmap.h
>> +++ b/include/linux/bitmap.h
>> @@ -131,9 +131,11 @@ struct device;
>>     */
>>    unsigned long *bitmap_alloc(unsigned int nbits, gfp_t flags);
>>    unsigned long *bitmap_zalloc(unsigned int nbits, gfp_t flags);
>> +unsigned long *bitmap_kvzalloc(unsigned int nbits, gfp_t flags);
>>    unsigned long *bitmap_alloc_node(unsigned int nbits, gfp_t flags, int
>> node);
>>    unsigned long *bitmap_zalloc_node(unsigned int nbits, gfp_t flags, int
>> node);
>>    void bitmap_free(const unsigned long *bitmap);
>> +void bitmap_kvfree(const unsigned long *bitmap);
>>
>>    DEFINE_FREE(bitmap, unsigned long *, if (_T) bitmap_free(_T))
>>
>> diff --git a/lib/bitmap.c b/lib/bitmap.c
>> index b97692854966..eabbfb85fb45 100644
>> --- a/lib/bitmap.c
>> +++ b/lib/bitmap.c
>> @@ -727,6 +727,13 @@ unsigned long *bitmap_zalloc(unsigned int nbits,
>> gfp_t flags)
>>    }
>>    EXPORT_SYMBOL(bitmap_zalloc);
>>
>> +unsigned long *bitmap_kvzalloc(unsigned int nbits, gfp_t flags)
>> +{
>> +       return kvmalloc_array(BITS_TO_LONGS(nbits), sizeof(unsigned long),
>> +                             flags | __GFP_ZERO);
>> +}
>> +EXPORT_SYMBOL(bitmap_zalloc);
> EXPORT_SYMBOL(bitmap_kvzalloc)*


Actually, does it make more sense to change the behaviour of the current
APIs as below, instead of the above patch? Or is there an expectation that
the current bitmap API is supposed to work only on physically contiguous
bits?

I believe in the kernel, if an allocation/free function starts with 'k' it
is physically contiguous, and with "kv" it is physically contiguous if
possible, otherwise virtually contiguous. The bitmap functions have neither
prefix, so we could change the current implementation. I believe it would
not impact the current users of these functions, as the first attempt is
still physically contiguous (which is how it works currently) and only upon
failure would the allocation be virtually contiguous; it would also let the
current bitmap API be used for allocations larger than MAX_PAGE_ORDER.

Yury Norov and Rasmus Villemoes, any views on this?

Thanks

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 7247e217e21b..ad771dc81afa 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -804,6 +804,7 @@ kvmalloc_array_node_noprof(size_t n, size_t size, gfp_t flags, int node)
  #define kvcalloc_node_noprof(_n,_s,_f,_node) kvmalloc_array_node_noprof(_n,_s,(_f)|__GFP_ZERO,_node)
  #define kvcalloc_noprof(...) kvcalloc_node_noprof(__VA_ARGS__, NUMA_NO_NODE)

+#define kvmalloc_array_node(...) alloc_hooks(kvmalloc_array_node_noprof(__VA_ARGS__))
  #define kvmalloc_array(...) alloc_hooks(kvmalloc_array_noprof(__VA_ARGS__))
  #define kvcalloc_node(...) alloc_hooks(kvcalloc_node_noprof(__VA_ARGS__))
  #define kvcalloc(...) alloc_hooks(kvcalloc_noprof(__VA_ARGS__))
diff --git a/lib/bitmap.c b/lib/bitmap.c
index b97692854966..272164dcbef1 100644
--- a/lib/bitmap.c
+++ b/lib/bitmap.c
@@ -716,7 +716,7 @@ void bitmap_fold(unsigned long *dst, const unsigned long *orig,

  unsigned long *bitmap_alloc(unsigned int nbits, gfp_t flags)
  {
-       return kmalloc_array(BITS_TO_LONGS(nbits), sizeof(unsigned long),
+       return kvmalloc_array(BITS_TO_LONGS(nbits), sizeof(unsigned long),
                              flags);
  }
  EXPORT_SYMBOL(bitmap_alloc);
@@ -729,7 +729,7 @@ EXPORT_SYMBOL(bitmap_zalloc);

  unsigned long *bitmap_alloc_node(unsigned int nbits, gfp_t flags, int node)
  {
-       return kmalloc_array_node(BITS_TO_LONGS(nbits), sizeof(unsigned long),
+       return kvmalloc_array_node(BITS_TO_LONGS(nbits), sizeof(unsigned long),
                                   flags, node);
  }
  EXPORT_SYMBOL(bitmap_alloc_node);
@@ -742,7 +742,7 @@ EXPORT_SYMBOL(bitmap_zalloc_node);

  void bitmap_free(const unsigned long *bitmap)
  {
-       kfree(bitmap);
+       kvfree(bitmap);
  }
  EXPORT_SYMBOL(bitmap_free);




* Re: [linux-next:master] [mm] 0fa2857d23: WARNING:at_mm/page_alloc.c:#__alloc_pages_noprof
  2024-06-24 12:05 ` Yosry Ahmed
  2024-06-24 13:06   ` Usama Arif
@ 2024-06-24 18:33   ` Matthew Wilcox
  2024-06-24 18:50     ` Usama Arif
  2024-06-24 18:53     ` Yosry Ahmed
  1 sibling, 2 replies; 24+ messages in thread
From: Matthew Wilcox @ 2024-06-24 18:33 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: kernel test robot, Usama Arif, oe-lkp, lkp,
	Linux Memory Management List, Andrew Morton, Chengming Zhou,
	Nhat Pham, David Hildenbrand, Huang, Ying, Hugh Dickins,
	Johannes Weiner, Shakeel Butt, Andi Kleen, linux-kernel

On Mon, Jun 24, 2024 at 05:05:56AM -0700, Yosry Ahmed wrote:
> On Mon, Jun 24, 2024 at 1:49 AM kernel test robot <oliver.sang@intel.com> wrote:
> > kernel test robot noticed "WARNING:at_mm/page_alloc.c:#__alloc_pages_noprof" on:
> >
> > commit: 0fa2857d23aa170e5e28d13c467b303b0065aad8 ("mm: store zero pages to be swapped out in a bitmap")
> > https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
> 
> This is coming from WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp), and
> is triggered by the new bitmap_zalloc() call in the swapon path. For a
> sufficiently large swapfile, bitmap_zalloc() (which uses kmalloc()
> under the hood) cannot be used to allocate the bitmap.

Do we need to use a bitmap?

We could place a special entry in the swapcache instead (there's
XA_ZERO_ENTRY already defined, and if we need a different entry that's
not XA_ZERO_ENTRY, there's room for a few hundred more special entries).



* Re: [linux-next:master] [mm] 0fa2857d23: WARNING:at_mm/page_alloc.c:#__alloc_pages_noprof
  2024-06-24 18:33   ` Matthew Wilcox
@ 2024-06-24 18:50     ` Usama Arif
  2024-06-24 18:53       ` Yosry Ahmed
  2024-06-24 18:54       ` Matthew Wilcox
  2024-06-24 18:53     ` Yosry Ahmed
  1 sibling, 2 replies; 24+ messages in thread
From: Usama Arif @ 2024-06-24 18:50 UTC (permalink / raw)
  To: Matthew Wilcox, Yosry Ahmed
  Cc: kernel test robot, oe-lkp, lkp, Linux Memory Management List,
	Andrew Morton, Chengming Zhou, Nhat Pham, David Hildenbrand,
	Huang, Ying, Hugh Dickins, Johannes Weiner, Shakeel Butt,
	Andi Kleen, linux-kernel


On 24/06/2024 21:33, Matthew Wilcox wrote:
> On Mon, Jun 24, 2024 at 05:05:56AM -0700, Yosry Ahmed wrote:
>> On Mon, Jun 24, 2024 at 1:49 AM kernel test robot <oliver.sang@intel.com> wrote:
>>> kernel test robot noticed "WARNING:at_mm/page_alloc.c:#__alloc_pages_noprof" on:
>>>
>>> commit: 0fa2857d23aa170e5e28d13c467b303b0065aad8 ("mm: store zero pages to be swapped out in a bitmap")
>>> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
>> This is coming from WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp), and
>> is triggered by the new bitmap_zalloc() call in the swapon path. For a
>> sufficiently large swapfile, bitmap_zalloc() (which uses kmalloc()
>> under the hood) cannot be used to allocate the bitmap.
> Do we need to use a bitmap?
>
> We could place a special entry in the swapcache instead (there's
> XA_ZERO_ENTRY already defined, and if we need a different entry that's
> not XA_ZERO_ENTRY, there's room for a few hundred more special entries).

I was going for the most space-efficient and simplest data structure,
which is a bitmap. I believe an xarray entry is either a pointer or an
integer between 0 and LONG_MAX? We could convert the individual bits into
integers and store them, and have another function to extract the stored
integer back into a bit, but I think that's basically a separate
bitmap_xarray API (which would probably take more space than a traditional
bitmap API, and I don't want to make this series dependent on something
like that), so I would prefer to use a bitmap.




* Re: [linux-next:master] [mm] 0fa2857d23: WARNING:at_mm/page_alloc.c:#__alloc_pages_noprof
  2024-06-24 18:33   ` Matthew Wilcox
  2024-06-24 18:50     ` Usama Arif
@ 2024-06-24 18:53     ` Yosry Ahmed
  2024-06-24 18:56       ` Matthew Wilcox
  1 sibling, 1 reply; 24+ messages in thread
From: Yosry Ahmed @ 2024-06-24 18:53 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: kernel test robot, Usama Arif, oe-lkp, lkp,
	Linux Memory Management List, Andrew Morton, Chengming Zhou,
	Nhat Pham, David Hildenbrand, Huang, Ying, Hugh Dickins,
	Johannes Weiner, Shakeel Butt, Andi Kleen, linux-kernel

On Mon, Jun 24, 2024 at 11:33 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Jun 24, 2024 at 05:05:56AM -0700, Yosry Ahmed wrote:
> > On Mon, Jun 24, 2024 at 1:49 AM kernel test robot <oliver.sang@intel.com> wrote:
> > > kernel test robot noticed "WARNING:at_mm/page_alloc.c:#__alloc_pages_noprof" on:
> > >
> > > commit: 0fa2857d23aa170e5e28d13c467b303b0065aad8 ("mm: store zero pages to be swapped out in a bitmap")
> > > https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
> >
> > This is coming from WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp), and
> > is triggered by the new bitmap_zalloc() call in the swapon path. For a
> > sufficiently large swapfile, bitmap_zalloc() (which uses kmalloc()
> > under the hood) cannot be used to allocate the bitmap.
>
> Do we need to use a bitmap?
>
> We could place a special entry in the swapcache instead (there's
> XA_ZERO_ENTRY already defined, and if we need a different entry that's
> not XA_ZERO_ENTRY, there's room for a few hundred more special entries).

After a page is swapped out during reclaim, __remove_mapping() will
call __delete_from_swap_cache() to replace the swap cache entry with a
shadow entry (which is an xa_value).

So I believe we cannot use a special xarray value without making
fundamental changes. We can, perhaps, try to pack an extra bit in the
shadow entry. In this case, we will need to change the swapin code to
check for this magic bit when looking up a folio in the swap cache,
and extra synchronization to make sure concurrent lookups do not
allocate and zero separate folios.

IOW, I think it's possible but probably with more complexity, and
perhaps not worth it.



* Re: [linux-next:master] [mm] 0fa2857d23: WARNING:at_mm/page_alloc.c:#__alloc_pages_noprof
  2024-06-24 18:50     ` Usama Arif
@ 2024-06-24 18:53       ` Yosry Ahmed
  2024-06-24 18:54       ` Matthew Wilcox
  1 sibling, 0 replies; 24+ messages in thread
From: Yosry Ahmed @ 2024-06-24 18:53 UTC (permalink / raw)
  To: Usama Arif
  Cc: Matthew Wilcox, kernel test robot, oe-lkp, lkp,
	Linux Memory Management List, Andrew Morton, Chengming Zhou,
	Nhat Pham, David Hildenbrand, Huang, Ying, Hugh Dickins,
	Johannes Weiner, Shakeel Butt, Andi Kleen, linux-kernel

On Mon, Jun 24, 2024 at 11:50 AM Usama Arif <usamaarif642@gmail.com> wrote:
>
>
> On 24/06/2024 21:33, Matthew Wilcox wrote:
> > On Mon, Jun 24, 2024 at 05:05:56AM -0700, Yosry Ahmed wrote:
> >> On Mon, Jun 24, 2024 at 1:49 AM kernel test robot <oliver.sang@intel.com> wrote:
> >>> kernel test robot noticed "WARNING:at_mm/page_alloc.c:#__alloc_pages_noprof" on:
> >>>
> >>> commit: 0fa2857d23aa170e5e28d13c467b303b0065aad8 ("mm: store zero pages to be swapped out in a bitmap")
> >>> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
> >> This is coming from WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp), and
> >> is triggered by the new bitmap_zalloc() call in the swapon path. For a
> >> sufficiently large swapfile, bitmap_zalloc() (which uses kmalloc()
> >> under the hood) cannot be used to allocate the bitmap.
> > Do we need to use a bitmap?
> >
> > We could place a special entry in the swapcache instead (there's
> > XA_ZERO_ENTRY already defined, and if we need a different entry that's
> > not XA_ZERO_ENTRY, there's room for a few hundred more special entries).
>
> I was going for the most space-efficient and simplest data structure,
> which is bitmap. I believe xarray is either pointer or integer between 0
> and LONG_MAX? We could convert the individual bits into integer and
> store them, and have another function to extract the integer stored in
> xarray to a bit, but I think thats basically a separate bitmap_xarray
> API (which would probably take more space than a traditional bitmap API,
> and I dont want to make this series dependent on something like that),
> so I would prefer to use bitmap.

I believe Matthew meant reusing the xarray used by the existing
swapcache, not adding a new one for this purpose. See my response.



* Re: [linux-next:master] [mm] 0fa2857d23: WARNING:at_mm/page_alloc.c:#__alloc_pages_noprof
  2024-06-24 18:50     ` Usama Arif
  2024-06-24 18:53       ` Yosry Ahmed
@ 2024-06-24 18:54       ` Matthew Wilcox
  1 sibling, 0 replies; 24+ messages in thread
From: Matthew Wilcox @ 2024-06-24 18:54 UTC (permalink / raw)
  To: Usama Arif
  Cc: Yosry Ahmed, kernel test robot, oe-lkp, lkp,
	Linux Memory Management List, Andrew Morton, Chengming Zhou,
	Nhat Pham, David Hildenbrand, Huang, Ying, Hugh Dickins,
	Johannes Weiner, Shakeel Butt, Andi Kleen, linux-kernel

On Mon, Jun 24, 2024 at 09:50:21PM +0300, Usama Arif wrote:
> On 24/06/2024 21:33, Matthew Wilcox wrote:
> > On Mon, Jun 24, 2024 at 05:05:56AM -0700, Yosry Ahmed wrote:
> > > On Mon, Jun 24, 2024 at 1:49 AM kernel test robot <oliver.sang@intel.com> wrote:
> > > > kernel test robot noticed "WARNING:at_mm/page_alloc.c:#__alloc_pages_noprof" on:
> > > > 
> > > > commit: 0fa2857d23aa170e5e28d13c467b303b0065aad8 ("mm: store zero pages to be swapped out in a bitmap")
> > > > https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
> > > This is coming from WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp), and
> > > is triggered by the new bitmap_zalloc() call in the swapon path. For a
> > > sufficiently large swapfile, bitmap_zalloc() (which uses kmalloc()
> > > under the hood) cannot be used to allocate the bitmap.
> > Do we need to use a bitmap?
> > 
> > We could place a special entry in the swapcache instead (there's
> > XA_ZERO_ENTRY already defined, and if we need a different entry that's
> > not XA_ZERO_ENTRY, there's room for a few hundred more special entries).
> 
> I was going for the most space-efficient and simplest data structure, which
> is bitmap. I believe xarray is either pointer or integer between 0 and
> LONG_MAX? We could convert the individual bits into integer and store them,
> and have another function to extract the integer stored in xarray to a bit,
> but I think thats basically a separate bitmap_xarray API (which would
> probably take more space than a traditional bitmap API, and I dont want to
> make this series dependent on something like that), so I would prefer to use
> bitmap.

But we already _have_ an xarray.  Instead of storing a swap entry in
it, we could store an XA_ZERO_ENTRY.

If there are long runs of zero pages, then this may not be the best
idea.  But then a bitmap isn't the best data structure for long runs
either.



* Re: [linux-next:master] [mm] 0fa2857d23: WARNING:at_mm/page_alloc.c:#__alloc_pages_noprof
  2024-06-24 18:53     ` Yosry Ahmed
@ 2024-06-24 18:56       ` Matthew Wilcox
  2024-06-24 18:57         ` Yosry Ahmed
  0 siblings, 1 reply; 24+ messages in thread
From: Matthew Wilcox @ 2024-06-24 18:56 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: kernel test robot, Usama Arif, oe-lkp, lkp,
	Linux Memory Management List, Andrew Morton, Chengming Zhou,
	Nhat Pham, David Hildenbrand, Huang, Ying, Hugh Dickins,
	Johannes Weiner, Shakeel Butt, Andi Kleen, linux-kernel

On Mon, Jun 24, 2024 at 11:53:30AM -0700, Yosry Ahmed wrote:
> After a page is swapped out during reclaim, __remove_mapping() will
> call __delete_from_swap_cache() to replace the swap cache entry with a
> shadow entry (which is an xa_value).

Special entries are disjoint from shadow entries.  Shadow entries have
the last two bits as 01 or 11 (are congruent to 1 or 3 modulo 4).
Special entries have values below 4096 which end in 10 (are congruent
to 2 modulo 4).
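
For reference, a rough sketch of that encoding (based on the helpers in
include/linux/xarray.h; an illustration, not a quote of the header):

/*
 * value ("shadow"-style) entries: xa_mk_value(v)    == (v << 1) | 1
 *   -> low bits 01 or 11, i.e. congruent to 1 or 3 mod 4
 * internal/special entries:       xa_mk_internal(v) == (v << 2) | 2
 *   -> low bits 10, i.e. congruent to 2 mod 4
 * XA_ZERO_ENTRY is xa_mk_internal(257), one of the reserved special
 * values below 4096.
 */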




* Re: [linux-next:master] [mm] 0fa2857d23: WARNING:at_mm/page_alloc.c:#__alloc_pages_noprof
  2024-06-24 18:56       ` Matthew Wilcox
@ 2024-06-24 18:57         ` Yosry Ahmed
  2024-06-24 19:26           ` Matthew Wilcox
  0 siblings, 1 reply; 24+ messages in thread
From: Yosry Ahmed @ 2024-06-24 18:57 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: kernel test robot, Usama Arif, oe-lkp, lkp,
	Linux Memory Management List, Andrew Morton, Chengming Zhou,
	Nhat Pham, David Hildenbrand, Huang, Ying, Hugh Dickins,
	Johannes Weiner, Shakeel Butt, Andi Kleen, linux-kernel

On Mon, Jun 24, 2024 at 11:56 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Jun 24, 2024 at 11:53:30AM -0700, Yosry Ahmed wrote:
> > After a page is swapped out during reclaim, __remove_mapping() will
> > call __delete_from_swap_cache() to replace the swap cache entry with a
> > shadow entry (which is an xa_value).
>
> Special entries are disjoint from shadow entries.  Shadow entries have
> the last two bits as 01 or 11 (are congruent to 1 or 3 modulo 4).
> Special entries have values below 4096 which end in 10 (are congruent
> to 2 modulo 4).

You are implying that we would no longer have a shadow entry for such
zero folios, because we will be storing a special entry instead.
Right?

This is the "fundamental" change I am talking about.



* Re: [linux-next:master] [mm] 0fa2857d23: WARNING:at_mm/page_alloc.c:#__alloc_pages_noprof
  2024-06-24 18:57         ` Yosry Ahmed
@ 2024-06-24 19:26           ` Matthew Wilcox
  2024-06-24 19:34             ` Yosry Ahmed
  0 siblings, 1 reply; 24+ messages in thread
From: Matthew Wilcox @ 2024-06-24 19:26 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: kernel test robot, Usama Arif, oe-lkp, lkp,
	Linux Memory Management List, Andrew Morton, Chengming Zhou,
	Nhat Pham, David Hildenbrand, Huang, Ying, Hugh Dickins,
	Johannes Weiner, Shakeel Butt, Andi Kleen, linux-kernel

On Mon, Jun 24, 2024 at 11:57:45AM -0700, Yosry Ahmed wrote:
> On Mon, Jun 24, 2024 at 11:56 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Mon, Jun 24, 2024 at 11:53:30AM -0700, Yosry Ahmed wrote:
> > > After a page is swapped out during reclaim, __remove_mapping() will
> > > call __delete_from_swap_cache() to replace the swap cache entry with a
> > > shadow entry (which is an xa_value).
> >
> > Special entries are disjoint from shadow entries.  Shadow entries have
> > the last two bits as 01 or 11 (are congruent to 1 or 3 modulo 4).
> > Special entries have values below 4096 which end in 10 (are congruent
> > to 2 modulo 4).
> 
> You are implying that we would no longer have a shadow entry for such
> zero folios, because we will be storing a special entry instead.
> Right?

umm ... maybe I have a misunderstanding here.

I'm saying that there wouldn't be a _swap_ entry here because the folio
wouldn't be stored anywhere on the swap device.  But there could be a
_shadow_ entry.  Although if the page is full of zeroes, it was probably
never referenced and doesn't really need a shadow entry.



* Re: [linux-next:master] [mm] 0fa2857d23: WARNING:at_mm/page_alloc.c:#__alloc_pages_noprof
  2024-06-24 19:26           ` Matthew Wilcox
@ 2024-06-24 19:34             ` Yosry Ahmed
  2024-06-24 19:50               ` Matthew Wilcox
  0 siblings, 1 reply; 24+ messages in thread
From: Yosry Ahmed @ 2024-06-24 19:34 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: kernel test robot, Usama Arif, oe-lkp, lkp,
	Linux Memory Management List, Andrew Morton, Chengming Zhou,
	Nhat Pham, David Hildenbrand, Huang, Ying, Hugh Dickins,
	Johannes Weiner, Shakeel Butt, Andi Kleen, linux-kernel

On Mon, Jun 24, 2024 at 12:26 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Jun 24, 2024 at 11:57:45AM -0700, Yosry Ahmed wrote:
> > On Mon, Jun 24, 2024 at 11:56 AM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Mon, Jun 24, 2024 at 11:53:30AM -0700, Yosry Ahmed wrote:
> > > > After a page is swapped out during reclaim, __remove_mapping() will
> > > > call __delete_from_swap_cache() to replace the swap cache entry with a
> > > > shadow entry (which is an xa_value).
> > >
> > > Special entries are disjoint from shadow entries.  Shadow entries have
> > > the last two bits as 01 or 11 (are congruent to 1 or 3 modulo 4).
> > > Special entries have values below 4096 which end in 10 (are congruent
> > > to 2 modulo 4).
> >
> > You are implying that we would no longer have a shadow entry for such
> > zero folios, because we will be storing a special entry instead.
> > Right?
>
> umm ... maybe I have a misunderstanding here.
>
> I'm saying that there wouldn't be a _swap_ entry here because the folio
> wouldn't be stored anywhere on the swap device.  But there could be a
> _shadow_ entry.  Although if the page is full of zeroes, it was probably
> never referenced and doesn't really need a shadow entry.

Is it possible to have a shadow entry AND a special entry (e.g.
XA_ZERO_ENTRY) at the same index? This is what would be required to
maintain the current behavior (assuming we really need the shadow
entries for such zeroed folios).


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [linux-next:master] [mm] 0fa2857d23: WARNING:at_mm/page_alloc.c:#__alloc_pages_noprof
  2024-06-24 19:34             ` Yosry Ahmed
@ 2024-06-24 19:50               ` Matthew Wilcox
  2024-06-24 20:39                 ` Shakeel Butt
  0 siblings, 1 reply; 24+ messages in thread
From: Matthew Wilcox @ 2024-06-24 19:50 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: kernel test robot, Usama Arif, oe-lkp, lkp,
	Linux Memory Management List, Andrew Morton, Chengming Zhou,
	Nhat Pham, David Hildenbrand, Huang, Ying, Hugh Dickins,
	Johannes Weiner, Shakeel Butt, Andi Kleen, linux-kernel

On Mon, Jun 24, 2024 at 12:34:04PM -0700, Yosry Ahmed wrote:
> On Mon, Jun 24, 2024 at 12:26 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Mon, Jun 24, 2024 at 11:57:45AM -0700, Yosry Ahmed wrote:
> > > On Mon, Jun 24, 2024 at 11:56 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > >
> > > > On Mon, Jun 24, 2024 at 11:53:30AM -0700, Yosry Ahmed wrote:
> > > > > After a page is swapped out during reclaim, __remove_mapping() will
> > > > > call __delete_from_swap_cache() to replace the swap cache entry with a
> > > > > shadow entry (which is an xa_value).
> > > >
> > > > Special entries are disjoint from shadow entries.  Shadow entries have
> > > > the last two bits as 01 or 11 (are congruent to 1 or 3 modulo 4).
> > > > Special entries have values below 4096 which end in 10 (are congruent
> > > > to 2 modulo 4).
> > >
> > > You are implying that we would no longer have a shadow entry for such
> > > zero folios, because we will be storing a special entry instead.
> > > Right?
> >
> > umm ... maybe I have a misunderstanding here.
> >
> > I'm saying that there wouldn't be a _swap_ entry here because the folio
> > wouldn't be stored anywhere on the swap device.  But there could be a
> > _shadow_ entry.  Although if the page is full of zeroes, it was probably
> > never referenced and doesn't really need a shadow entry.
> 
> Is it possible to have a shadow entry AND a special entry (e.g.
> XA_ZERO_ENTRY) at the same index? This is what would be required to
> maintain the current behavior (assuming we really need the shadow
> entries for such zeroed folios).

No, just like it's not possible to have a swap entry and a shadow entry
at the same location.  You have to choose.  But the zero entry is an
alternative to the swap entry, not the shadow entry.

As I understand the swap cache, at the moment, you can have four
possible results from a lookup:

 - NULL
 - a swap entry
 - a shadow entry
 - a folio

Do I have that wrong?


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [linux-next:master] [mm] 0fa2857d23: WARNING:at_mm/page_alloc.c:#__alloc_pages_noprof
  2024-06-24 19:50               ` Matthew Wilcox
@ 2024-06-24 20:39                 ` Shakeel Butt
  2024-06-24 20:51                   ` Matthew Wilcox
  0 siblings, 1 reply; 24+ messages in thread
From: Shakeel Butt @ 2024-06-24 20:39 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Yosry Ahmed, kernel test robot, Usama Arif, oe-lkp, lkp,
	Linux Memory Management List, Andrew Morton, Chengming Zhou,
	Nhat Pham, David Hildenbrand, Huang, Ying, Hugh Dickins,
	Johannes Weiner, Andi Kleen, linux-kernel

On Mon, Jun 24, 2024 at 08:50:45PM GMT, Matthew Wilcox wrote:
> On Mon, Jun 24, 2024 at 12:34:04PM -0700, Yosry Ahmed wrote:
> > On Mon, Jun 24, 2024 at 12:26 PM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Mon, Jun 24, 2024 at 11:57:45AM -0700, Yosry Ahmed wrote:
> > > > On Mon, Jun 24, 2024 at 11:56 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > > >
> > > > > On Mon, Jun 24, 2024 at 11:53:30AM -0700, Yosry Ahmed wrote:
> > > > > > After a page is swapped out during reclaim, __remove_mapping() will
> > > > > > call __delete_from_swap_cache() to replace the swap cache entry with a
> > > > > > shadow entry (which is an xa_value).
> > > > >
> > > > > Special entries are disjoint from shadow entries.  Shadow entries have
> > > > > the last two bits as 01 or 11 (are congruent to 1 or 3 modulo 4).
> > > > > Special entries have values below 4096 which end in 10 (are congruent
> > > > > to 2 modulo 4).
> > > >
> > > > You are implying that we would no longer have a shadow entry for such
> > > > zero folios, because we will be storing a special entry instead.
> > > > Right?
> > >
> > > umm ... maybe I have a misunderstanding here.
> > >
> > > I'm saying that there wouldn't be a _swap_ entry here because the folio
> > > wouldn't be stored anywhere on the swap device.  But there could be a
> > > _shadow_ entry.  Although if the page is full of zeroes, it was probably
> > > never referenced and doesn't really need a shadow entry.
> > 
> > Is it possible to have a shadow entry AND a special entry (e.g.
> > XA_ZERO_ENTRY) at the same index? This is what would be required to
> > maintain the current behavior (assuming we really need the shadow
> > entries for such zeroed folios).
> 
> No, just like it's not possible to have a swap entry and a shadow entry
> at the same location.  You have to choose.  But the zero entry is an
> alternative to the swap entry, not the shadow entry.
> 
> As I understand the swap cache, at the moment, you can have four
> possible results from a lookup:
> 
>  - NULL
>  - a swap entry
>  - a shadow entry
>  - a folio
> 
> Do I have that wrong?

I don't think we have a swap entry in the swapcache (the underlying xarray).
The swap entry is used as an index to find the folio or shadow entry.
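
In rough C terms, a sketch of that lookup (assuming the usual
swap_address_space()/swp_offset() helpers; not a quote of any existing
function, and names may differ slightly between kernel versions):

/*
 * Sketch: the swap entry only provides the index; the xarray slot
 * itself holds NULL, a shadow (xa_value) entry, or a folio pointer.
 */
struct address_space *mapping = swap_address_space(entry);
void *slot = xa_load(&mapping->i_pages, swp_offset(entry));

if (xa_is_value(slot)) {
        /* a shadow entry: workingset info left behind at eviction */
} else if (slot) {
        /* a struct folio * that is still in the swap cache */
} else {
        /* nothing: never cached, or the shadow was already pruned */
}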


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [linux-next:master] [mm] 0fa2857d23: WARNING:at_mm/page_alloc.c:#__alloc_pages_noprof
  2024-06-24 20:39                 ` Shakeel Butt
@ 2024-06-24 20:51                   ` Matthew Wilcox
  2024-06-24 21:02                     ` Shakeel Butt
  0 siblings, 1 reply; 24+ messages in thread
From: Matthew Wilcox @ 2024-06-24 20:51 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Yosry Ahmed, kernel test robot, Usama Arif, oe-lkp, lkp,
	Linux Memory Management List, Andrew Morton, Chengming Zhou,
	Nhat Pham, David Hildenbrand, Huang, Ying, Hugh Dickins,
	Johannes Weiner, Andi Kleen, linux-kernel

On Mon, Jun 24, 2024 at 01:39:45PM -0700, Shakeel Butt wrote:
> On Mon, Jun 24, 2024 at 08:50:45PM GMT, Matthew Wilcox wrote:
> > On Mon, Jun 24, 2024 at 12:34:04PM -0700, Yosry Ahmed wrote:
> > > On Mon, Jun 24, 2024 at 12:26 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > >
> > > > On Mon, Jun 24, 2024 at 11:57:45AM -0700, Yosry Ahmed wrote:
> > > > > On Mon, Jun 24, 2024 at 11:56 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > >
> > > > > > On Mon, Jun 24, 2024 at 11:53:30AM -0700, Yosry Ahmed wrote:
> > > > > > > After a page is swapped out during reclaim, __remove_mapping() will
> > > > > > > call __delete_from_swap_cache() to replace the swap cache entry with a
> > > > > > > shadow entry (which is an xa_value).
> > > > > >
> > > > > > Special entries are disjoint from shadow entries.  Shadow entries have
> > > > > > the last two bits as 01 or 11 (are congruent to 1 or 3 modulo 4).
> > > > > > Special entries have values below 4096 which end in 10 (are congruent
> > > > > > to 2 modulo 4).
> > > > >
> > > > > You are implying that we would no longer have a shadow entry for such
> > > > > zero folios, because we will be storing a special entry instead.
> > > > > Right?
> > > >
> > > > umm ... maybe I have a misunderstanding here.
> > > >
> > > > I'm saying that there wouldn't be a _swap_ entry here because the folio
> > > > wouldn't be stored anywhere on the swap device.  But there could be a
> > > > _shadow_ entry.  Although if the page is full of zeroes, it was probably
> > > > never referenced and doesn't really need a shadow entry.
> > > 
> > > Is it possible to have a shadow entry AND a special entry (e.g.
> > > XA_ZERO_ENTRY) at the same index? This is what would be required to
> > > maintain the current behavior (assuming we really need the shadow
> > > entries for such zeroed folios).
> > 
> > No, just like it's not possible to have a swap entry and a shadow entry
> > at the same location.  You have to choose.  But the zero entry is an
> > alternative to the swap entry, not the shadow entry.
> > 
> > As I understand the swap cache, at the moment, you can have four
> > possible results from a lookup:
> > 
> >  - NULL
> >  - a swap entry
> >  - a shadow entry
> >  - a folio
> > 
> > Do I have that wrong?
> 
> I don't think we have a swap entry in the swapcache (the underlying xarray).
> The swap entry is used as an index to find the folio or shadow entry.

Ah.  I think I understand the procedure now.

We store a swap entry in the page table entry.  That tells us both where
in the swap cache the folio might be found, and where in the swap device
the data can be found (because there is a very simple calculation for
both).  If the folio is not present, then there's a shadow entry which
summarises the LRU information that would be stored in the folio had it
not been evicted from the swapcache.

We can't know at the point where we unmap the page whether it's full
of zeroes or not, because we can't afford to scan its contents.  At the
point where we decide to swap out the folio, we can afford to make that
decision because the cost of doing the I/O is high enough.

So the question is whether we can afford to throw away the shadow
information and just store the information that this was a zero entry.
I think we can, but it is a more bold proposal than I realised I was
making.
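
Spelling out the "very simple calculation for both" as a sketch (assuming
the usual swp_entry_t helpers from swapops.h; this is a fragment, with pte
taken as given):

/*
 * Sketch: a non-present PTE carries a swap entry, i.e. a (type, offset)
 * pair.  The offset names both the slot on that swap device and the
 * index into the device's swap cache xarray, where a shadow entry may
 * be waiting if the folio has already been evicted.
 */
swp_entry_t entry = pte_to_swp_entry(pte);
unsigned int type = swp_type(entry);    /* which swap device */
pgoff_t offset = swp_offset(entry);     /* slot on device == cache index */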


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [linux-next:master] [mm] 0fa2857d23: WARNING:at_mm/page_alloc.c:#__alloc_pages_noprof
  2024-06-24 20:51                   ` Matthew Wilcox
@ 2024-06-24 21:02                     ` Shakeel Butt
  0 siblings, 0 replies; 24+ messages in thread
From: Shakeel Butt @ 2024-06-24 21:02 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Yosry Ahmed, kernel test robot, Usama Arif, oe-lkp, lkp,
	Linux Memory Management List, Andrew Morton, Chengming Zhou,
	Nhat Pham, David Hildenbrand, Huang, Ying, Hugh Dickins,
	Johannes Weiner, Andi Kleen, linux-kernel

On Mon, Jun 24, 2024 at 09:51:33PM GMT, Matthew Wilcox wrote:
> On Mon, Jun 24, 2024 at 01:39:45PM -0700, Shakeel Butt wrote:
> > On Mon, Jun 24, 2024 at 08:50:45PM GMT, Matthew Wilcox wrote:
> > > On Mon, Jun 24, 2024 at 12:34:04PM -0700, Yosry Ahmed wrote:
> > > > On Mon, Jun 24, 2024 at 12:26 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > >
> > > > > On Mon, Jun 24, 2024 at 11:57:45AM -0700, Yosry Ahmed wrote:
> > > > > > On Mon, Jun 24, 2024 at 11:56 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > > >
> > > > > > > On Mon, Jun 24, 2024 at 11:53:30AM -0700, Yosry Ahmed wrote:
> > > > > > > > After a page is swapped out during reclaim, __remove_mapping() will
> > > > > > > > call __delete_from_swap_cache() to replace the swap cache entry with a
> > > > > > > > shadow entry (which is an xa_value).
> > > > > > >
> > > > > > > Special entries are disjoint from shadow entries.  Shadow entries have
> > > > > > > the last two bits as 01 or 11 (are congruent to 1 or 3 modulo 4).
> > > > > > > Special entries have values below 4096 which end in 10 (are congruent
> > > > > > > to 2 modulo 4).
> > > > > >
> > > > > > You are implying that we would no longer have a shadow entry for such
> > > > > > zero folios, because we will be storing a special entry instead.
> > > > > > Right?
> > > > >
> > > > > umm ... maybe I have a misunderstanding here.
> > > > >
> > > > > I'm saying that there wouldn't be a _swap_ entry here because the folio
> > > > > wouldn't be stored anywhere on the swap device.  But there could be a
> > > > > _shadow_ entry.  Although if the page is full of zeroes, it was probably
> > > > > never referenced and doesn't really need a shadow entry.
> > > > 
> > > > Is it possible to have a shadow entry AND a special entry (e.g.
> > > > XA_ZERO_ENTRY) at the same index? This is what would be required to
> > > > maintain the current behavior (assuming we really need the shadow
> > > > entries for such zeroed folios).
> > > 
> > > No, just like it's not possible to have a swap entry and a shadow entry
> > > at the same location.  You have to choose.  But the zero entry is an
> > > alternative to the swap entry, not the shadow entry.
> > > 
> > > As I understand the swap cache, at the moment, you can have four
> > > possible results from a lookup:
> > > 
> > >  - NULL
> > >  - a swap entry
> > >  - a shadow entry
> > >  - a folio
> > > 
> > > Do I have that wrong?
> > 
> > I don't think we have a swap entry in the swapcache (the underlying xarray).
> > The swap entry is used as an index to find the folio or shadow entry.
> 
> Ah.  I think I understand the procedure now.
> 
> We store a swap entry in the page table entry.  That tells us both where
> in the swap cache the folio might be found, and where in the swap device
> the data can be found (because there is a very simple calculation for
> both).  If the folio is not present, then there's a shadow entry which
> summarises the LRU information that would be stored in the folio had it
> not been evicted from the swapcache.
> 
> We can't know at the point where we unmap the page whether it's full
> of zeroes or not, because we can't afford to scan its contents.  At the
> point where we decide to swap out the folio, we can afford to make that
> decision because the cost of doing the I/O is high enough.
> 
> So the question is whether we can afford to throw away the shadow
> information and just store the information that this was a zero entry.
> I think we can, but it is a more bold proposal than I realised I was
> making.

I agree that we can throw away the shadow entry in favor of the zero entry
but, as you already noted, it requires changes in multiple places. At the
moment I can think of:

1. The zero entry is not reclaimable the way a shadow entry is.
2. We need to decide the right place to allocate the zero folio on swapin.
3. Should this be treated as a major fault for stats purposes?

I have certainly missed other points as well.
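
For point 2 in particular, a purely hypothetical sketch of what the swapin
side might look like (zero_marker_swapin() does not exist; where to hook it
in and the fault accounting of point 3 are exactly the open issues above):

/* Hypothetical sketch only: handling a zero marker found at swapin. */
static vm_fault_t zero_marker_swapin(struct vm_fault *vmf)
{
        struct folio *folio = folio_alloc(GFP_HIGHUSER_MOVABLE, 0);

        if (!folio)
                return VM_FAULT_OOM;
        folio_zero_range(folio, 0, folio_size(folio));
        /* point 2: where to install it (page tables, swap cache?)    */
        /* point 3: count it as a major fault, or as a minor one?     */
        return 0;
}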


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [linux-next:master] [mm] 0fa2857d23: WARNING:at_mm/page_alloc.c:#__alloc_pages_noprof
  2024-06-24 18:26               ` Usama Arif
@ 2024-06-27 11:05                 ` Usama Arif
  0 siblings, 0 replies; 24+ messages in thread
From: Usama Arif @ 2024-06-27 11:05 UTC (permalink / raw)
  To: Yosry Ahmed, Hugh Dickins, Yury Norov, Rasmus Villemoes
  Cc: kernel test robot, oe-lkp, lkp, Linux Memory Management List,
	Andrew Morton, Chengming Zhou, Nhat Pham, David Hildenbrand,
	Huang, Ying, Johannes Weiner, Matthew Wilcox, Shakeel Butt,
	Andi Kleen, linux-kernel, Vlastimil Babka (SUSE)


On 24/06/2024 21:26, Usama Arif wrote:
>
> On 24/06/2024 20:31, Yosry Ahmed wrote:
>> On Mon, Jun 24, 2024 at 10:26 AM Usama Arif <usamaarif642@gmail.com> wrote:
>>>
>>> On 24/06/2024 19:56, Yosry Ahmed wrote:
>>>> [..]
>>>>>>> -       p->zeromap = bitmap_zalloc(maxpages, GFP_KERNEL);
>>>>>>> +       p->zeromap = kvzalloc(DIV_ROUND_UP(maxpages, 8), GFP_KERNEL);
>>>>>> No, 8 is not right for 32-bit kernels. I think you want
>>>>>>         p->zeromap = kvzalloc(BITS_TO_LONGS(maxpages), GFP_KERNEL);
>>>>>> but please check it carefully, I'm easily confused by such conversions.
>>>>>>
>>>>>> Hugh
>>>>> Ah yes, I didn't take 32-bit kernels into account. I think it's supposed to be
>>>>>
>>>>>     p->zeromap = kvzalloc(BITS_TO_LONGS(maxpages) * sizeof(unsigned long), GFP_KERNEL);
>>>> You can do something similar to bitmap_zalloc() and use:
>>>>
>>>> kvmalloc_array(BITS_TO_LONGS(nbits), sizeof(unsigned long), GFP_KERNEL | __GFP_ZERO)
>>>>
>>>> I don't see a kvzalloc_array() variant to use directly, but it should
>>>> be trivial to add it. I can see other users of kvmalloc_array() that
>>>> pass in __GFP_ZERO (e.g. fs/ntfs3/bitmap.c).
>>>>
>>>> Or you could take it a step further and add bitmap_kvzalloc(),
>>>> assuming the maintainers are open to that.
>>> Thanks! bitmap_kvzalloc makes the most sense to me. It doesn't make
>>> sense for bitmaps to be limited to MAX_PAGE_ORDER size. I can add this
>>> patch below at the start of the series and use it in the patch for the
>>> zeropage swap optimization.
>>>
>>>
>>>       bitmap: add support for virtually contiguous bitmap
>>>
>>>       The current bitmap_zalloc API limits the allocation to MAX_PAGE_ORDER,
>>>       which prevents larger order bitmap allocations. Introduce
>>>       bitmap_kvzalloc, which allows larger bitmap allocations.
>>>       kvmalloc_array still attempts to allocate physically contiguous memory,
>>>       but upon failure falls back to a non-contiguous (vmalloc) allocation.
>>>
>>>       Suggested-by: Yosry Ahmed <yosryahmed@google.com>
>>>       Signed-off-by: Usama Arif <usamaarif642@gmail.com>
>>>
>> LGTM with a small fix below.
>>
>>> diff --git a/include/linux/bitmap.h b/include/linux/bitmap.h
>>> index 8c4768c44a01..881c2ff2e834 100644
>>> --- a/include/linux/bitmap.h
>>> +++ b/include/linux/bitmap.h
>>> @@ -131,9 +131,11 @@ struct device;
>>>     */
>>>    unsigned long *bitmap_alloc(unsigned int nbits, gfp_t flags);
>>>    unsigned long *bitmap_zalloc(unsigned int nbits, gfp_t flags);
>>> +unsigned long *bitmap_kvzalloc(unsigned int nbits, gfp_t flags);
>>>    unsigned long *bitmap_alloc_node(unsigned int nbits, gfp_t flags, int node);
>>>    unsigned long *bitmap_zalloc_node(unsigned int nbits, gfp_t flags, int node);
>>>    void bitmap_free(const unsigned long *bitmap);
>>> +void bitmap_kvfree(const unsigned long *bitmap);
>>>
>>>    DEFINE_FREE(bitmap, unsigned long *, if (_T) bitmap_free(_T))
>>>
>>> diff --git a/lib/bitmap.c b/lib/bitmap.c
>>> index b97692854966..eabbfb85fb45 100644
>>> --- a/lib/bitmap.c
>>> +++ b/lib/bitmap.c
>>> @@ -727,6 +727,13 @@ unsigned long *bitmap_zalloc(unsigned int nbits, gfp_t flags)
>>>    }
>>>    EXPORT_SYMBOL(bitmap_zalloc);
>>>
>>> +unsigned long *bitmap_kvzalloc(unsigned int nbits, gfp_t flags)
>>> +{
>>> +       return kvmalloc_array(BITS_TO_LONGS(nbits), sizeof(unsigned long),
>>> +                             flags | __GFP_ZERO);
>>> +}
>>> +EXPORT_SYMBOL(bitmap_zalloc);
>> EXPORT_SYMBOL(bitmap_kvzalloc)*
>
>
> Actually, does it make more sense to change the behaviour of the current
> APIs as below, instead of the patch above? Or is there an expectation that
> the current bitmap API is supposed to work only on physically contiguous
> bits?
>
> I believe that in the kernel, if an allocation/free starts with 'k' it is
> physically contiguous, and with "kv" it is physically contiguous if
> possible, otherwise virtually contiguous. The bitmap functions have
> neither prefix, so we could change the current implementation. I believe
> it would not impact the current users of these functions, as the first
> attempt is still physically contiguous, which is how it works today; only
> upon failure would it fall back to virtually contiguous memory, and it
> would let the existing bitmap API be used for allocations larger than
> MAX_PAGE_ORDER.
>
> Yury Norov and Rasmus Villemoes, any views on this?
>
> Thanks
>
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index 7247e217e21b..ad771dc81afa 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -804,6 +804,7 @@ kvmalloc_array_node_noprof(size_t n, size_t size, gfp_t flags, int node)
>  #define kvcalloc_node_noprof(_n,_s,_f,_node) kvmalloc_array_node_noprof(_n,_s,(_f)|__GFP_ZERO,_node)
>  #define kvcalloc_noprof(...) kvcalloc_node_noprof(__VA_ARGS__, NUMA_NO_NODE)
>
> +#define kvmalloc_array_node(...) alloc_hooks(kvmalloc_array_node_noprof(__VA_ARGS__))
>  #define kvmalloc_array(...) alloc_hooks(kvmalloc_array_noprof(__VA_ARGS__))
>  #define kvcalloc_node(...) alloc_hooks(kvcalloc_node_noprof(__VA_ARGS__))
>  #define kvcalloc(...) alloc_hooks(kvcalloc_noprof(__VA_ARGS__))
> diff --git a/lib/bitmap.c b/lib/bitmap.c
> index b97692854966..272164dcbef1 100644
> --- a/lib/bitmap.c
> +++ b/lib/bitmap.c
> @@ -716,7 +716,7 @@ void bitmap_fold(unsigned long *dst, const unsigned long *orig,
>
>  unsigned long *bitmap_alloc(unsigned int nbits, gfp_t flags)
>  {
> -       return kmalloc_array(BITS_TO_LONGS(nbits), sizeof(unsigned long),
> +       return kvmalloc_array(BITS_TO_LONGS(nbits), sizeof(unsigned long),
>                              flags);
>  }
>  EXPORT_SYMBOL(bitmap_alloc);
> @@ -729,7 +729,7 @@ EXPORT_SYMBOL(bitmap_zalloc);
>
>  unsigned long *bitmap_alloc_node(unsigned int nbits, gfp_t flags, int node)
>  {
> -       return kmalloc_array_node(BITS_TO_LONGS(nbits), sizeof(unsigned long),
> +       return kvmalloc_array_node(BITS_TO_LONGS(nbits), sizeof(unsigned long),
>                                   flags, node);
>  }
>  EXPORT_SYMBOL(bitmap_alloc_node);
> @@ -742,7 +742,7 @@ EXPORT_SYMBOL(bitmap_zalloc_node);
>
>  void bitmap_free(const unsigned long *bitmap)
>  {
> -       kfree(bitmap);
> +       kvfree(bitmap);
>  }
>  EXPORT_SYMBOL(bitmap_free);
>

I decided to go with simply using kvmalloc_array with __GFP_ZERO for v7 [1],
instead of adding a new bitmap API or changing the existing API to
kvmalloc/kvfree, as I didn't want to make this series dependent on bitmap
API changes, and there are other places (like ceph and ntfs3) where this is
done with kvmalloc_array directly. I am happy to send a follow-up patch
after this series that changes the existing API to be kv, if that's
something the bitmap maintainers think makes sense.

[1] https://lore.kernel.org/all/20240627105730.3110705-1-usamaarif642@gmail.com/
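
For reference, a sketch of the allocation shape described above (not a
verbatim quote from the posted v7):

/*
 * Sketch: sized in longs (so it stays correct on 32-bit), zero-filled,
 * physically contiguous if possible with a vmalloc fallback, and freed
 * with kvfree() rather than bitmap_free().
 */
p->zeromap = kvmalloc_array(BITS_TO_LONGS(maxpages), sizeof(unsigned long),
                            GFP_KERNEL | __GFP_ZERO);
if (!p->zeromap)
        return -ENOMEM;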



^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2024-06-27 11:05 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-06-24  8:49 [linux-next:master] [mm] 0fa2857d23: WARNING:at_mm/page_alloc.c:#__alloc_pages_noprof kernel test robot
2024-06-24 12:05 ` Yosry Ahmed
2024-06-24 13:06   ` Usama Arif
2024-06-24 15:26     ` Hugh Dickins
2024-06-24 15:39       ` Usama Arif
2024-06-24 15:55         ` Hugh Dickins
2024-06-24 16:56         ` Yosry Ahmed
2024-06-24 17:26           ` Usama Arif
2024-06-24 17:31             ` Yosry Ahmed
2024-06-24 18:26               ` Usama Arif
2024-06-27 11:05                 ` Usama Arif
2024-06-24 18:33   ` Matthew Wilcox
2024-06-24 18:50     ` Usama Arif
2024-06-24 18:53       ` Yosry Ahmed
2024-06-24 18:54       ` Matthew Wilcox
2024-06-24 18:53     ` Yosry Ahmed
2024-06-24 18:56       ` Matthew Wilcox
2024-06-24 18:57         ` Yosry Ahmed
2024-06-24 19:26           ` Matthew Wilcox
2024-06-24 19:34             ` Yosry Ahmed
2024-06-24 19:50               ` Matthew Wilcox
2024-06-24 20:39                 ` Shakeel Butt
2024-06-24 20:51                   ` Matthew Wilcox
2024-06-24 21:02                     ` Shakeel Butt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).