Re: Endless calls to xas_split_alloc() due to corrupted xarray entry

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Re: Endless calls to xas_split_alloc() due to corrupted xarray entry
       [not found] ` <ZRFbIJH47RkQuDid@debian.me>
@ 2023-09-25 15:12   ` Darrick J. Wong
  2023-09-26  7:49     ` Zhenyu Zhang
  2023-09-29 19:17   ` Matthew Wilcox
  1 sibling, 1 reply; 11+ messages in thread
From: Darrick J. Wong @ 2023-09-25 15:12 UTC (permalink / raw)
  To: Bagas Sanjaya
  Cc: Zhenyu Zhang, Linux XFS, Linux Filesystems Development,
	Linux Kernel Mailing List, Guowen Shan, Shaoqin Huang,
	Matthew Wilcox, Chandan Babu R, Andrew Morton, Linus Torvalds

On Mon, Sep 25, 2023 at 05:04:16PM +0700, Bagas Sanjaya wrote:
> On Fri, Sep 22, 2023 at 11:56:43AM +0800, Zhenyu Zhang wrote:
> > Hi all,
> > 
> > we don't know how the xarray entry was corrupted. Maybe it's a known
> > issue to community.
> > Lets see.
> > 
> > Contents
> > --------
> > 1. Problem Statement
> > 2. The call trace
> > 3. The captured data by bpftrace
> > 
> > 
> > 1. Problem Statement
> > --------------------
> > With 4k guest and 64k host, on aarch64(Ampere's Altra Max CPU) hit Call trace:
> >     Steps:
> >     1) System setup hugepages on host.
> >        # echo 60 > /proc/sys/vm/nr_hugepages
> >     2) Mount this hugepage to /mnt/kvm_hugepage.
> >        # mount -t hugetlbfs -o pagesize=524288K none /mnt/kvm_hugepage
> 
> What block device/disk image you use to format the filesystem?
> 
> >     3) HugePages didn't leak when using non-existent mem-path.
> >        # mkdir -p /mnt/tmp
> >     4) Boot guest.
> >        # /usr/libexec/qemu-kvm \
> > ...
> >          -m 30720 \
> > -object '{"size": 32212254720, "mem-path": "/mnt/tmp", "qom-type":
> > "memory-backend-file"}'  \
> > -smp 4,maxcpus=4,cores=2,threads=1,clusters=1,sockets=2  \
> >          -blockdev '{"node-name": "file_image1", "driver": "file",
> > "auto-read-only": true, "discard": "unmap", "aio": "threads",
> > "filename": "/home/kvm_autotest_root/images/back_up_4k.qcow2",
> > "cache": {"direct": true, "no-flush": false}}' \
> > -blockdev '{"node-name": "drive_image1", "driver": "qcow2",
> > "read-only": false, "cache": {"direct": true, "no-flush": false},
> > "file": "file_image1"}' \
> > -device '{"driver": "scsi-hd", "id": "image1", "drive":
> > "drive_image1", "write-cache": "on"}' \
> > 
> >     5) Wait about 1 minute ------> hit Call trace
> > 
> > 2. The call trace
> > --------------------
> > [   14.982751] block dm-0: the capability attribute has been deprecated.
> > [   15.690043] PEFILE: Unsigned PE binary
> > 
> > 
> > [   90.135676] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> > [   90.136629] rcu: 3-...0: (3 ticks this GP)
> > idle=e6ec/1/0x4000000000000000 softirq=6847/6849 fqs=232
> > [   90.137293] rcu: (detected by 2, t=6012 jiffies, g=2085, q=2539 ncpus=4)
> > [   90.137796] Task dump for CPU 3:
> > [   90.138037] task:PK-Backend      state:R  running task     stack:0
> >    pid:2287  ppid:1      flags:0x00000202
> > [   90.138757] Call trace:
> > [   90.138940]  __switch_to+0xc8/0x110
> > [   90.139203]  0xb54a54f8c5fb0700
> > 
> > [  270.190849] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> > [  270.191722] rcu: 3-...0: (3 ticks this GP)
> > idle=e6ec/1/0x4000000000000000 softirq=6847/6849 fqs=1020
> > [  270.192405] rcu: (detected by 1, t=24018 jiffies, g=2085, q=3104 ncpus=4)
> > [  270.192876] Task dump for CPU 3:
> > [  270.193099] task:PK-Backend      state:R  running task     stack:0
> >    pid:2287  ppid:1      flags:0x00000202
> > [  270.193774] Call trace:
> > [  270.193946]  __switch_to+0xc8/0x110
> > [  270.194336]  0xb54a54f8c5fb0700
> > 
> > [ 1228.068406] ------------[ cut here ]------------
> > [ 1228.073011] WARNING: CPU: 2 PID: 4496 at lib/xarray.c:1010
> > xas_split_alloc+0xf8/0x128
> > [ 1228.080828] Modules linked in: binfmt_misc vhost_net vhost
> > vhost_iotlb tap xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT
> > nf_reject_ipv4 nft_compat nft_chain_nat nf_nat nf_conntrack
> > nf_defrag_ipv6 nf_defrag_ipv4 nf_tables nfnetlink tun bridge stp llc
> > qrtr rfkill sunrpc vfat fat acpi_ipmi ipmi_ssif arm_spe_pmu
> > ipmi_devintf arm_cmn arm_dmc620_pmu ipmi_msghandler cppc_cpufreq
> > arm_dsu_pmu xfs libcrc32c ast drm_shmem_helper drm_kms_helper drm
> > crct10dif_ce ghash_ce igb nvme sha2_ce nvme_core sha256_arm64 sha1_ce
> > i2c_designware_platform sbsa_gwdt nvme_common i2c_algo_bit
> > i2c_designware_core xgene_hwmon dm_mirror dm_region_hash dm_log dm_mod
> > fuse
> > [ 1228.137630] CPU: 2 PID: 4496 Comm: qemu-kvm Kdump: loaded Tainted:
> > G        W          6.6.0-rc2-zhenyzha+ #5

Please try 6.6-rc3, which doesn't have broken bit spinlocks (and hence
corruption problems in the vfs) on arm64.

(See commit 6d2779ecaeb56f9 "locking/atomic: scripts: fix fallback
ifdeffery")

--D

> > [ 1228.147529] Hardware name: GIGABYTE R152-P31-00/MP32-AR1-00, BIOS
> > F31h (SCP: 2.10.20220810) 07/27/2022
> > [ 1228.156820] pstate: 80400009 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > [ 1228.163767] pc : xas_split_alloc+0xf8/0x128
> > [ 1228.167938] lr : __filemap_add_folio+0x33c/0x4e0
> > [ 1228.172543] sp : ffff80008dd4f1c0
> > [ 1228.175844] x29: ffff80008dd4f1c0 x28: ffffd15825388c40 x27: 0000000000000001
> > [ 1228.182967] x26: 0000000000000001 x25: ffffffffffffc005 x24: 0000000000000000
> > [ 1228.190089] x23: ffff80008dd4f270 x22: ffffffc202b00000 x21: 0000000000000000
> > [ 1228.197211] x20: ffffffc2007f9600 x19: 000000000000000d x18: 0000000000000014
> > [ 1228.204334] x17: 00000000b21b8a3f x16: 0000000013a8aa94 x15: ffffd15824625944
> > [ 1228.211456] x14: ffffffffffffffff x13: 0000000000000030 x12: 0101010101010101
> > [ 1228.218578] x11: 7f7f7f7f7f7f7f7f x10: 000000000000000a x9 : ffffd158252dd3fc
> > [ 1228.225701] x8 : ffff80008dd4f1c0 x7 : ffff07ffa0945468 x6 : ffff80008dd4f1c0
> > [ 1228.232823] x5 : 0000000000000018 x4 : 0000000000000000 x3 : 0000000000012c40
> > [ 1228.239945] x2 : 000000000000000d x1 : 000000000000000c x0 : 0000000000000000
> > [ 1228.247067] Call trace:
> > [ 1228.249500]  xas_split_alloc+0xf8/0x128
> > [ 1228.253324]  __filemap_add_folio+0x33c/0x4e0
> > [ 1228.257582]  filemap_add_folio+0x48/0xd0
> > [ 1228.261493]  page_cache_ra_order+0x214/0x310
> > [ 1228.265750]  ondemand_readahead+0x1a8/0x320
> > [ 1228.269921]  page_cache_async_ra+0x64/0xa8
> > [ 1228.274005]  filemap_fault+0x238/0xaa8
> > [ 1228.277742]  __xfs_filemap_fault+0x60/0x3c0 [xfs]
> > [ 1228.282491]  xfs_filemap_fault+0x54/0x68 [xfs]
> > [ 1228.286979]  __do_fault+0x40/0x210
> > [ 1228.290368]  do_cow_fault+0xf0/0x300
> > [ 1228.293931]  do_pte_missing+0x140/0x238
> > [ 1228.297754]  handle_pte_fault+0x100/0x160
> > [ 1228.301751]  __handle_mm_fault+0x100/0x310
> > [ 1228.305835]  handle_mm_fault+0x6c/0x270
> > [ 1228.309659]  faultin_page+0x70/0x128
> > [ 1228.313221]  __get_user_pages+0xc8/0x2d8
> > [ 1228.317131]  get_user_pages_unlocked+0xc4/0x3b8
> > [ 1228.321648]  hva_to_pfn+0xf8/0x468
> > [ 1228.325037]  __gfn_to_pfn_memslot+0xa4/0xf8
> > [ 1228.329207]  user_mem_abort+0x174/0x7e8
> > [ 1228.333031]  kvm_handle_guest_abort+0x2dc/0x450
> > [ 1228.337548]  handle_exit+0x70/0x1c8
> > [ 1228.341024]  kvm_arch_vcpu_ioctl_run+0x224/0x5b8
> > [ 1228.345628]  kvm_vcpu_ioctl+0x28c/0x9d0
> > [ 1228.349450]  __arm64_sys_ioctl+0xa8/0xf0
> > [ 1228.353360]  invoke_syscall.constprop.0+0x7c/0xd0
> > [ 1228.358052]  do_el0_svc+0xb4/0xd0
> > [ 1228.361354]  el0_svc+0x50/0x228
> > [ 1228.364484]  el0t_64_sync_handler+0x134/0x150
> > [ 1228.368827]  el0t_64_sync+0x17c/0x180
> > [ 1228.372476] ---[ end trace 0000000000000000 ]---
> > [ 1228.377124] ------------[ cut here ]------------
> > [ 1228.381728] WARNING: CPU: 2 PID: 4496 at lib/xarray.c:1010
> > xas_split_alloc+0xf8/0x128
> > [ 1228.389546] Modules linked in: binfmt_misc vhost_net vhost
> > vhost_iotlb tap xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT
> > nf_reject_ipv4 nft_compat nft_chain_nat nf_nat nf_conntrack
> > nf_defrag_ipv6 nf_defrag_ipv4 nf_tables nfnetlink tun bridge stp llc
> > qrtr rfkill sunrpc vfat fat acpi_ipmi ipmi_ssif arm_spe_pmu
> > ipmi_devintf arm_cmn arm_dmc620_pmu ipmi_msghandler cppc_cpufreq
> > arm_dsu_pmu xfs libcrc32c ast drm_shmem_helper drm_kms_helper drm
> > crct10dif_ce ghash_ce igb nvme sha2_ce nvme_core sha256_arm64 sha1_ce
> > i2c_designware_platform sbsa_gwdt nvme_common i2c_algo_bit
> > i2c_designware_core xgene_hwmon dm_mirror dm_region_hash dm_log dm_mod
> > fuse
> > [ 1228.446348] CPU: 2 PID: 4496 Comm: qemu-kvm Kdump: loaded Tainted:
> > G        W          6.6.0-rc2-zhenyzha+ #5
> > [ 1228.456248] Hardware name: GIGABYTE R152-P31-00/MP32-AR1-00, BIOS
> > F31h (SCP: 2.10.20220810) 07/27/2022
> > [ 1228.465538] pstate: 80400009 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > [ 1228.472486] pc : xas_split_alloc+0xf8/0x128
> > [ 1228.476656] lr : __filemap_add_folio+0x33c/0x4e0
> > [ 1228.481261] sp : ffff80008dd4f1c0
> > [ 1228.484563] x29: ffff80008dd4f1c0 x28: ffffd15825388c40 x27: 0000000000000001
> > [ 1228.491685] x26: 0000000000000001 x25: ffffffffffffc005 x24: 0000000000000000
> > [ 1228.498807] x23: ffff80008dd4f270 x22: ffffffc202b00000 x21: 0000000000000000
> > [ 1228.505930] x20: ffffffc2007f9600 x19: 000000000000000d x18: 0000000000000014
> > [ 1228.513052] x17: 00000000b21b8a3f x16: 0000000013a8aa94 x15: ffffd15824625944
> > [ 1228.520174] x14: ffffffffffffffff x13: 0000000000000030 x12: 0101010101010101
> > [ 1228.527297] x11: 7f7f7f7f7f7f7f7f x10: 000000000000000a x9 : ffffd158252dd3fc
> > [ 1228.534419] x8 : ffff80008dd4f1c0 x7 : ffff07ffa0945468 x6 : ffff80008dd4f1c0
> > [ 1228.541542] x5 : 0000000000000018 x4 : 0000000000000000 x3 : 0000000000012c40
> > [ 1228.548664] x2 : 000000000000000d x1 : 000000000000000c x0 : 0000000000000000
> > [ 1228.555786] Call trace:
> > [ 1228.558220]  xas_split_alloc+0xf8/0x128
> > [ 1228.562043]  __filemap_add_folio+0x33c/0x4e0
> > [ 1228.566300]  filemap_add_folio+0x48/0xd0
> > [ 1228.570211]  page_cache_ra_order+0x214/0x310
> > [ 1228.574469]  ondemand_readahead+0x1a8/0x320
> > [ 1228.578639]  page_cache_async_ra+0x64/0xa8
> > [ 1228.582724]  filemap_fault+0x238/0xaa8
> > [ 1228.586460]  __xfs_filemap_fault+0x60/0x3c0 [xfs]
> > [ 1228.591210]  xfs_filemap_fault+0x54/0x68 [xfs]
> > 
> > 
> > 
> > 3. The captured data by bpftrace
> > (The following part is the crawl analysis of gshan@redhat.com )
> > --------------------
> > pid:  4475    task: qemu-kvm
> > file: /mnt/tmp/qemu_back_mem.mem-machine_mem.OdGYet (deleted)
> > 
> > -------------------- inode --------------------
> > i_flags:               0x0
> > i_ino:                 67333199
> > i_size:                32212254720
> > 
> > ----------------- address_space ----------------
> > flags:                 040
> > invalidate_lock
> >   count:               256
> >   owner:               0xffff07fff6e759c1
> >     pid: 4496  task: qemu-kvm
> >   wait_list.next:      0xffff07ffa20422e0
> >   wait_list.prev:      0xffff07ffa20422e0
> > 
> > -------------------- xarray --------------------
> > entry[0]:       0xffff080f7eda0002
> > shift:          18
> > offset:         0
> > count:          2
> > nr_values:      0
> > parent:         0x0
> > slots[00]:      0xffff07ffa094546a
> > slots[01]:      0xffff07ffa1b09b22
> > 
> > entry[1]:       0xffff07ffa094546a
> > shift:          12
> > offset:         0
> > count:          20
> > nr_values:      0
> > parent:         0xffff080f7eda0000
> > slots[00]:      0xffffffc202880000
> > slots[01]:      0x2
> > 
> > entry[2]:       0xffffffc202880000
> > shift:          104
> > offset:         128
> > count:          0
> > nr_values:      0
> > parent:         0xffffffc20304c888
> > slots[00]:      0xffff08009a960000
> > slots[01]:      0x2001ffffffff
> > 
> > It seems the last xarray entry ("entry[2]") has been corrupted. "shift"
> > becomes 104 and "offset" becomes 128, which isn't reasonable.
> > It's explaining why we run into xas_split_alloc() in __filemap_add_folio()
> > 
> >         if (order > folio_order(folio))
> >                 xas_split_alloc(&xas, xa_load(xas.xa, xas.xa_index),
> >                                 order, gfp);
> > 
> > folio_order(folio) is likely 6 since the readahead window size on the BDI device
> > is 4MB.
> > However, @order figured from the corrupted xarray entry is much larger than 6.
> > log2(0x400000 / 0x10000) = log2(64) = 6
> > 
> > [root@ampere-mtsnow-altramax-28 ~]# uname -r
> > 6.6.0-rc2-zhenyzha+
> 
> What commit/tree?
> 
> > [root@ampere-mtsnow-altramax-28 ~]# cat
> > /sys/devices/virtual/bdi/253:0/read_ahead_kb
> > 4096
> > 
> 
> I'm confused...
> 
> -- 
> An old man doll... just what I always wanted! - Clara



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Endless calls to xas_split_alloc() due to corrupted xarray entry
  2023-09-25 15:12   ` Endless calls to xas_split_alloc() due to corrupted xarray entry Darrick J. Wong
@ 2023-09-26  7:49     ` Zhenyu Zhang
  2023-09-29 10:11       ` Gavin Shan
  0 siblings, 1 reply; 11+ messages in thread
From: Zhenyu Zhang @ 2023-09-26  7:49 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Bagas Sanjaya, Linux XFS, Linux Filesystems Development,
	Linux Kernel Mailing List, Guowen Shan, Shaoqin Huang,
	Matthew Wilcox, Chandan Babu R, Andrew Morton, Linus Torvalds

Hello Darrick,

The issue gets fixed in rc3. However, it seems not caused by commit
6d2779ecaeb56f9 because I can't reproduce the issue with rc3 and
the commit revert. I'm running 'git bisect' to nail it down. Hopefully,
I can identify the problematic commit soon.

On Mon, Sep 25, 2023 at 11:14 PM Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Mon, Sep 25, 2023 at 05:04:16PM +0700, Bagas Sanjaya wrote:
> > On Fri, Sep 22, 2023 at 11:56:43AM +0800, Zhenyu Zhang wrote:
> > > Hi all,
> > >
> > > we don't know how the xarray entry was corrupted. Maybe it's a known
> > > issue to community.
> > > Lets see.
> > >
> > > Contents
> > > --------
> > > 1. Problem Statement
> > > 2. The call trace
> > > 3. The captured data by bpftrace
> > >
> > >
> > > 1. Problem Statement
> > > --------------------
> > > With 4k guest and 64k host, on aarch64(Ampere's Altra Max CPU) hit Call trace:
> > >     Steps:
> > >     1) System setup hugepages on host.
> > >        # echo 60 > /proc/sys/vm/nr_hugepages
> > >     2) Mount this hugepage to /mnt/kvm_hugepage.
> > >        # mount -t hugetlbfs -o pagesize=524288K none /mnt/kvm_hugepage
> >
> > What block device/disk image you use to format the filesystem?
> >
> > >     3) HugePages didn't leak when using non-existent mem-path.
> > >        # mkdir -p /mnt/tmp
> > >     4) Boot guest.
> > >        # /usr/libexec/qemu-kvm \
> > > ...
> > >          -m 30720 \
> > > -object '{"size": 32212254720, "mem-path": "/mnt/tmp", "qom-type":
> > > "memory-backend-file"}'  \
> > > -smp 4,maxcpus=4,cores=2,threads=1,clusters=1,sockets=2  \
> > >          -blockdev '{"node-name": "file_image1", "driver": "file",
> > > "auto-read-only": true, "discard": "unmap", "aio": "threads",
> > > "filename": "/home/kvm_autotest_root/images/back_up_4k.qcow2",
> > > "cache": {"direct": true, "no-flush": false}}' \
> > > -blockdev '{"node-name": "drive_image1", "driver": "qcow2",
> > > "read-only": false, "cache": {"direct": true, "no-flush": false},
> > > "file": "file_image1"}' \
> > > -device '{"driver": "scsi-hd", "id": "image1", "drive":
> > > "drive_image1", "write-cache": "on"}' \
> > >
> > >     5) Wait about 1 minute ------> hit Call trace
> > >
> > > 2. The call trace
> > > --------------------
> > > [   14.982751] block dm-0: the capability attribute has been deprecated.
> > > [   15.690043] PEFILE: Unsigned PE binary
> > >
> > >
> > > [   90.135676] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> > > [   90.136629] rcu: 3-...0: (3 ticks this GP)
> > > idle=e6ec/1/0x4000000000000000 softirq=6847/6849 fqs=232
> > > [   90.137293] rcu: (detected by 2, t=6012 jiffies, g=2085, q=2539 ncpus=4)
> > > [   90.137796] Task dump for CPU 3:
> > > [   90.138037] task:PK-Backend      state:R  running task     stack:0
> > >    pid:2287  ppid:1      flags:0x00000202
> > > [   90.138757] Call trace:
> > > [   90.138940]  __switch_to+0xc8/0x110
> > > [   90.139203]  0xb54a54f8c5fb0700
> > >
> > > [  270.190849] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> > > [  270.191722] rcu: 3-...0: (3 ticks this GP)
> > > idle=e6ec/1/0x4000000000000000 softirq=6847/6849 fqs=1020
> > > [  270.192405] rcu: (detected by 1, t=24018 jiffies, g=2085, q=3104 ncpus=4)
> > > [  270.192876] Task dump for CPU 3:
> > > [  270.193099] task:PK-Backend      state:R  running task     stack:0
> > >    pid:2287  ppid:1      flags:0x00000202
> > > [  270.193774] Call trace:
> > > [  270.193946]  __switch_to+0xc8/0x110
> > > [  270.194336]  0xb54a54f8c5fb0700
> > >
> > > [ 1228.068406] ------------[ cut here ]------------
> > > [ 1228.073011] WARNING: CPU: 2 PID: 4496 at lib/xarray.c:1010
> > > xas_split_alloc+0xf8/0x128
> > > [ 1228.080828] Modules linked in: binfmt_misc vhost_net vhost
> > > vhost_iotlb tap xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT
> > > nf_reject_ipv4 nft_compat nft_chain_nat nf_nat nf_conntrack
> > > nf_defrag_ipv6 nf_defrag_ipv4 nf_tables nfnetlink tun bridge stp llc
> > > qrtr rfkill sunrpc vfat fat acpi_ipmi ipmi_ssif arm_spe_pmu
> > > ipmi_devintf arm_cmn arm_dmc620_pmu ipmi_msghandler cppc_cpufreq
> > > arm_dsu_pmu xfs libcrc32c ast drm_shmem_helper drm_kms_helper drm
> > > crct10dif_ce ghash_ce igb nvme sha2_ce nvme_core sha256_arm64 sha1_ce
> > > i2c_designware_platform sbsa_gwdt nvme_common i2c_algo_bit
> > > i2c_designware_core xgene_hwmon dm_mirror dm_region_hash dm_log dm_mod
> > > fuse
> > > [ 1228.137630] CPU: 2 PID: 4496 Comm: qemu-kvm Kdump: loaded Tainted:
> > > G        W          6.6.0-rc2-zhenyzha+ #5
>
> Please try 6.6-rc3, which doesn't have broken bit spinlocks (and hence
> corruption problems in the vfs) on arm64.
>
> (See commit 6d2779ecaeb56f9 "locking/atomic: scripts: fix fallback
> ifdeffery")
>
> --D
>
> > > [ 1228.147529] Hardware name: GIGABYTE R152-P31-00/MP32-AR1-00, BIOS
> > > F31h (SCP: 2.10.20220810) 07/27/2022
> > > [ 1228.156820] pstate: 80400009 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > > [ 1228.163767] pc : xas_split_alloc+0xf8/0x128
> > > [ 1228.167938] lr : __filemap_add_folio+0x33c/0x4e0
> > > [ 1228.172543] sp : ffff80008dd4f1c0
> > > [ 1228.175844] x29: ffff80008dd4f1c0 x28: ffffd15825388c40 x27: 0000000000000001
> > > [ 1228.182967] x26: 0000000000000001 x25: ffffffffffffc005 x24: 0000000000000000
> > > [ 1228.190089] x23: ffff80008dd4f270 x22: ffffffc202b00000 x21: 0000000000000000
> > > [ 1228.197211] x20: ffffffc2007f9600 x19: 000000000000000d x18: 0000000000000014
> > > [ 1228.204334] x17: 00000000b21b8a3f x16: 0000000013a8aa94 x15: ffffd15824625944
> > > [ 1228.211456] x14: ffffffffffffffff x13: 0000000000000030 x12: 0101010101010101
> > > [ 1228.218578] x11: 7f7f7f7f7f7f7f7f x10: 000000000000000a x9 : ffffd158252dd3fc
> > > [ 1228.225701] x8 : ffff80008dd4f1c0 x7 : ffff07ffa0945468 x6 : ffff80008dd4f1c0
> > > [ 1228.232823] x5 : 0000000000000018 x4 : 0000000000000000 x3 : 0000000000012c40
> > > [ 1228.239945] x2 : 000000000000000d x1 : 000000000000000c x0 : 0000000000000000
> > > [ 1228.247067] Call trace:
> > > [ 1228.249500]  xas_split_alloc+0xf8/0x128
> > > [ 1228.253324]  __filemap_add_folio+0x33c/0x4e0
> > > [ 1228.257582]  filemap_add_folio+0x48/0xd0
> > > [ 1228.261493]  page_cache_ra_order+0x214/0x310
> > > [ 1228.265750]  ondemand_readahead+0x1a8/0x320
> > > [ 1228.269921]  page_cache_async_ra+0x64/0xa8
> > > [ 1228.274005]  filemap_fault+0x238/0xaa8
> > > [ 1228.277742]  __xfs_filemap_fault+0x60/0x3c0 [xfs]
> > > [ 1228.282491]  xfs_filemap_fault+0x54/0x68 [xfs]
> > > [ 1228.286979]  __do_fault+0x40/0x210
> > > [ 1228.290368]  do_cow_fault+0xf0/0x300
> > > [ 1228.293931]  do_pte_missing+0x140/0x238
> > > [ 1228.297754]  handle_pte_fault+0x100/0x160
> > > [ 1228.301751]  __handle_mm_fault+0x100/0x310
> > > [ 1228.305835]  handle_mm_fault+0x6c/0x270
> > > [ 1228.309659]  faultin_page+0x70/0x128
> > > [ 1228.313221]  __get_user_pages+0xc8/0x2d8
> > > [ 1228.317131]  get_user_pages_unlocked+0xc4/0x3b8
> > > [ 1228.321648]  hva_to_pfn+0xf8/0x468
> > > [ 1228.325037]  __gfn_to_pfn_memslot+0xa4/0xf8
> > > [ 1228.329207]  user_mem_abort+0x174/0x7e8
> > > [ 1228.333031]  kvm_handle_guest_abort+0x2dc/0x450
> > > [ 1228.337548]  handle_exit+0x70/0x1c8
> > > [ 1228.341024]  kvm_arch_vcpu_ioctl_run+0x224/0x5b8
> > > [ 1228.345628]  kvm_vcpu_ioctl+0x28c/0x9d0
> > > [ 1228.349450]  __arm64_sys_ioctl+0xa8/0xf0
> > > [ 1228.353360]  invoke_syscall.constprop.0+0x7c/0xd0
> > > [ 1228.358052]  do_el0_svc+0xb4/0xd0
> > > [ 1228.361354]  el0_svc+0x50/0x228
> > > [ 1228.364484]  el0t_64_sync_handler+0x134/0x150
> > > [ 1228.368827]  el0t_64_sync+0x17c/0x180
> > > [ 1228.372476] ---[ end trace 0000000000000000 ]---
> > > [ 1228.377124] ------------[ cut here ]------------
> > > [ 1228.381728] WARNING: CPU: 2 PID: 4496 at lib/xarray.c:1010
> > > xas_split_alloc+0xf8/0x128
> > > [ 1228.389546] Modules linked in: binfmt_misc vhost_net vhost
> > > vhost_iotlb tap xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT
> > > nf_reject_ipv4 nft_compat nft_chain_nat nf_nat nf_conntrack
> > > nf_defrag_ipv6 nf_defrag_ipv4 nf_tables nfnetlink tun bridge stp llc
> > > qrtr rfkill sunrpc vfat fat acpi_ipmi ipmi_ssif arm_spe_pmu
> > > ipmi_devintf arm_cmn arm_dmc620_pmu ipmi_msghandler cppc_cpufreq
> > > arm_dsu_pmu xfs libcrc32c ast drm_shmem_helper drm_kms_helper drm
> > > crct10dif_ce ghash_ce igb nvme sha2_ce nvme_core sha256_arm64 sha1_ce
> > > i2c_designware_platform sbsa_gwdt nvme_common i2c_algo_bit
> > > i2c_designware_core xgene_hwmon dm_mirror dm_region_hash dm_log dm_mod
> > > fuse
> > > [ 1228.446348] CPU: 2 PID: 4496 Comm: qemu-kvm Kdump: loaded Tainted:
> > > G        W          6.6.0-rc2-zhenyzha+ #5
> > > [ 1228.456248] Hardware name: GIGABYTE R152-P31-00/MP32-AR1-00, BIOS
> > > F31h (SCP: 2.10.20220810) 07/27/2022
> > > [ 1228.465538] pstate: 80400009 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > > [ 1228.472486] pc : xas_split_alloc+0xf8/0x128
> > > [ 1228.476656] lr : __filemap_add_folio+0x33c/0x4e0
> > > [ 1228.481261] sp : ffff80008dd4f1c0
> > > [ 1228.484563] x29: ffff80008dd4f1c0 x28: ffffd15825388c40 x27: 0000000000000001
> > > [ 1228.491685] x26: 0000000000000001 x25: ffffffffffffc005 x24: 0000000000000000
> > > [ 1228.498807] x23: ffff80008dd4f270 x22: ffffffc202b00000 x21: 0000000000000000
> > > [ 1228.505930] x20: ffffffc2007f9600 x19: 000000000000000d x18: 0000000000000014
> > > [ 1228.513052] x17: 00000000b21b8a3f x16: 0000000013a8aa94 x15: ffffd15824625944
> > > [ 1228.520174] x14: ffffffffffffffff x13: 0000000000000030 x12: 0101010101010101
> > > [ 1228.527297] x11: 7f7f7f7f7f7f7f7f x10: 000000000000000a x9 : ffffd158252dd3fc
> > > [ 1228.534419] x8 : ffff80008dd4f1c0 x7 : ffff07ffa0945468 x6 : ffff80008dd4f1c0
> > > [ 1228.541542] x5 : 0000000000000018 x4 : 0000000000000000 x3 : 0000000000012c40
> > > [ 1228.548664] x2 : 000000000000000d x1 : 000000000000000c x0 : 0000000000000000
> > > [ 1228.555786] Call trace:
> > > [ 1228.558220]  xas_split_alloc+0xf8/0x128
> > > [ 1228.562043]  __filemap_add_folio+0x33c/0x4e0
> > > [ 1228.566300]  filemap_add_folio+0x48/0xd0
> > > [ 1228.570211]  page_cache_ra_order+0x214/0x310
> > > [ 1228.574469]  ondemand_readahead+0x1a8/0x320
> > > [ 1228.578639]  page_cache_async_ra+0x64/0xa8
> > > [ 1228.582724]  filemap_fault+0x238/0xaa8
> > > [ 1228.586460]  __xfs_filemap_fault+0x60/0x3c0 [xfs]
> > > [ 1228.591210]  xfs_filemap_fault+0x54/0x68 [xfs]
> > >
> > >
> > >
> > > 3. The captured data by bpftrace
> > > (The following part is the crawl analysis of gshan@redhat.com )
> > > --------------------
> > > pid:  4475    task: qemu-kvm
> > > file: /mnt/tmp/qemu_back_mem.mem-machine_mem.OdGYet (deleted)
> > >
> > > -------------------- inode --------------------
> > > i_flags:               0x0
> > > i_ino:                 67333199
> > > i_size:                32212254720
> > >
> > > ----------------- address_space ----------------
> > > flags:                 040
> > > invalidate_lock
> > >   count:               256
> > >   owner:               0xffff07fff6e759c1
> > >     pid: 4496  task: qemu-kvm
> > >   wait_list.next:      0xffff07ffa20422e0
> > >   wait_list.prev:      0xffff07ffa20422e0
> > >
> > > -------------------- xarray --------------------
> > > entry[0]:       0xffff080f7eda0002
> > > shift:          18
> > > offset:         0
> > > count:          2
> > > nr_values:      0
> > > parent:         0x0
> > > slots[00]:      0xffff07ffa094546a
> > > slots[01]:      0xffff07ffa1b09b22
> > >
> > > entry[1]:       0xffff07ffa094546a
> > > shift:          12
> > > offset:         0
> > > count:          20
> > > nr_values:      0
> > > parent:         0xffff080f7eda0000
> > > slots[00]:      0xffffffc202880000
> > > slots[01]:      0x2
> > >
> > > entry[2]:       0xffffffc202880000
> > > shift:          104
> > > offset:         128
> > > count:          0
> > > nr_values:      0
> > > parent:         0xffffffc20304c888
> > > slots[00]:      0xffff08009a960000
> > > slots[01]:      0x2001ffffffff
> > >
> > > It seems the last xarray entry ("entry[2]") has been corrupted. "shift"
> > > becomes 104 and "offset" becomes 128, which isn't reasonable.
> > > It's explaining why we run into xas_split_alloc() in __filemap_add_folio()
> > >
> > >         if (order > folio_order(folio))
> > >                 xas_split_alloc(&xas, xa_load(xas.xa, xas.xa_index),
> > >                                 order, gfp);
> > >
> > > folio_order(folio) is likely 6 since the readahead window size on the BDI device
> > > is 4MB.
> > > However, @order figured from the corrupted xarray entry is much larger than 6.
> > > log2(0x400000 / 0x10000) = log2(64) = 6
> > >
> > > [root@ampere-mtsnow-altramax-28 ~]# uname -r
> > > 6.6.0-rc2-zhenyzha+
> >
> > What commit/tree?
> >
> > > [root@ampere-mtsnow-altramax-28 ~]# cat
> > > /sys/devices/virtual/bdi/253:0/read_ahead_kb
> > > 4096
> > >
> >
> > I'm confused...
> >
> > --
> > An old man doll... just what I always wanted! - Clara
>
>


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Endless calls to xas_split_alloc() due to corrupted xarray entry
  2023-09-26  7:49     ` Zhenyu Zhang
@ 2023-09-29 10:11       ` Gavin Shan
  0 siblings, 0 replies; 11+ messages in thread
From: Gavin Shan @ 2023-09-29 10:11 UTC (permalink / raw)
  To: Zhenyu Zhang, Darrick J. Wong
  Cc: Bagas Sanjaya, Linux XFS, Linux Filesystems Development,
	Linux Kernel Mailing List, Shaoqin Huang, Matthew Wilcox,
	Chandan Babu R, Andrew Morton, Linus Torvalds

[-- Attachment #1: Type: text/plain, Size: 4219 bytes --]

Hi Zhenyu & Darrick,

On 9/26/23 17:49, Zhenyu Zhang wrote:
> 
> The issue gets fixed in rc3. However, it seems not caused by commit
> 6d2779ecaeb56f9 because I can't reproduce the issue with rc3 and
> the commit revert. I'm running 'git bisect' to nail it down. Hopefully,
> I can identify the problematic commit soon.
> 

The issue is still existing in rc3. I can even reproduce it with a program
running inside a virtual machine, where a 1GB private VMA mapped on xfs file
"/tmp/test_data" and it's populated via madvisde(buf, 1GB, MADV_POPULATE_WRITE).
The idea is to mimic QEMU's behavior. Note that the test program is put into
a memory cgroup so that memory claim happens due to the memory size limits.

I'm attaching the test program and script.

guest# uname -r
6.6.0-rc3
guest# lscpu
Architecture:           aarch64
   CPU op-mode(s):       32-bit, 64-bit
   Byte Order:           Little Endian
CPU(s):                 48
   On-line CPU(s) list:  0-47
    :
guest# cat /proc/1/smaps | grep KernelPage | head -n 1
KernelPageSize:       64 kB


[  485.002792] WARNING: CPU: 39 PID: 2370 at lib/xarray.c:1010 xas_split_alloc+0xf8/0x128
[  485.003389] Modules linked in: nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 rfkill ip_set nf_tables nfnetlink vfat fat virtio_balloon drm fuse xfs libcrc32c crct10dif_ce ghash_ce sha2_ce virtio_net net_failover sha256_arm64 virtio_blk failover sha1_ce virtio_console virtio_mmio
[  485.006058] CPU: 39 PID: 2370 Comm: test Kdump: loaded Tainted: G        W          6.6.0-rc3-gavin+ #3
[  485.006763] Hardware name: QEMU KVM Virtual Machine, BIOS edk2-20230524-3.el9 05/24/2023
[  485.007365] pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  485.007887] pc : xas_split_alloc+0xf8/0x128
[  485.008205] lr : __filemap_add_folio+0x33c/0x4e0
[  485.008550] sp : ffff80008e6af4f0
[  485.008802] x29: ffff80008e6af4f0 x28: ffffcc3538ea8d00 x27: 0000000000000001
[  485.009347] x26: 0000000000000001 x25: ffffffffffffc005 x24: 0000000000000000
[  485.009878] x23: ffff80008e6af5a0 x22: 000008c0b0001d01 x21: 0000000000000000
[  485.010411] x20: ffffffc001fb8bc0 x19: 000000000000000d x18: 0000000000000014
[  485.010948] x17: 00000000e8438802 x16: 00000000831d1d75 x15: ffffcc3538465968
[  485.011487] x14: ffffcc3538465380 x13: ffffcc353812668c x12: ffffcc3538126584
[  485.012019] x11: ffffcc353811160c x10: ffffcc3538e01054 x9 : ffffcc3538dfc1bc
[  485.012557] x8 : ffff80008e6af4f0 x7 : ffff0000e0b706d8 x6 : ffff80008e6af4f0
[  485.013089] x5 : 0000000000000002 x4 : 0000000000000000 x3 : 0000000000012c40
[  485.013614] x2 : 000000000000000d x1 : 000000000000000c x0 : 0000000000000000
[  485.014139] Call trace:
[  485.014321]  xas_split_alloc+0xf8/0x128
[  485.014613]  __filemap_add_folio+0x33c/0x4e0
[  485.014934]  filemap_add_folio+0x48/0xd0
[  485.015227]  page_cache_ra_unbounded+0xf0/0x1f0
[  485.015573]  page_cache_ra_order+0x8c/0x310
[  485.015889]  filemap_fault+0x67c/0xaa8
[  485.016167]  __xfs_filemap_fault+0x60/0x3c0 [xfs]
[  485.016588]  xfs_filemap_fault+0x54/0x68 [xfs]
[  485.016981]  __do_fault+0x40/0x210
[  485.017233]  do_cow_fault+0xf0/0x300
[  485.017496]  do_pte_missing+0x140/0x238
[  485.017782]  handle_pte_fault+0x100/0x160
[  485.018076]  __handle_mm_fault+0x100/0x310
[  485.018385]  handle_mm_fault+0x6c/0x270
[  485.018676]  faultin_page+0x70/0x128
[  485.018948]  __get_user_pages+0xc8/0x2d8
[  485.019252]  faultin_vma_page_range+0x64/0x98
[  485.019576]  madvise_populate+0xb4/0x1f8
[  485.019870]  madvise_vma_behavior+0x208/0x6a0
[  485.020195]  do_madvise.part.0+0x150/0x430
[  485.020501]  __arm64_sys_madvise+0x64/0x78
[  485.020806]  invoke_syscall.constprop.0+0x7c/0xd0
[  485.021163]  do_el0_svc+0xb4/0xd0
[  485.021413]  el0_svc+0x50/0x228
[  485.021646]  el0t_64_sync_handler+0x134/0x150
[  485.021972]  el0t_64_sync+0x17c/0x180

After this, the warning messages won't be raised any more after the clean page
caches are dropped by the following command. The test program either completes
or runs into OOM killer.

guest# echo 1 > /proc/sys/vm/drop_caches

[...]

Thanks,
Gavin

[-- Attachment #2: test.sh --]
[-- Type: application/x-shellscript, Size: 609 bytes --]

[-- Attachment #3: test.c --]
[-- Type: text/x-csrc, Size: 2000 bytes --]

// SPDX-License-Identifier: GPL-2.0-or-later
/*
 * Copyright (C) 2023  Red Hat, Inc.
 *
 * Author: Gavin Shan <gshan@redhat.com>
 *
 * Attempt to reproduce the xfs issue that Zhenyu observed.
 * The idea is to mimic QEMU's behavior to have private
 * mmap'ed VMA on xfs file (/tmp/test_data). The program
 * should be put into cgroup where the memory limit is set,
 * so that memory claim is enforced.
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <fcntl.h>
#include <errno.h>
#include <sys/syscall.h>
#include <sys/mman.h>

#define TEST_FILENAME	"/tmp/test_data"
#define TEST_MEM_SIZE	0x40000000

static void hold(int argc, const char *desc)
{
	int opt;

	if (argc <= 1)
		return;

	fprintf(stdout, "%s\n", desc);
	scanf("%c", &opt);
}
	
int main(int argc, char **argv)
{
	int fd = 0;
	void *buf = (void *)-1, *p;
	int pgsize = getpagesize();
	int ret;

	fd = open(TEST_FILENAME, O_RDWR);
	if (fd < 0) {
		fprintf(stderr, "Unable to open <%s>\n", TEST_FILENAME);
		return -EIO;
	}

	hold(argc, "Press any key to mmap...\n");
	buf = mmap(NULL, TEST_MEM_SIZE, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE, fd, 0);
	if (buf == (void *)-1) {
		fprintf(stderr, "Unable to mmap <%s>\n", TEST_FILENAME);
		goto cleanup;
	}

	fprintf(stdout, "mmap'ed at 0x%p\n", buf);
	ret = madvise(buf, TEST_MEM_SIZE, MADV_HUGEPAGE);
        if (ret) {
		fprintf(stderr, "Unable to madvise(MADV_HUGEPAGE)\n");
		goto cleanup;
	}

        hold(argc, "Press any key to populate...");
        fprintf(stdout, "Populate area at 0x%lx, size=0x%x\n",
                (unsigned long)buf, TEST_MEM_SIZE);
	ret = madvise(buf, TEST_MEM_SIZE, MADV_POPULATE_WRITE);
	if (ret) {
		fprintf(stderr, "Unable to madvise(MADV_POPULATE_WRITE)\n");
		goto cleanup;
	}
	
cleanup:
	hold(argc, "Press any key to munmap...");
	if (buf != (void *)-1)
		munmap(buf, TEST_MEM_SIZE);
	hold(argc, "Press any key to close...");
	if (fd > 0)
		close(fd);

	hold(argc, "Press any key to exit...");
	return 0;
}

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Endless calls to xas_split_alloc() due to corrupted xarray entry
       [not found] ` <ZRFbIJH47RkQuDid@debian.me>
  2023-09-25 15:12   ` Endless calls to xas_split_alloc() due to corrupted xarray entry Darrick J. Wong
@ 2023-09-29 19:17   ` Matthew Wilcox
  2023-09-30  2:12     ` Gavin Shan
  1 sibling, 1 reply; 11+ messages in thread
From: Matthew Wilcox @ 2023-09-29 19:17 UTC (permalink / raw)
  To: Bagas Sanjaya
  Cc: Zhenyu Zhang, Linux XFS, Linux Filesystems Development,
	Linux Kernel Mailing List, Guowen Shan, Shaoqin Huang,
	Chandan Babu R, Darrick J. Wong, Andrew Morton, Linus Torvalds

On Mon, Sep 25, 2023 at 05:04:16PM +0700, Bagas Sanjaya wrote:
> On Fri, Sep 22, 2023 at 11:56:43AM +0800, Zhenyu Zhang wrote:
> > Hi all,
> > 
> > we don't know how the xarray entry was corrupted. Maybe it's a known
> > issue to community.
> > Lets see.
> > 
> > Contents
> > --------
> > 1. Problem Statement
> > 2. The call trace
> > 3. The captured data by bpftrace
> > 
> > 
> > 1. Problem Statement
> > --------------------
> > With 4k guest and 64k host, on aarch64(Ampere's Altra Max CPU) hit Call trace:
> >     Steps:
> >     1) System setup hugepages on host.
> >        # echo 60 > /proc/sys/vm/nr_hugepages
> >     2) Mount this hugepage to /mnt/kvm_hugepage.
> >        # mount -t hugetlbfs -o pagesize=524288K none /mnt/kvm_hugepage
> 
> What block device/disk image you use to format the filesystem?

It's hugetlbfs, Bagas.

> >     3) HugePages didn't leak when using non-existent mem-path.
> >        # mkdir -p /mnt/tmp
> >     4) Boot guest.
> >        # /usr/libexec/qemu-kvm \
> > ...
> >          -m 30720 \
> > -object '{"size": 32212254720, "mem-path": "/mnt/tmp", "qom-type":
> > "memory-backend-file"}'  \
> > -smp 4,maxcpus=4,cores=2,threads=1,clusters=1,sockets=2  \
> >          -blockdev '{"node-name": "file_image1", "driver": "file",
> > "auto-read-only": true, "discard": "unmap", "aio": "threads",
> > "filename": "/home/kvm_autotest_root/images/back_up_4k.qcow2",
> > "cache": {"direct": true, "no-flush": false}}' \
> > -blockdev '{"node-name": "drive_image1", "driver": "qcow2",
> > "read-only": false, "cache": {"direct": true, "no-flush": false},
> > "file": "file_image1"}' \
> > -device '{"driver": "scsi-hd", "id": "image1", "drive":
> > "drive_image1", "write-cache": "on"}' \
> > 
> >     5) Wait about 1 minute ------> hit Call trace
> > 
> > 2. The call trace
> > --------------------
> > [   14.982751] block dm-0: the capability attribute has been deprecated.
> > [   15.690043] PEFILE: Unsigned PE binary
> > 
> > 
> > [   90.135676] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> > [   90.136629] rcu: 3-...0: (3 ticks this GP)
> > idle=e6ec/1/0x4000000000000000 softirq=6847/6849 fqs=232
> > [   90.137293] rcu: (detected by 2, t=6012 jiffies, g=2085, q=2539 ncpus=4)
> > [   90.137796] Task dump for CPU 3:
> > [   90.138037] task:PK-Backend      state:R  running task     stack:0
> >    pid:2287  ppid:1      flags:0x00000202
> > [   90.138757] Call trace:
> > [   90.138940]  __switch_to+0xc8/0x110
> > [   90.139203]  0xb54a54f8c5fb0700
> > 
> > [  270.190849] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> > [  270.191722] rcu: 3-...0: (3 ticks this GP)
> > idle=e6ec/1/0x4000000000000000 softirq=6847/6849 fqs=1020
> > [  270.192405] rcu: (detected by 1, t=24018 jiffies, g=2085, q=3104 ncpus=4)
> > [  270.192876] Task dump for CPU 3:
> > [  270.193099] task:PK-Backend      state:R  running task     stack:0
> >    pid:2287  ppid:1      flags:0x00000202
> > [  270.193774] Call trace:
> > [  270.193946]  __switch_to+0xc8/0x110
> > [  270.194336]  0xb54a54f8c5fb0700
> > 
> > [ 1228.068406] ------------[ cut here ]------------
> > [ 1228.073011] WARNING: CPU: 2 PID: 4496 at lib/xarray.c:1010
> > xas_split_alloc+0xf8/0x128
> > [ 1228.080828] Modules linked in: binfmt_misc vhost_net vhost
> > vhost_iotlb tap xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT
> > nf_reject_ipv4 nft_compat nft_chain_nat nf_nat nf_conntrack
> > nf_defrag_ipv6 nf_defrag_ipv4 nf_tables nfnetlink tun bridge stp llc
> > qrtr rfkill sunrpc vfat fat acpi_ipmi ipmi_ssif arm_spe_pmu
> > ipmi_devintf arm_cmn arm_dmc620_pmu ipmi_msghandler cppc_cpufreq
> > arm_dsu_pmu xfs libcrc32c ast drm_shmem_helper drm_kms_helper drm
> > crct10dif_ce ghash_ce igb nvme sha2_ce nvme_core sha256_arm64 sha1_ce
> > i2c_designware_platform sbsa_gwdt nvme_common i2c_algo_bit
> > i2c_designware_core xgene_hwmon dm_mirror dm_region_hash dm_log dm_mod
> > fuse
> > [ 1228.137630] CPU: 2 PID: 4496 Comm: qemu-kvm Kdump: loaded Tainted:
> > G        W          6.6.0-rc2-zhenyzha+ #5
> > [ 1228.147529] Hardware name: GIGABYTE R152-P31-00/MP32-AR1-00, BIOS
> > F31h (SCP: 2.10.20220810) 07/27/2022
> > [ 1228.156820] pstate: 80400009 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > [ 1228.163767] pc : xas_split_alloc+0xf8/0x128
> > [ 1228.167938] lr : __filemap_add_folio+0x33c/0x4e0
> > [ 1228.172543] sp : ffff80008dd4f1c0
> > [ 1228.175844] x29: ffff80008dd4f1c0 x28: ffffd15825388c40 x27: 0000000000000001
> > [ 1228.182967] x26: 0000000000000001 x25: ffffffffffffc005 x24: 0000000000000000
> > [ 1228.190089] x23: ffff80008dd4f270 x22: ffffffc202b00000 x21: 0000000000000000
> > [ 1228.197211] x20: ffffffc2007f9600 x19: 000000000000000d x18: 0000000000000014
> > [ 1228.204334] x17: 00000000b21b8a3f x16: 0000000013a8aa94 x15: ffffd15824625944
> > [ 1228.211456] x14: ffffffffffffffff x13: 0000000000000030 x12: 0101010101010101
> > [ 1228.218578] x11: 7f7f7f7f7f7f7f7f x10: 000000000000000a x9 : ffffd158252dd3fc
> > [ 1228.225701] x8 : ffff80008dd4f1c0 x7 : ffff07ffa0945468 x6 : ffff80008dd4f1c0
> > [ 1228.232823] x5 : 0000000000000018 x4 : 0000000000000000 x3 : 0000000000012c40
> > [ 1228.239945] x2 : 000000000000000d x1 : 000000000000000c x0 : 0000000000000000
> > [ 1228.247067] Call trace:
> > [ 1228.249500]  xas_split_alloc+0xf8/0x128
> > [ 1228.253324]  __filemap_add_folio+0x33c/0x4e0
> > [ 1228.257582]  filemap_add_folio+0x48/0xd0
> > [ 1228.261493]  page_cache_ra_order+0x214/0x310
> > [ 1228.265750]  ondemand_readahead+0x1a8/0x320
> > [ 1228.269921]  page_cache_async_ra+0x64/0xa8
> > [ 1228.274005]  filemap_fault+0x238/0xaa8
> > [ 1228.277742]  __xfs_filemap_fault+0x60/0x3c0 [xfs]
> > [ 1228.282491]  xfs_filemap_fault+0x54/0x68 [xfs]

This is interesting.  This path has nothing to do with the hugetlbfs
filesystem you've created up above.  And, just to be clear, this is
on the host, not in the guest, right?

> > [ 1228.377124] ------------[ cut here ]------------
> > [ 1228.381728] WARNING: CPU: 2 PID: 4496 at lib/xarray.c:1010
> > xas_split_alloc+0xf8/0x128
> > [ 1228.389546] Modules linked in: binfmt_misc vhost_net vhost
> > vhost_iotlb tap xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT
> > nf_reject_ipv4 nft_compat nft_chain_nat nf_nat nf_conntrack
> > nf_defrag_ipv6 nf_defrag_ipv4 nf_tables nfnetlink tun bridge stp llc
> > qrtr rfkill sunrpc vfat fat acpi_ipmi ipmi_ssif arm_spe_pmu
> > ipmi_devintf arm_cmn arm_dmc620_pmu ipmi_msghandler cppc_cpufreq
> > arm_dsu_pmu xfs libcrc32c ast drm_shmem_helper drm_kms_helper drm
> > crct10dif_ce ghash_ce igb nvme sha2_ce nvme_core sha256_arm64 sha1_ce
> > i2c_designware_platform sbsa_gwdt nvme_common i2c_algo_bit
> > i2c_designware_core xgene_hwmon dm_mirror dm_region_hash dm_log dm_mod
> > fuse
> > [ 1228.446348] CPU: 2 PID: 4496 Comm: qemu-kvm Kdump: loaded Tainted:
> > G        W          6.6.0-rc2-zhenyzha+ #5
> > [ 1228.456248] Hardware name: GIGABYTE R152-P31-00/MP32-AR1-00, BIOS
> > F31h (SCP: 2.10.20220810) 07/27/2022
> > [ 1228.465538] pstate: 80400009 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > [ 1228.472486] pc : xas_split_alloc+0xf8/0x128
> > [ 1228.476656] lr : __filemap_add_folio+0x33c/0x4e0
> > [ 1228.481261] sp : ffff80008dd4f1c0
> > [ 1228.484563] x29: ffff80008dd4f1c0 x28: ffffd15825388c40 x27: 0000000000000001
> > [ 1228.491685] x26: 0000000000000001 x25: ffffffffffffc005 x24: 0000000000000000
> > [ 1228.498807] x23: ffff80008dd4f270 x22: ffffffc202b00000 x21: 0000000000000000
> > [ 1228.505930] x20: ffffffc2007f9600 x19: 000000000000000d x18: 0000000000000014
> > [ 1228.513052] x17: 00000000b21b8a3f x16: 0000000013a8aa94 x15: ffffd15824625944
> > [ 1228.520174] x14: ffffffffffffffff x13: 0000000000000030 x12: 0101010101010101
> > [ 1228.527297] x11: 7f7f7f7f7f7f7f7f x10: 000000000000000a x9 : ffffd158252dd3fc
> > [ 1228.534419] x8 : ffff80008dd4f1c0 x7 : ffff07ffa0945468 x6 : ffff80008dd4f1c0
> > [ 1228.541542] x5 : 0000000000000018 x4 : 0000000000000000 x3 : 0000000000012c40
> > [ 1228.548664] x2 : 000000000000000d x1 : 000000000000000c x0 : 0000000000000000
> > [ 1228.555786] Call trace:
> > [ 1228.558220]  xas_split_alloc+0xf8/0x128
> > [ 1228.562043]  __filemap_add_folio+0x33c/0x4e0
> > [ 1228.566300]  filemap_add_folio+0x48/0xd0
> > [ 1228.570211]  page_cache_ra_order+0x214/0x310
> > [ 1228.574469]  ondemand_readahead+0x1a8/0x320
> > [ 1228.578639]  page_cache_async_ra+0x64/0xa8
> > [ 1228.582724]  filemap_fault+0x238/0xaa8
> > [ 1228.586460]  __xfs_filemap_fault+0x60/0x3c0 [xfs]
> > [ 1228.591210]  xfs_filemap_fault+0x54/0x68 [xfs]
> > 
> > 
> > 
> > 3. The captured data by bpftrace
> > (The following part is the crawl analysis of gshan@redhat.com )
> > --------------------
> > pid:  4475    task: qemu-kvm
> > file: /mnt/tmp/qemu_back_mem.mem-machine_mem.OdGYet (deleted)
> > 
> > -------------------- inode --------------------
> > i_flags:               0x0
> > i_ino:                 67333199
> > i_size:                32212254720
> > 
> > ----------------- address_space ----------------
> > flags:                 040
> > invalidate_lock
> >   count:               256
> >   owner:               0xffff07fff6e759c1
> >     pid: 4496  task: qemu-kvm
> >   wait_list.next:      0xffff07ffa20422e0
> >   wait_list.prev:      0xffff07ffa20422e0
> > 
> > -------------------- xarray --------------------
> > entry[0]:       0xffff080f7eda0002
> > shift:          18
> > offset:         0
> > count:          2
> > nr_values:      0
> > parent:         0x0
> > slots[00]:      0xffff07ffa094546a
> > slots[01]:      0xffff07ffa1b09b22
> > 
> > entry[1]:       0xffff07ffa094546a
> > shift:          12
> > offset:         0
> > count:          20
> > nr_values:      0
> > parent:         0xffff080f7eda0000
> > slots[00]:      0xffffffc202880000
> > slots[01]:      0x2
> > 
> > entry[2]:       0xffffffc202880000
> > shift:          104
> > offset:         128
> > count:          0
> > nr_values:      0
> > parent:         0xffffffc20304c888
> > slots[00]:      0xffff08009a960000
> > slots[01]:      0x2001ffffffff
> > 
> > It seems the last xarray entry ("entry[2]") has been corrupted. "shift"
> > becomes 104 and "offset" becomes 128, which isn't reasonable.

Um, no.  Whatever tool you're using doesn't understand how XArrays work.
Fortunately, I wrote xa_dump() which does.  entry[2] does not have bit
1 set, so it is an entry, not a node.  You're dereferencing a pointer to
a folio as if it's a pointer to a node, so no wonder it looks corrupted
to you.  From this, we know that the folio is at least order-6, and it's
probably order-9 (because I bet this VMA has the VM_HUGEPAGE flag set,
and we're doing PMD-sized faults).


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Endless calls to xas_split_alloc() due to corrupted xarray entry
  2023-09-29 19:17   ` Matthew Wilcox
@ 2023-09-30  2:12     ` Gavin Shan
  2024-06-19  9:45       ` David Hildenbrand
  0 siblings, 1 reply; 11+ messages in thread
From: Gavin Shan @ 2023-09-30  2:12 UTC (permalink / raw)
  To: Matthew Wilcox, Bagas Sanjaya
  Cc: Zhenyu Zhang, Linux XFS, Linux Filesystems Development,
	Linux Kernel Mailing List, Shaoqin Huang, Chandan Babu R,
	Darrick J. Wong, Andrew Morton, Linus Torvalds

Hi Matthew,

On 9/30/23 05:17, Matthew Wilcox wrote:
> On Mon, Sep 25, 2023 at 05:04:16PM +0700, Bagas Sanjaya wrote:
>> On Fri, Sep 22, 2023 at 11:56:43AM +0800, Zhenyu Zhang wrote:
>>> Hi all,
>>>
>>> we don't know how the xarray entry was corrupted. Maybe it's a known
>>> issue to community.
>>> Lets see.
>>>
>>> Contents
>>> --------
>>> 1. Problem Statement
>>> 2. The call trace
>>> 3. The captured data by bpftrace
>>>
>>>
>>> 1. Problem Statement
>>> --------------------
>>> With 4k guest and 64k host, on aarch64(Ampere's Altra Max CPU) hit Call trace:
>>>      Steps:
>>>      1) System setup hugepages on host.
>>>         # echo 60 > /proc/sys/vm/nr_hugepages
>>>      2) Mount this hugepage to /mnt/kvm_hugepage.
>>>         # mount -t hugetlbfs -o pagesize=524288K none /mnt/kvm_hugepage
>>
>> What block device/disk image you use to format the filesystem?
> 
> It's hugetlbfs, Bagas.
> 

The hugetlbfs pages are reserved, but never used. In this way, the available
system memory is reduced. So it's same affect as to "mem=xxx" boot parameter.

>>>      3) HugePages didn't leak when using non-existent mem-path.
>>>         # mkdir -p /mnt/tmp
>>>      4) Boot guest.
>>>         # /usr/libexec/qemu-kvm \
>>> ...
>>>           -m 30720 \
>>> -object '{"size": 32212254720, "mem-path": "/mnt/tmp", "qom-type":
>>> "memory-backend-file"}'  \
>>> -smp 4,maxcpus=4,cores=2,threads=1,clusters=1,sockets=2  \
>>>           -blockdev '{"node-name": "file_image1", "driver": "file",
>>> "auto-read-only": true, "discard": "unmap", "aio": "threads",
>>> "filename": "/home/kvm_autotest_root/images/back_up_4k.qcow2",
>>> "cache": {"direct": true, "no-flush": false}}' \
>>> -blockdev '{"node-name": "drive_image1", "driver": "qcow2",
>>> "read-only": false, "cache": {"direct": true, "no-flush": false},
>>> "file": "file_image1"}' \
>>> -device '{"driver": "scsi-hd", "id": "image1", "drive":
>>> "drive_image1", "write-cache": "on"}' \
>>>
>>>      5) Wait about 1 minute ------> hit Call trace
>>>
>>> 2. The call trace
>>> --------------------
>>> [   14.982751] block dm-0: the capability attribute has been deprecated.
>>> [   15.690043] PEFILE: Unsigned PE binary
>>>
>>>
>>> [   90.135676] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
>>> [   90.136629] rcu: 3-...0: (3 ticks this GP)
>>> idle=e6ec/1/0x4000000000000000 softirq=6847/6849 fqs=232
>>> [   90.137293] rcu: (detected by 2, t=6012 jiffies, g=2085, q=2539 ncpus=4)
>>> [   90.137796] Task dump for CPU 3:
>>> [   90.138037] task:PK-Backend      state:R  running task     stack:0
>>>     pid:2287  ppid:1      flags:0x00000202
>>> [   90.138757] Call trace:
>>> [   90.138940]  __switch_to+0xc8/0x110
>>> [   90.139203]  0xb54a54f8c5fb0700
>>>
>>> [  270.190849] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
>>> [  270.191722] rcu: 3-...0: (3 ticks this GP)
>>> idle=e6ec/1/0x4000000000000000 softirq=6847/6849 fqs=1020
>>> [  270.192405] rcu: (detected by 1, t=24018 jiffies, g=2085, q=3104 ncpus=4)
>>> [  270.192876] Task dump for CPU 3:
>>> [  270.193099] task:PK-Backend      state:R  running task     stack:0
>>>     pid:2287  ppid:1      flags:0x00000202
>>> [  270.193774] Call trace:
>>> [  270.193946]  __switch_to+0xc8/0x110
>>> [  270.194336]  0xb54a54f8c5fb0700
>>>
>>> [ 1228.068406] ------------[ cut here ]------------
>>> [ 1228.073011] WARNING: CPU: 2 PID: 4496 at lib/xarray.c:1010
>>> xas_split_alloc+0xf8/0x128
>>> [ 1228.080828] Modules linked in: binfmt_misc vhost_net vhost
>>> vhost_iotlb tap xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT
>>> nf_reject_ipv4 nft_compat nft_chain_nat nf_nat nf_conntrack
>>> nf_defrag_ipv6 nf_defrag_ipv4 nf_tables nfnetlink tun bridge stp llc
>>> qrtr rfkill sunrpc vfat fat acpi_ipmi ipmi_ssif arm_spe_pmu
>>> ipmi_devintf arm_cmn arm_dmc620_pmu ipmi_msghandler cppc_cpufreq
>>> arm_dsu_pmu xfs libcrc32c ast drm_shmem_helper drm_kms_helper drm
>>> crct10dif_ce ghash_ce igb nvme sha2_ce nvme_core sha256_arm64 sha1_ce
>>> i2c_designware_platform sbsa_gwdt nvme_common i2c_algo_bit
>>> i2c_designware_core xgene_hwmon dm_mirror dm_region_hash dm_log dm_mod
>>> fuse
>>> [ 1228.137630] CPU: 2 PID: 4496 Comm: qemu-kvm Kdump: loaded Tainted:
>>> G        W          6.6.0-rc2-zhenyzha+ #5
>>> [ 1228.147529] Hardware name: GIGABYTE R152-P31-00/MP32-AR1-00, BIOS
>>> F31h (SCP: 2.10.20220810) 07/27/2022
>>> [ 1228.156820] pstate: 80400009 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>>> [ 1228.163767] pc : xas_split_alloc+0xf8/0x128
>>> [ 1228.167938] lr : __filemap_add_folio+0x33c/0x4e0
>>> [ 1228.172543] sp : ffff80008dd4f1c0
>>> [ 1228.175844] x29: ffff80008dd4f1c0 x28: ffffd15825388c40 x27: 0000000000000001
>>> [ 1228.182967] x26: 0000000000000001 x25: ffffffffffffc005 x24: 0000000000000000
>>> [ 1228.190089] x23: ffff80008dd4f270 x22: ffffffc202b00000 x21: 0000000000000000
>>> [ 1228.197211] x20: ffffffc2007f9600 x19: 000000000000000d x18: 0000000000000014
>>> [ 1228.204334] x17: 00000000b21b8a3f x16: 0000000013a8aa94 x15: ffffd15824625944
>>> [ 1228.211456] x14: ffffffffffffffff x13: 0000000000000030 x12: 0101010101010101
>>> [ 1228.218578] x11: 7f7f7f7f7f7f7f7f x10: 000000000000000a x9 : ffffd158252dd3fc
>>> [ 1228.225701] x8 : ffff80008dd4f1c0 x7 : ffff07ffa0945468 x6 : ffff80008dd4f1c0
>>> [ 1228.232823] x5 : 0000000000000018 x4 : 0000000000000000 x3 : 0000000000012c40
>>> [ 1228.239945] x2 : 000000000000000d x1 : 000000000000000c x0 : 0000000000000000
>>> [ 1228.247067] Call trace:
>>> [ 1228.249500]  xas_split_alloc+0xf8/0x128
>>> [ 1228.253324]  __filemap_add_folio+0x33c/0x4e0
>>> [ 1228.257582]  filemap_add_folio+0x48/0xd0
>>> [ 1228.261493]  page_cache_ra_order+0x214/0x310
>>> [ 1228.265750]  ondemand_readahead+0x1a8/0x320
>>> [ 1228.269921]  page_cache_async_ra+0x64/0xa8
>>> [ 1228.274005]  filemap_fault+0x238/0xaa8
>>> [ 1228.277742]  __xfs_filemap_fault+0x60/0x3c0 [xfs]
>>> [ 1228.282491]  xfs_filemap_fault+0x54/0x68 [xfs]
> 
> This is interesting.  This path has nothing to do with the hugetlbfs
> filesystem you've created up above.  And, just to be clear, this is
> on the host, not in the guest, right?
> 

Correct, the backtrce is seen on the host. The XFS file is used as backup
memory to the guest. QEMU maps the entire file as PRIVATE and the VMA has
been advised to huge page by madvise(MADV_HUGEPAGE). When the guest is
started, QEMU calls madvise(MADV_POPULATE_WRITE) to populate the VMA. Since
the VMA is private, there are copy-on-write page fault happening on
calling to madvise(MADV_POPULATE_WRITE). In the page fault handler,
there are readahead reuqests to be processed.

The backtrace, originating from WARN_ON(), is triggered when attempt to
allocate a huge page fails in the middle of readahead. In this specific
case, we're falling back to order-0 with attempt to modify the xarray
for this. Unfortunately, it's reported this particular scenario isn't
supported by xas_split_alloc().


>>> [ 1228.377124] ------------[ cut here ]------------
>>> [ 1228.381728] WARNING: CPU: 2 PID: 4496 at lib/xarray.c:1010
>>> xas_split_alloc+0xf8/0x128
>>> [ 1228.389546] Modules linked in: binfmt_misc vhost_net vhost
>>> vhost_iotlb tap xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT
>>> nf_reject_ipv4 nft_compat nft_chain_nat nf_nat nf_conntrack
>>> nf_defrag_ipv6 nf_defrag_ipv4 nf_tables nfnetlink tun bridge stp llc
>>> qrtr rfkill sunrpc vfat fat acpi_ipmi ipmi_ssif arm_spe_pmu
>>> ipmi_devintf arm_cmn arm_dmc620_pmu ipmi_msghandler cppc_cpufreq
>>> arm_dsu_pmu xfs libcrc32c ast drm_shmem_helper drm_kms_helper drm
>>> crct10dif_ce ghash_ce igb nvme sha2_ce nvme_core sha256_arm64 sha1_ce
>>> i2c_designware_platform sbsa_gwdt nvme_common i2c_algo_bit
>>> i2c_designware_core xgene_hwmon dm_mirror dm_region_hash dm_log dm_mod
>>> fuse
>>> [ 1228.446348] CPU: 2 PID: 4496 Comm: qemu-kvm Kdump: loaded Tainted:
>>> G        W          6.6.0-rc2-zhenyzha+ #5
>>> [ 1228.456248] Hardware name: GIGABYTE R152-P31-00/MP32-AR1-00, BIOS
>>> F31h (SCP: 2.10.20220810) 07/27/2022
>>> [ 1228.465538] pstate: 80400009 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>>> [ 1228.472486] pc : xas_split_alloc+0xf8/0x128
>>> [ 1228.476656] lr : __filemap_add_folio+0x33c/0x4e0
>>> [ 1228.481261] sp : ffff80008dd4f1c0
>>> [ 1228.484563] x29: ffff80008dd4f1c0 x28: ffffd15825388c40 x27: 0000000000000001
>>> [ 1228.491685] x26: 0000000000000001 x25: ffffffffffffc005 x24: 0000000000000000
>>> [ 1228.498807] x23: ffff80008dd4f270 x22: ffffffc202b00000 x21: 0000000000000000
>>> [ 1228.505930] x20: ffffffc2007f9600 x19: 000000000000000d x18: 0000000000000014
>>> [ 1228.513052] x17: 00000000b21b8a3f x16: 0000000013a8aa94 x15: ffffd15824625944
>>> [ 1228.520174] x14: ffffffffffffffff x13: 0000000000000030 x12: 0101010101010101
>>> [ 1228.527297] x11: 7f7f7f7f7f7f7f7f x10: 000000000000000a x9 : ffffd158252dd3fc
>>> [ 1228.534419] x8 : ffff80008dd4f1c0 x7 : ffff07ffa0945468 x6 : ffff80008dd4f1c0
>>> [ 1228.541542] x5 : 0000000000000018 x4 : 0000000000000000 x3 : 0000000000012c40
>>> [ 1228.548664] x2 : 000000000000000d x1 : 000000000000000c x0 : 0000000000000000
>>> [ 1228.555786] Call trace:
>>> [ 1228.558220]  xas_split_alloc+0xf8/0x128
>>> [ 1228.562043]  __filemap_add_folio+0x33c/0x4e0
>>> [ 1228.566300]  filemap_add_folio+0x48/0xd0
>>> [ 1228.570211]  page_cache_ra_order+0x214/0x310
>>> [ 1228.574469]  ondemand_readahead+0x1a8/0x320
>>> [ 1228.578639]  page_cache_async_ra+0x64/0xa8
>>> [ 1228.582724]  filemap_fault+0x238/0xaa8
>>> [ 1228.586460]  __xfs_filemap_fault+0x60/0x3c0 [xfs]
>>> [ 1228.591210]  xfs_filemap_fault+0x54/0x68 [xfs]
>>>
>>>
>>>
>>> 3. The captured data by bpftrace
>>> (The following part is the crawl analysis of gshan@redhat.com )
>>> --------------------
>>> pid:  4475    task: qemu-kvm
>>> file: /mnt/tmp/qemu_back_mem.mem-machine_mem.OdGYet (deleted)
>>>
>>> -------------------- inode --------------------
>>> i_flags:               0x0
>>> i_ino:                 67333199
>>> i_size:                32212254720
>>>
>>> ----------------- address_space ----------------
>>> flags:                 040
>>> invalidate_lock
>>>    count:               256
>>>    owner:               0xffff07fff6e759c1
>>>      pid: 4496  task: qemu-kvm
>>>    wait_list.next:      0xffff07ffa20422e0
>>>    wait_list.prev:      0xffff07ffa20422e0
>>>
>>> -------------------- xarray --------------------
>>> entry[0]:       0xffff080f7eda0002
>>> shift:          18
>>> offset:         0
>>> count:          2
>>> nr_values:      0
>>> parent:         0x0
>>> slots[00]:      0xffff07ffa094546a
>>> slots[01]:      0xffff07ffa1b09b22
>>>
>>> entry[1]:       0xffff07ffa094546a
>>> shift:          12
>>> offset:         0
>>> count:          20
>>> nr_values:      0
>>> parent:         0xffff080f7eda0000
>>> slots[00]:      0xffffffc202880000
>>> slots[01]:      0x2
>>>
>>> entry[2]:       0xffffffc202880000
>>> shift:          104
>>> offset:         128
>>> count:          0
>>> nr_values:      0
>>> parent:         0xffffffc20304c888
>>> slots[00]:      0xffff08009a960000
>>> slots[01]:      0x2001ffffffff
>>>
>>> It seems the last xarray entry ("entry[2]") has been corrupted. "shift"
>>> becomes 104 and "offset" becomes 128, which isn't reasonable.
> 
> Um, no.  Whatever tool you're using doesn't understand how XArrays work.
> Fortunately, I wrote xa_dump() which does.  entry[2] does not have bit
> 1 set, so it is an entry, not a node.  You're dereferencing a pointer to
> a folio as if it's a pointer to a node, so no wonder it looks corrupted
> to you.  From this, we know that the folio is at least order-6, and it's
> probably order-9 (because I bet this VMA has the VM_HUGEPAGE flag set,
> and we're doing PMD-sized faults).
> 

Indeed, entry[2] is a entry instead of a node, deferencing a folio.
bpftrace was used to dump the xarray. you're correct that the VMA has
flag VM_HUGEPAGE, set by madvise(MADV_HUGEPAGE). The order returned by
xas_get_order() is 13, passed to xas_split_alloc().

/*
  * xas->xa_shift    = 0
  * XA_CHUNK_SHIFT   = 6
  * order            = 13      (512MB huge page size vs 64KB base page size)
  */
void xas_split_alloc(struct xa_state *xas, void *entry, unsigned int order,
                 gfp_t gfp)
{
         unsigned int sibs = (1 << (order % XA_CHUNK_SHIFT)) - 1;
         unsigned int mask = xas->xa_sibs;

         /* XXX: no support for splitting really large entries yet */
         if (WARN_ON(xas->xa_shift + 2 * XA_CHUNK_SHIFT < order))
                 goto nomem;
         :
}

I've shared a simplified reproducer in another reply. In that reproducer,
we just run a program in a memroy cgroup, where the memory space is limited.
The program mimics what QEMU does. The similar backtrace can be seen.

Thanks,
Gavin


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Endless calls to xas_split_alloc() due to corrupted xarray entry
  2023-09-30  2:12     ` Gavin Shan
@ 2024-06-19  9:45       ` David Hildenbrand
  2024-06-19 14:31         ` Matthew Wilcox
  0 siblings, 1 reply; 11+ messages in thread
From: David Hildenbrand @ 2024-06-19  9:45 UTC (permalink / raw)
  To: Gavin Shan, Matthew Wilcox, Bagas Sanjaya
  Cc: Zhenyu Zhang, Linux XFS, Linux Filesystems Development,
	Linux Kernel Mailing List, Shaoqin Huang, Chandan Babu R,
	Darrick J. Wong, Andrew Morton, Linus Torvalds

On 30.09.23 04:12, Gavin Shan wrote:
> Hi Matthew,
> 
> On 9/30/23 05:17, Matthew Wilcox wrote:
>> On Mon, Sep 25, 2023 at 05:04:16PM +0700, Bagas Sanjaya wrote:
>>> On Fri, Sep 22, 2023 at 11:56:43AM +0800, Zhenyu Zhang wrote:
>>>> Hi all,
>>>>
>>>> we don't know how the xarray entry was corrupted. Maybe it's a known
>>>> issue to community.
>>>> Lets see.
>>>>
>>>> Contents
>>>> --------
>>>> 1. Problem Statement
>>>> 2. The call trace
>>>> 3. The captured data by bpftrace
>>>>
>>>>
>>>> 1. Problem Statement
>>>> --------------------
>>>> With 4k guest and 64k host, on aarch64(Ampere's Altra Max CPU) hit Call trace:
>>>>       Steps:
>>>>       1) System setup hugepages on host.
>>>>          # echo 60 > /proc/sys/vm/nr_hugepages
>>>>       2) Mount this hugepage to /mnt/kvm_hugepage.
>>>>          # mount -t hugetlbfs -o pagesize=524288K none /mnt/kvm_hugepage
>>>
>>> What block device/disk image you use to format the filesystem?
>>
>> It's hugetlbfs, Bagas.
>>
> 
> The hugetlbfs pages are reserved, but never used. In this way, the available
> system memory is reduced. So it's same affect as to "mem=xxx" boot parameter.
> 
>>>>       3) HugePages didn't leak when using non-existent mem-path.
>>>>          # mkdir -p /mnt/tmp
>>>>       4) Boot guest.
>>>>          # /usr/libexec/qemu-kvm \
>>>> ...
>>>>            -m 30720 \
>>>> -object '{"size": 32212254720, "mem-path": "/mnt/tmp", "qom-type":
>>>> "memory-backend-file"}'  \
>>>> -smp 4,maxcpus=4,cores=2,threads=1,clusters=1,sockets=2  \
>>>>            -blockdev '{"node-name": "file_image1", "driver": "file",
>>>> "auto-read-only": true, "discard": "unmap", "aio": "threads",
>>>> "filename": "/home/kvm_autotest_root/images/back_up_4k.qcow2",
>>>> "cache": {"direct": true, "no-flush": false}}' \
>>>> -blockdev '{"node-name": "drive_image1", "driver": "qcow2",
>>>> "read-only": false, "cache": {"direct": true, "no-flush": false},
>>>> "file": "file_image1"}' \
>>>> -device '{"driver": "scsi-hd", "id": "image1", "drive":
>>>> "drive_image1", "write-cache": "on"}' \
>>>>
>>>>       5) Wait about 1 minute ------> hit Call trace
>>>>
>>>> 2. The call trace
>>>> --------------------
>>>> [   14.982751] block dm-0: the capability attribute has been deprecated.
>>>> [   15.690043] PEFILE: Unsigned PE binary
>>>>
>>>>
>>>> [   90.135676] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
>>>> [   90.136629] rcu: 3-...0: (3 ticks this GP)
>>>> idle=e6ec/1/0x4000000000000000 softirq=6847/6849 fqs=232
>>>> [   90.137293] rcu: (detected by 2, t=6012 jiffies, g=2085, q=2539 ncpus=4)
>>>> [   90.137796] Task dump for CPU 3:
>>>> [   90.138037] task:PK-Backend      state:R  running task     stack:0
>>>>      pid:2287  ppid:1      flags:0x00000202
>>>> [   90.138757] Call trace:
>>>> [   90.138940]  __switch_to+0xc8/0x110
>>>> [   90.139203]  0xb54a54f8c5fb0700
>>>>
>>>> [  270.190849] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
>>>> [  270.191722] rcu: 3-...0: (3 ticks this GP)
>>>> idle=e6ec/1/0x4000000000000000 softirq=6847/6849 fqs=1020
>>>> [  270.192405] rcu: (detected by 1, t=24018 jiffies, g=2085, q=3104 ncpus=4)
>>>> [  270.192876] Task dump for CPU 3:
>>>> [  270.193099] task:PK-Backend      state:R  running task     stack:0
>>>>      pid:2287  ppid:1      flags:0x00000202
>>>> [  270.193774] Call trace:
>>>> [  270.193946]  __switch_to+0xc8/0x110
>>>> [  270.194336]  0xb54a54f8c5fb0700
>>>>
>>>> [ 1228.068406] ------------[ cut here ]------------
>>>> [ 1228.073011] WARNING: CPU: 2 PID: 4496 at lib/xarray.c:1010
>>>> xas_split_alloc+0xf8/0x128
>>>> [ 1228.080828] Modules linked in: binfmt_misc vhost_net vhost
>>>> vhost_iotlb tap xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT
>>>> nf_reject_ipv4 nft_compat nft_chain_nat nf_nat nf_conntrack
>>>> nf_defrag_ipv6 nf_defrag_ipv4 nf_tables nfnetlink tun bridge stp llc
>>>> qrtr rfkill sunrpc vfat fat acpi_ipmi ipmi_ssif arm_spe_pmu
>>>> ipmi_devintf arm_cmn arm_dmc620_pmu ipmi_msghandler cppc_cpufreq
>>>> arm_dsu_pmu xfs libcrc32c ast drm_shmem_helper drm_kms_helper drm
>>>> crct10dif_ce ghash_ce igb nvme sha2_ce nvme_core sha256_arm64 sha1_ce
>>>> i2c_designware_platform sbsa_gwdt nvme_common i2c_algo_bit
>>>> i2c_designware_core xgene_hwmon dm_mirror dm_region_hash dm_log dm_mod
>>>> fuse
>>>> [ 1228.137630] CPU: 2 PID: 4496 Comm: qemu-kvm Kdump: loaded Tainted:
>>>> G        W          6.6.0-rc2-zhenyzha+ #5
>>>> [ 1228.147529] Hardware name: GIGABYTE R152-P31-00/MP32-AR1-00, BIOS
>>>> F31h (SCP: 2.10.20220810) 07/27/2022
>>>> [ 1228.156820] pstate: 80400009 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>>>> [ 1228.163767] pc : xas_split_alloc+0xf8/0x128
>>>> [ 1228.167938] lr : __filemap_add_folio+0x33c/0x4e0
>>>> [ 1228.172543] sp : ffff80008dd4f1c0
>>>> [ 1228.175844] x29: ffff80008dd4f1c0 x28: ffffd15825388c40 x27: 0000000000000001
>>>> [ 1228.182967] x26: 0000000000000001 x25: ffffffffffffc005 x24: 0000000000000000
>>>> [ 1228.190089] x23: ffff80008dd4f270 x22: ffffffc202b00000 x21: 0000000000000000
>>>> [ 1228.197211] x20: ffffffc2007f9600 x19: 000000000000000d x18: 0000000000000014
>>>> [ 1228.204334] x17: 00000000b21b8a3f x16: 0000000013a8aa94 x15: ffffd15824625944
>>>> [ 1228.211456] x14: ffffffffffffffff x13: 0000000000000030 x12: 0101010101010101
>>>> [ 1228.218578] x11: 7f7f7f7f7f7f7f7f x10: 000000000000000a x9 : ffffd158252dd3fc
>>>> [ 1228.225701] x8 : ffff80008dd4f1c0 x7 : ffff07ffa0945468 x6 : ffff80008dd4f1c0
>>>> [ 1228.232823] x5 : 0000000000000018 x4 : 0000000000000000 x3 : 0000000000012c40
>>>> [ 1228.239945] x2 : 000000000000000d x1 : 000000000000000c x0 : 0000000000000000
>>>> [ 1228.247067] Call trace:
>>>> [ 1228.249500]  xas_split_alloc+0xf8/0x128
>>>> [ 1228.253324]  __filemap_add_folio+0x33c/0x4e0
>>>> [ 1228.257582]  filemap_add_folio+0x48/0xd0
>>>> [ 1228.261493]  page_cache_ra_order+0x214/0x310
>>>> [ 1228.265750]  ondemand_readahead+0x1a8/0x320
>>>> [ 1228.269921]  page_cache_async_ra+0x64/0xa8
>>>> [ 1228.274005]  filemap_fault+0x238/0xaa8
>>>> [ 1228.277742]  __xfs_filemap_fault+0x60/0x3c0 [xfs]
>>>> [ 1228.282491]  xfs_filemap_fault+0x54/0x68 [xfs]
>>
>> This is interesting.  This path has nothing to do with the hugetlbfs
>> filesystem you've created up above.  And, just to be clear, this is
>> on the host, not in the guest, right?
>>
> 
> Correct, the backtrce is seen on the host. The XFS file is used as backup
> memory to the guest. QEMU maps the entire file as PRIVATE and the VMA has
> been advised to huge page by madvise(MADV_HUGEPAGE). When the guest is
> started, QEMU calls madvise(MADV_POPULATE_WRITE) to populate the VMA. Since
> the VMA is private, there are copy-on-write page fault happening on
> calling to madvise(MADV_POPULATE_WRITE). In the page fault handler,
> there are readahead reuqests to be processed.
> 
> The backtrace, originating from WARN_ON(), is triggered when attempt to
> allocate a huge page fails in the middle of readahead. In this specific
> case, we're falling back to order-0 with attempt to modify the xarray
> for this. Unfortunately, it's reported this particular scenario isn't
> supported by xas_split_alloc().
> 
> 
>>>> [ 1228.377124] ------------[ cut here ]------------
>>>> [ 1228.381728] WARNING: CPU: 2 PID: 4496 at lib/xarray.c:1010
>>>> xas_split_alloc+0xf8/0x128
>>>> [ 1228.389546] Modules linked in: binfmt_misc vhost_net vhost
>>>> vhost_iotlb tap xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT
>>>> nf_reject_ipv4 nft_compat nft_chain_nat nf_nat nf_conntrack
>>>> nf_defrag_ipv6 nf_defrag_ipv4 nf_tables nfnetlink tun bridge stp llc
>>>> qrtr rfkill sunrpc vfat fat acpi_ipmi ipmi_ssif arm_spe_pmu
>>>> ipmi_devintf arm_cmn arm_dmc620_pmu ipmi_msghandler cppc_cpufreq
>>>> arm_dsu_pmu xfs libcrc32c ast drm_shmem_helper drm_kms_helper drm
>>>> crct10dif_ce ghash_ce igb nvme sha2_ce nvme_core sha256_arm64 sha1_ce
>>>> i2c_designware_platform sbsa_gwdt nvme_common i2c_algo_bit
>>>> i2c_designware_core xgene_hwmon dm_mirror dm_region_hash dm_log dm_mod
>>>> fuse
>>>> [ 1228.446348] CPU: 2 PID: 4496 Comm: qemu-kvm Kdump: loaded Tainted:
>>>> G        W          6.6.0-rc2-zhenyzha+ #5
>>>> [ 1228.456248] Hardware name: GIGABYTE R152-P31-00/MP32-AR1-00, BIOS
>>>> F31h (SCP: 2.10.20220810) 07/27/2022
>>>> [ 1228.465538] pstate: 80400009 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>>>> [ 1228.472486] pc : xas_split_alloc+0xf8/0x128
>>>> [ 1228.476656] lr : __filemap_add_folio+0x33c/0x4e0
>>>> [ 1228.481261] sp : ffff80008dd4f1c0
>>>> [ 1228.484563] x29: ffff80008dd4f1c0 x28: ffffd15825388c40 x27: 0000000000000001
>>>> [ 1228.491685] x26: 0000000000000001 x25: ffffffffffffc005 x24: 0000000000000000
>>>> [ 1228.498807] x23: ffff80008dd4f270 x22: ffffffc202b00000 x21: 0000000000000000
>>>> [ 1228.505930] x20: ffffffc2007f9600 x19: 000000000000000d x18: 0000000000000014
>>>> [ 1228.513052] x17: 00000000b21b8a3f x16: 0000000013a8aa94 x15: ffffd15824625944
>>>> [ 1228.520174] x14: ffffffffffffffff x13: 0000000000000030 x12: 0101010101010101
>>>> [ 1228.527297] x11: 7f7f7f7f7f7f7f7f x10: 000000000000000a x9 : ffffd158252dd3fc
>>>> [ 1228.534419] x8 : ffff80008dd4f1c0 x7 : ffff07ffa0945468 x6 : ffff80008dd4f1c0
>>>> [ 1228.541542] x5 : 0000000000000018 x4 : 0000000000000000 x3 : 0000000000012c40
>>>> [ 1228.548664] x2 : 000000000000000d x1 : 000000000000000c x0 : 0000000000000000
>>>> [ 1228.555786] Call trace:
>>>> [ 1228.558220]  xas_split_alloc+0xf8/0x128
>>>> [ 1228.562043]  __filemap_add_folio+0x33c/0x4e0
>>>> [ 1228.566300]  filemap_add_folio+0x48/0xd0
>>>> [ 1228.570211]  page_cache_ra_order+0x214/0x310
>>>> [ 1228.574469]  ondemand_readahead+0x1a8/0x320
>>>> [ 1228.578639]  page_cache_async_ra+0x64/0xa8
>>>> [ 1228.582724]  filemap_fault+0x238/0xaa8
>>>> [ 1228.586460]  __xfs_filemap_fault+0x60/0x3c0 [xfs]
>>>> [ 1228.591210]  xfs_filemap_fault+0x54/0x68 [xfs]
>>>>
>>>>
>>>>
>>>> 3. The captured data by bpftrace
>>>> (The following part is the crawl analysis of gshan@redhat.com )
>>>> --------------------
>>>> pid:  4475    task: qemu-kvm
>>>> file: /mnt/tmp/qemu_back_mem.mem-machine_mem.OdGYet (deleted)
>>>>
>>>> -------------------- inode --------------------
>>>> i_flags:               0x0
>>>> i_ino:                 67333199
>>>> i_size:                32212254720
>>>>
>>>> ----------------- address_space ----------------
>>>> flags:                 040
>>>> invalidate_lock
>>>>     count:               256
>>>>     owner:               0xffff07fff6e759c1
>>>>       pid: 4496  task: qemu-kvm
>>>>     wait_list.next:      0xffff07ffa20422e0
>>>>     wait_list.prev:      0xffff07ffa20422e0
>>>>
>>>> -------------------- xarray --------------------
>>>> entry[0]:       0xffff080f7eda0002
>>>> shift:          18
>>>> offset:         0
>>>> count:          2
>>>> nr_values:      0
>>>> parent:         0x0
>>>> slots[00]:      0xffff07ffa094546a
>>>> slots[01]:      0xffff07ffa1b09b22
>>>>
>>>> entry[1]:       0xffff07ffa094546a
>>>> shift:          12
>>>> offset:         0
>>>> count:          20
>>>> nr_values:      0
>>>> parent:         0xffff080f7eda0000
>>>> slots[00]:      0xffffffc202880000
>>>> slots[01]:      0x2
>>>>
>>>> entry[2]:       0xffffffc202880000
>>>> shift:          104
>>>> offset:         128
>>>> count:          0
>>>> nr_values:      0
>>>> parent:         0xffffffc20304c888
>>>> slots[00]:      0xffff08009a960000
>>>> slots[01]:      0x2001ffffffff
>>>>
>>>> It seems the last xarray entry ("entry[2]") has been corrupted. "shift"
>>>> becomes 104 and "offset" becomes 128, which isn't reasonable.
>>
>> Um, no.  Whatever tool you're using doesn't understand how XArrays work.
>> Fortunately, I wrote xa_dump() which does.  entry[2] does not have bit
>> 1 set, so it is an entry, not a node.  You're dereferencing a pointer to
>> a folio as if it's a pointer to a node, so no wonder it looks corrupted
>> to you.  From this, we know that the folio is at least order-6, and it's
>> probably order-9 (because I bet this VMA has the VM_HUGEPAGE flag set,
>> and we're doing PMD-sized faults).
>>
> 
> Indeed, entry[2] is a entry instead of a node, deferencing a folio.
> bpftrace was used to dump the xarray. you're correct that the VMA has
> flag VM_HUGEPAGE, set by madvise(MADV_HUGEPAGE). The order returned by
> xas_get_order() is 13, passed to xas_split_alloc().
> 
> /*
>    * xas->xa_shift    = 0
>    * XA_CHUNK_SHIFT   = 6
>    * order            = 13      (512MB huge page size vs 64KB base page size)
>    */
> void xas_split_alloc(struct xa_state *xas, void *entry, unsigned int order,
>                   gfp_t gfp)
> {
>           unsigned int sibs = (1 << (order % XA_CHUNK_SHIFT)) - 1;
>           unsigned int mask = xas->xa_sibs;
> 
>           /* XXX: no support for splitting really large entries yet */
>           if (WARN_ON(xas->xa_shift + 2 * XA_CHUNK_SHIFT < order))
>                   goto nomem;
>           :
> }

Resurrecting this, because I just got aware of it.

I recall talking to Willy at some point about the problem of order-13 not
being fully supported by the pagecache right now (IIRC primiarly splitting,
which should not happen for hugetlb, which is why there it is not a
problem). And I think we discussed just blocking that for now.

So we are trying to split an order-13 entry, because we ended up
allcoating+mapping an order-13 folio previously.

That's where things got wrong, with the current limitations, maybe?

#define MAX_PAGECACHE_ORDER	HPAGE_PMD_ORDER

Which would translate to MAX_PAGECACHE_ORDER=13 on aarch64 with 64k.

Staring at xas_split_alloc:

	WARN_ON(xas->xa_shift + 2 * XA_CHUNK_SHIFT < order)

I suspect we don't really support THP on systems with CONFIG_BASE_SMALL.
So we can assume XA_CHUNK_SHIFT == 6.

I guess that the maximum order we support for splitting is 12? I got confused
trying to figure that out. ;)


diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index e37e16ebff7a..354cd4b7320f 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -352,9 +352,12 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
   * limit the maximum allocation order to PMD size.  I'm not aware of any
   * assumptions about maximum order if THP are disabled, but 8 seems like
   * a good order (that's 1MB if you're using 4kB pages)
+ *
+ * xas_split_alloc() does not support order-13 yet, so disable that for now,
+ * which implies no 512MB THP on arm64 with 64k.
   */
  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-#define MAX_PAGECACHE_ORDER    HPAGE_PMD_ORDER
+#define MAX_PAGECACHE_ORDER    min(HPAGE_PMD_ORDER,12)
  #else
  #define MAX_PAGECACHE_ORDER    8
  #endif


I think this does not apply to hugetlb because we never end up splitting
entries. But could this also apply to shmem + PMD THP?

-- 
Cheers,

David / dhildenb


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: Endless calls to xas_split_alloc() due to corrupted xarray entry
  2024-06-19  9:45       ` David Hildenbrand
@ 2024-06-19 14:31         ` Matthew Wilcox
  2024-06-19 15:48           ` Linus Torvalds
  0 siblings, 1 reply; 11+ messages in thread
From: Matthew Wilcox @ 2024-06-19 14:31 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Gavin Shan, Bagas Sanjaya, Zhenyu Zhang, Linux XFS,
	Linux Filesystems Development, Linux Kernel Mailing List,
	Shaoqin Huang, Chandan Babu R, Darrick J. Wong, Andrew Morton,
	Linus Torvalds

On Wed, Jun 19, 2024 at 11:45:22AM +0200, David Hildenbrand wrote:
> I recall talking to Willy at some point about the problem of order-13 not
> being fully supported by the pagecache right now (IIRC primiarly splitting,
> which should not happen for hugetlb, which is why there it is not a
> problem). And I think we discussed just blocking that for now.
> 
> So we are trying to split an order-13 entry, because we ended up
> allcoating+mapping an order-13 folio previously.
> 
> That's where things got wrong, with the current limitations, maybe?
> 
> #define MAX_PAGECACHE_ORDER	HPAGE_PMD_ORDER
> 
> Which would translate to MAX_PAGECACHE_ORDER=13 on aarch64 with 64k.
> 
> Staring at xas_split_alloc:
> 
> 	WARN_ON(xas->xa_shift + 2 * XA_CHUNK_SHIFT < order)
> 
> I suspect we don't really support THP on systems with CONFIG_BASE_SMALL.
> So we can assume XA_CHUNK_SHIFT == 6.
> 
> I guess that the maximum order we support for splitting is 12? I got confused
> trying to figure that out. ;)

Actually, it's 11.  We can't split an order-12 folio because we'd have
to allocate two levels of radix tree, and I decided that was too much
work.  Also, I didn't know that ARM used order-13 PMD size at the time.

I think this is the best fix (modulo s/12/11/).  Zi Yan and I discussed
improving split_folio() so that it doesn't need to split the entire
folio to order-N.  But that's for the future, and this is the right fix
for now.

For the interested, when we say "I need to split", usually, we mean "I
need to split _this_ part of the folio to order-N", and we're quite
happy to leave the rest of the folio as intact as possible.  If we do
that, then splitting from order-13 to order-0 becomes quite a tractable
task, since we only need to allocate 2 radix tree nodes, not 65.

/**
 * folio_split - Split a smaller folio out of a larger folio.
 * @folio: The containing folio.
 * @page_nr: The page offset within the folio.
 * @order: The order of the folio to return.
 *
 * Splits a folio of order @order from the containing folio.
 * Will contain the page specified by @page_nr, but that page
 * may not be the first page in the returned folio.
 *
 * Context: Caller must hold a reference on @folio and has the folio
 * locked.  The returned folio will be locked and have an elevated
 * refcount; all other folios created by splitting the containing
 * folio will be unlocked and not have an elevated refcount.
 */
struct folio *folio_split(struct folio *folio, unsigned long page_nr,
		unsiged int order);


> I think this does not apply to hugetlb because we never end up splitting
> entries. But could this also apply to shmem + PMD THP?

Urgh, good point.  We need to make that fail on arm64 with 64KB page
size.  Fortunately, it almost always failed anyway; it's really hard to
allocate 512MB pages.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Endless calls to xas_split_alloc() due to corrupted xarray entry
  2024-06-19 14:31         ` Matthew Wilcox
@ 2024-06-19 15:48           ` Linus Torvalds
  2024-06-19 19:58             ` David Hildenbrand
  2024-06-19 20:50             ` Matthew Wilcox
  0 siblings, 2 replies; 11+ messages in thread
From: Linus Torvalds @ 2024-06-19 15:48 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: David Hildenbrand, Gavin Shan, Bagas Sanjaya, Zhenyu Zhang,
	Linux XFS, Linux Filesystems Development,
	Linux Kernel Mailing List, Shaoqin Huang, Chandan Babu R,
	Darrick J. Wong, Andrew Morton

On Wed, 19 Jun 2024 at 07:31, Matthew Wilcox <willy@infradead.org> wrote:
>
> Actually, it's 11.  We can't split an order-12 folio because we'd have
> to allocate two levels of radix tree, and I decided that was too much
> work.  Also, I didn't know that ARM used order-13 PMD size at the time.
>
> I think this is the best fix (modulo s/12/11/).

Can we use some more descriptive thing than the magic constant 11 that
is clearly very subtle.

Is it "XA_CHUNK_SHIFT * 2 - 1"

IOW, something like

   #define MAX_XAS_ORDER (XA_CHUNK_SHIFT * 2 - 1)
   #define MAX_PAGECACHE_ORDER min(HPAGE_PMD_ORDER,12)

except for the non-TRANSPARENT_HUGEPAGE case where it currently does

  #define MAX_PAGECACHE_ORDER    8

and I assume that "8" is just "random round value, smaller than 11"?

             Linus

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Endless calls to xas_split_alloc() due to corrupted xarray entry
  2024-06-19 15:48           ` Linus Torvalds
@ 2024-06-19 19:58             ` David Hildenbrand
  2024-06-25  1:10               ` Gavin Shan
  2024-06-19 20:50             ` Matthew Wilcox
  1 sibling, 1 reply; 11+ messages in thread
From: David Hildenbrand @ 2024-06-19 19:58 UTC (permalink / raw)
  To: Linus Torvalds, Matthew Wilcox
  Cc: Gavin Shan, Bagas Sanjaya, Zhenyu Zhang, Linux XFS,
	Linux Filesystems Development, Linux Kernel Mailing List,
	Shaoqin Huang, Chandan Babu R, Darrick J. Wong, Andrew Morton

On 19.06.24 17:48, Linus Torvalds wrote:
> On Wed, 19 Jun 2024 at 07:31, Matthew Wilcox <willy@infradead.org> wrote:
>>
>> Actually, it's 11.  We can't split an order-12 folio because we'd have
>> to allocate two levels of radix tree, and I decided that was too much
>> work.  Also, I didn't know that ARM used order-13 PMD size at the time.
>>
>> I think this is the best fix (modulo s/12/11/).
> 
> Can we use some more descriptive thing than the magic constant 11 that
> is clearly very subtle.
> 
> Is it "XA_CHUNK_SHIFT * 2 - 1"

That's my best guess as well :)

> 
> IOW, something like
> 
>     #define MAX_XAS_ORDER (XA_CHUNK_SHIFT * 2 - 1)
>     #define MAX_PAGECACHE_ORDER min(HPAGE_PMD_ORDER,12)
> 
> except for the non-TRANSPARENT_HUGEPAGE case where it currently does
> 
>    #define MAX_PAGECACHE_ORDER    8
> 
> and I assume that "8" is just "random round value, smaller than 11"?

Yes, that matches my understanding.

Maybe to be safe for !THP as well, something ike:

+++ b/include/linux/pagemap.h
@@ -354,11 +354,18 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
   * a good order (that's 1MB if you're using 4kB pages)
   */
  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-#define MAX_PAGECACHE_ORDER	HPAGE_PMD_ORDER
+#define WANTED_MAX_PAGECACHE_ORDER	HPAGE_PMD_ORDER
  #else
-#define MAX_PAGECACHE_ORDER	8
+#define WANTED_MAX_PAGECACHE_ORDER	8
  #endif
  
+/*
+ * xas_split_alloc() does not support arbitrary orders yet. This implies no
+ * 512MB THP on arm64 with 64k.
+ */
+#define MAX_XAS_ORDER		(XA_CHUNK_SHIFT * 2 - 1)
+#define MAX_PAGECACHE_ORDER	min(MAX_XAS_ORDER, WANTED_MAX_PAGECACHE_ORDER)
+
  /**
   * mapping_set_large_folios() - Indicate the file supports large folios.
   * @mapping: The file.
-- 
2.45.2


@Gavin, do you have capacity to test+prepare an official patch? Also,
please double-check whether shmem must be fenced as well (very likely).

This implies no PMD-sized THPs in the pagecache/shmem on arm64 with 64k.
Could be worse, because as Willy said, they are rather rare and extremely
unpredictable.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Endless calls to xas_split_alloc() due to corrupted xarray entry
  2024-06-19 19:58             ` David Hildenbrand
@ 2024-06-25  1:10               ` Gavin Shan
  0 siblings, 0 replies; 11+ messages in thread
From: Gavin Shan @ 2024-06-25  1:10 UTC (permalink / raw)
  To: David Hildenbrand, Linus Torvalds, Matthew Wilcox
  Cc: Bagas Sanjaya, Zhenyu Zhang, Linux XFS,
	Linux Filesystems Development, Linux Kernel Mailing List,
	Shaoqin Huang, Chandan Babu R, Darrick J. Wong, Andrew Morton

On 6/20/24 5:58 AM, David Hildenbrand wrote:
> On 19.06.24 17:48, Linus Torvalds wrote:
>> On Wed, 19 Jun 2024 at 07:31, Matthew Wilcox <willy@infradead.org> wrote:
>>>
>>> Actually, it's 11.  We can't split an order-12 folio because we'd have
>>> to allocate two levels of radix tree, and I decided that was too much
>>> work.  Also, I didn't know that ARM used order-13 PMD size at the time.
>>>
>>> I think this is the best fix (modulo s/12/11/).
>>
>> Can we use some more descriptive thing than the magic constant 11 that
>> is clearly very subtle.
>>
>> Is it "XA_CHUNK_SHIFT * 2 - 1"
> 
> That's my best guess as well :)
> 
>>
>> IOW, something like
>>
>>     #define MAX_XAS_ORDER (XA_CHUNK_SHIFT * 2 - 1)
>>     #define MAX_PAGECACHE_ORDER min(HPAGE_PMD_ORDER,12)
>>
>> except for the non-TRANSPARENT_HUGEPAGE case where it currently does
>>
>>    #define MAX_PAGECACHE_ORDER    8
>>
>> and I assume that "8" is just "random round value, smaller than 11"?
> 
> Yes, that matches my understanding.
> 
> Maybe to be safe for !THP as well, something ike:
> 
> +++ b/include/linux/pagemap.h
> @@ -354,11 +354,18 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
>    * a good order (that's 1MB if you're using 4kB pages)
>    */
>   #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> -#define MAX_PAGECACHE_ORDER    HPAGE_PMD_ORDER
> +#define WANTED_MAX_PAGECACHE_ORDER    HPAGE_PMD_ORDER
>   #else
> -#define MAX_PAGECACHE_ORDER    8
> +#define WANTED_MAX_PAGECACHE_ORDER    8
>   #endif
> 
> +/*
> + * xas_split_alloc() does not support arbitrary orders yet. This implies no
> + * 512MB THP on arm64 with 64k.
> + */
> +#define MAX_XAS_ORDER        (XA_CHUNK_SHIFT * 2 - 1)
> +#define MAX_PAGECACHE_ORDER    min(MAX_XAS_ORDER, WANTED_MAX_PAGECACHE_ORDER)
> +
>   /**
>    * mapping_set_large_folios() - Indicate the file supports large folios.
>    * @mapping: The file.

Thanks David. I'm checking if shmem needs the similar limitation and test patches.
I will post them for review once they're ready.

Thanks,
Gavin


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Endless calls to xas_split_alloc() due to corrupted xarray entry
  2024-06-19 15:48           ` Linus Torvalds
  2024-06-19 19:58             ` David Hildenbrand
@ 2024-06-19 20:50             ` Matthew Wilcox
  1 sibling, 0 replies; 11+ messages in thread
From: Matthew Wilcox @ 2024-06-19 20:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Hildenbrand, Gavin Shan, Bagas Sanjaya, Zhenyu Zhang,
	Linux XFS, Linux Filesystems Development,
	Linux Kernel Mailing List, Shaoqin Huang, Chandan Babu R,
	Darrick J. Wong, Andrew Morton

On Wed, Jun 19, 2024 at 08:48:28AM -0700, Linus Torvalds wrote:
> On Wed, 19 Jun 2024 at 07:31, Matthew Wilcox <willy@infradead.org> wrote:
> >
> > Actually, it's 11.  We can't split an order-12 folio because we'd have
> > to allocate two levels of radix tree, and I decided that was too much
> > work.  Also, I didn't know that ARM used order-13 PMD size at the time.
> >
> > I think this is the best fix (modulo s/12/11/).
> 
> Can we use some more descriptive thing than the magic constant 11 that
> is clearly very subtle.
> 
> Is it "XA_CHUNK_SHIFT * 2 - 1"
> 
> IOW, something like
> 
>    #define MAX_XAS_ORDER (XA_CHUNK_SHIFT * 2 - 1)
>    #define MAX_PAGECACHE_ORDER min(HPAGE_PMD_ORDER,12)
> 
> except for the non-TRANSPARENT_HUGEPAGE case where it currently does
> 
>   #define MAX_PAGECACHE_ORDER    8
> 
> and I assume that "8" is just "random round value, smaller than 11"?

It's actually documented:

/*
 * There are some parts of the kernel which assume that PMD entries
 * are exactly HPAGE_PMD_ORDER.  Those should be fixed, but until then,
 * limit the maximum allocation order to PMD size.  I'm not aware of any
 * assumptions about maximum order if THP are disabled, but 8 seems like
 * a good order (that's 1MB if you're using 4kB pages)
 */
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
#define MAX_PAGECACHE_ORDER     HPAGE_PMD_ORDER
#else
#define MAX_PAGECACHE_ORDER     8
#endif

although I'm not even sure if we use it if CONFIG_TRANSPARENT_HUGEPAGE
is disabled.  All the machinery to split pages is gated by
CONFIG_TRANSPARENT_HUGEPAGE, so I think we end up completely ignoring
it.  I used to say "somebody should do the work to split out
CONFIG_LARGE_FOLIOS from CONFIG_TRANSPARENT_HUGEPAGE", but now I think
that nobody cares about the architectures that don't support it,
and it's not worth anybody's time pretending that we do.

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2024-06-25  1:10 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <CAJFLiB+J4mKGDOppp=1moMe2aNqeJhM9F2cD4KPTXoM6nzb5RA@mail.gmail.com>
     [not found] ` <ZRFbIJH47RkQuDid@debian.me>
2023-09-25 15:12   ` Endless calls to xas_split_alloc() due to corrupted xarray entry Darrick J. Wong
2023-09-26  7:49     ` Zhenyu Zhang
2023-09-29 10:11       ` Gavin Shan
2023-09-29 19:17   ` Matthew Wilcox
2023-09-30  2:12     ` Gavin Shan
2024-06-19  9:45       ` David Hildenbrand
2024-06-19 14:31         ` Matthew Wilcox
2024-06-19 15:48           ` Linus Torvalds
2024-06-19 19:58             ` David Hildenbrand
2024-06-25  1:10               ` Gavin Shan
2024-06-19 20:50             ` Matthew Wilcox

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).