Linux-mm Archive on lore.kernel.org
From: Alistair Popple <apopple@nvidia.com>
To: "Arsen Arsenović" <aarsenovic@baylibre.com>
Cc: amd-gfx@lists.freedesktop.org, linux-mm@kvack.org,
	 cs-tech-ext@baylibre.com
Subject: Re: [BUG] Frequent hangs or WARNINGs when using heterogeneous memory with an AMD MI210 GPU
Date: Fri, 1 May 2026 16:21:36 +1000
Message-ID: <afREaF6hcka_cxnY@nvdebian.thelocal>
In-Reply-To: <86tssu0w8p.fsf@baylibre.com>

On 2026-04-29 at 22:47 +1000, Arsen Arsenović <aarsenovic@baylibre.com> wrote...
> Arsen Arsenović <aarsenovic@baylibre.com> writes:
> 
> > We get this by running the following OpenMP program built for offloading
> > onto an AMD GPU:
> >
> >   https://gcc.gnu.org/cgit/gcc/tree/libgomp/testsuite/libgomp.c++/pr119692-1-4.C
> >
> > ... built by:
> >
> >   x86_64-none-linux-gnu-g++ pr119692-1-4.C -foffload=-march=gfx90a \
> >     -Wl,-rpath,/opt/rocm/lib -fopenmp -O2 \
> >     -DDEFAULT='defaultmap(firstprivate)' \
> >     -lm -o ./pr119692-1-4.exe
> >
> > ... using trunk GCC configured for amdgcn-amdhsa offloading[1] and
> > executed as:
> >
> >   timeout --verbose 10s env HSA_XNACK=1 LD_LIBRARY_PATH=. ./pr119692-1-4.exe
> >
> > ... when the timeout happens (i.e. the program gets stuck for 10 seconds
> > and then, when 10 seconds pass, timeout sends a SIGTERM to a.out, and
> > results in the crash above).
> 
> I've now confirmed that it is possible to reproduce this specific issue
> also on bare metal, also with kernel 7.0.2 and ROCm 7.2.2 (using the
> rocm/dev-ubuntu-22.04:7.2.2 Docker image):
> 
>   [ 1171.959571] ------------[ cut here ]------------
>   [ 1171.959577] WARNING: mm/memory.c:1753 at unmap_page_range+0x10d5/0x1bc0, CPU#247: pr119692-1-4.ex/143761

I don't know the AMD driver well enough to comment definitively, but chances are
this warning is spurious. I have been meaning to put together a fix for it.
The problem is that migrate_vma_setup() etc. allow for migration of anonymous
folios, which is subtly different from only allowing migration of anonymous
VMAs.

Specifically, migrate_vma checks for folio_test_anon(), which returns true for
copy-on-write folios in private file-backed VMAs, while the warning is based on
vma_is_anonymous(), which is false for such mappings. So it is possible for the
driver to migrate a page of a private file-backed mapping to GPU memory, which
will trigger this warning during teardown if the page wasn't migrated back.
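
For anyone wanting to see the kind of mapping I mean from user space, here is
a minimal sketch. It only shows how a private file-backed mapping ends up with
a private copy of a page after a write. It does not touch the GPU or
migrate_vma at all, and I am assuming (not asserting) that the reproducer has
a mapping of this kind. The helper name private_file_mapping_cow() is made up
for illustration.

```python
import mmap
import os
import tempfile


def private_file_mapping_cow(payload=b"original"):
    """Map a temp file MAP_PRIVATE, write through the mapping, and return
    (bytes seen through the mapping, bytes still in the file)."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, payload)
        # MAP_PRIVATE: the mapping is still backed by the file, but the first
        # write gives the process its own copy-on-write copy of the page.
        with mmap.mmap(fd, len(payload), flags=mmap.MAP_PRIVATE,
                       prot=mmap.PROT_READ | mmap.PROT_WRITE) as m:
            m[:1] = b"X"          # write fault, page is copied privately
            seen = bytes(m[:])    # read back through the mapping
        # Private writes are never written back, so the file is unchanged.
        on_disk = os.pread(fd, len(payload), 0)
        return seen, on_disk
    finally:
        os.close(fd)
        os.unlink(path)
```

Calling private_file_mapping_cow(b"original") returns the modified bytes as
seen through the mapping and the untouched bytes from the file. The written
page is the sort of page that sits in a file-backed VMA without being a page
cache page.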

 - Alistair

>   [ 1171.959613] Modules linked in: xt_iprange xt_LOG nf_log_syslog xt_comment amdgpu amdxcp drm_ttm_helper ttm drm_exec drm_panel_backlight_quirks gpu_sched drm_suballoc_helper video drm_buddy drm_display_helper cec rc_core iptable_nat iptable_filter vhost_vsock vmw_vsock_virtio_transport_common vsock vhost vhost_iotlb nf_conntrack_netlink xt_nat veth vxlan ip6_udp_tunnel udp_tunnel xt_policy xt_mark xt_bpf xt_tcpudp br_netfilter xt_conntrack xt_MASQUERADE xfrm_user xfrm_algo xt_set ip_set nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_addrtype nft_compat nfsv3 nfs netfs overlay 8021q garp mrp bridge stp llc bonding tls nf_tables nfnetlink binfmt_misc nls_iso8859_1 intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd kvm_amd ipmi_ssif kvm irqbypass rapl wmi_bmof pcspkr ccp input_leds joydev mac_hid acpi_ipmi ptdma ipmi_si k10temp ipmi_devintf ipmi_msghandler nfsd auth_rpcgss nfs_acl lockd sch_fq_codel dm_multipath grace scsi_dh_rdac scsi_dh_emc scsi_dh_alua sunrpc msr efi_pstore ip_tables x_tables
>   [ 1171.959847]  autofs4 btrfs libblake2b raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 raid0 hid_generic usbmouse igb bnxt_en ghash_clmulni_intel usbhid ast rndis_host ahci cdc_ether libahci dca usbnet hid i2c_algo_bit mii i2c_piix4 i2c_smbus wmi aesni_intel
>   [ 1171.959939] CPU: 247 UID: 0 PID: 143761 Comm: pr119692-1-4.ex Not tainted 7.0.2-instinct-arsen #3 PREEMPT(lazy)
>   [ 1171.959947] Hardware name: Supermicro AS -4124GS-TNR/H12DSG-O-CPU, BIOS 2.8 01/26/2024
>   [ 1171.959951] RIP: 0010:unmap_page_range+0x10d5/0x1bc0
>   [ 1171.959959] Code: 2e 2e 2e 31 c0 4c 39 b5 50 ff ff ff 0f 85 72 f2 ff ff e9 b1 fd ff ff 48 8b 45 90 48 8b 53 18 48 83 78 48 00 0f 84 28 f9 ff ff <0f> 0b e9 21 f9 ff ff a9 ff 0f 00 00 0f 85 cb fb ff ff 48 8b 10 83
>   [ 1171.959964] RSP: 0018:ffffce40ffc87920 EFLAGS: 00010286
>   [ 1171.959969] RAX: ffff8e18cb2ee900 RBX: fffff3333ffb6a00 RCX: 0000000000000000
>   [ 1171.959973] RDX: ffff8e18de1b18c9 RSI: 0000000000000005 RDI: 0000000000000000
>   [ 1171.959976] RBP: ffffce40ffc87a30 R08: 0000000000000000 R09: 0000000000000000
>   [ 1171.959979] R10: 0000000000000000 R11: 0000000000000000 R12: ffffce40ffc87b90
>   [ 1171.959983] R13: fffff3333ffb6a00 R14: 0000000000000001 R15: ffff8e18ba912018
>   [ 1171.959986] FS:  0000000000000000(0000) GS:ffff8e57ac3da000(0000) knlGS:0000000000000000
>   [ 1171.959990] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>   [ 1171.959994] CR2: 000070d717bfe920 CR3: 0000004169a48002 CR4: 0000000000f70ef0
>   [ 1171.960000] PKRU: 55555554
>   [ 1171.960004] Call Trace:
>   [ 1171.960008]  <TASK>
>   [ 1171.960022]  unmap_single_vma+0x96/0x110
>   [ 1171.960031]  unmap_vmas+0xa5/0x180
>   [ 1171.960041]  exit_mmap+0x13b/0x400
>   [ 1171.960060]  __mmput+0x45/0x170
>   [ 1171.960068]  mmput+0x31/0x40
>   [ 1171.960074]  do_exit+0x285/0xad0
>   [ 1171.960083]  do_group_exit+0x2d/0xb0
>   [ 1171.960090]  get_signal+0x86a/0x930
>   [ 1171.960099]  ? kfd_ioctl+0x4ad/0x5c0 [amdgpu]
>   [ 1171.960563]  ? srso_alias_return_thunk+0x5/0xfbef5
>   [ 1171.960570]  ? __x64_sys_ioctl+0xbd/0x100
>   [ 1171.960580]  arch_do_signal_or_restart+0x3a/0x250
>   [ 1171.960608]  exit_to_user_mode_loop+0x8f/0x500
>   [ 1171.960618]  do_syscall_64+0x2cd/0x14b0
>   [ 1171.960626]  ? srso_alias_return_thunk+0x5/0xfbef5
>   [ 1171.960631]  ? handle_mm_fault+0x1e8/0x2f0
>   [ 1171.960640]  ? srso_alias_return_thunk+0x5/0xfbef5
>   [ 1171.960646]  ? do_user_addr_fault+0x2ee/0x830
>   [ 1171.960655]  ? srso_alias_return_thunk+0x5/0xfbef5
>   [ 1171.960660]  ? irqentry_exit+0xa5/0x600
>   [ 1171.960670]  ? srso_alias_return_thunk+0x5/0xfbef5
>   [ 1171.960676]  ? exc_page_fault+0x94/0x1e0
>   [ 1171.960682]  ? ret_from_fork+0x1b2/0x3a0
>   [ 1171.960691]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
>   [ 1171.960697] RIP: 0033:0x70d718dab9cf
>   [ 1171.960704] Code: Unable to access opcode bytes at 0x70d718dab9a5.
>   [ 1171.960708] RSP: 002b:000070d717bfda90 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
>   [ 1171.960716] RAX: fffffffffffffffc RBX: 0000000000000003 RCX: 000070d718dab9cf
>   [ 1171.960720] RDX: 000070d717bfdb60 RSI: 00000000c0184b0c RDI: 0000000000000003
>   [ 1171.960725] RBP: 00000000c0184b0c R08: 0000000040000001 R09: 000070d708000dd0
>   [ 1171.960728] R10: 000070d71902bc68 R11: 0000000000000246 R12: 000070d717bfdc10
>   [ 1171.960732] R13: 000070d717bfdb60 R14: 0000000031050b60 R15: 000070d708000dd0
>   [ 1171.960741]  </TASK>
>   [ 1171.960746] ---[ end trace 0000000000000000 ]---
> 
> I'll try the other testcase we had (omptests t-unified-* all running in
> parallel) later also.
> -- 
> Arsen Arsenović





Thread overview: 4+ messages
2026-04-28 16:10 [BUG] Frequent hangs or WARNINGs when using heterogeneous memory with an AMD MI210 GPU Arsen Arsenović
2026-04-29 12:47 ` Arsen Arsenović
2026-05-01  6:21   ` Alistair Popple [this message]
2026-05-01 14:25     ` Arsen Arsenović
