From: Alistair Popple <apopple@nvidia.com>
To: "Arsen Arsenović" <aarsenovic@baylibre.com>
Cc: amd-gfx@lists.freedesktop.org, linux-mm@kvack.org,
cs-tech-ext@baylibre.com
Subject: Re: [BUG] Frequent hangs or WARNINGs when using heterogeneous memory with an AMD MI210 GPU
Date: Fri, 1 May 2026 16:21:36 +1000
Message-ID: <afREaF6hcka_cxnY@nvdebian.thelocal>
In-Reply-To: <86tssu0w8p.fsf@baylibre.com>
On 2026-04-29 at 22:47 +1000, Arsen Arsenović <aarsenovic@baylibre.com> wrote...
> Arsen Arsenović <aarsenovic@baylibre.com> writes:
>
> > We get this by running the following OpenMP program built for offloading
> > onto an AMD GPU:
> >
> > https://gcc.gnu.org/cgit/gcc/tree/libgomp/testsuite/libgomp.c++/pr119692-1-4.C
> >
> > ... built by:
> >
> > x86_64-none-linux-gnu-g++ pr119692-1-4.C -foffload=-march=gfx90a \
> > -Wl,-rpath,/opt/rocm/lib -fopenmp -O2 \
> > -DDEFAULT='defaultmap(firstprivate)' \
> > -lm -o ./pr119692-1-4.exe
> >
> > ... using trunk GCC configured for amdgcn-amdhsa offloading[1] and
> > executed as:
> >
> > timeout --verbose 10s env HSA_XNACK=1 LD_LIBRARY_PATH=. ./pr119692-1-4.exe
> >
> > ... when the timeout happens (i.e. the program gets stuck for 10 seconds
> > and then, when 10 seconds pass, timeout sends a SIGTERM to a.out, and
> > results in the crash above).
>
> I've now confirmed that it is possible to reproduce this specific issue
> also on bare metal, also with kernel 7.0.2 and ROCm 7.2.2 (using the
> rocm/dev-ubuntu-22.04:7.2.2 Docker image):
>
> [ 1171.959571] ------------[ cut here ]------------
> [ 1171.959577] WARNING: mm/memory.c:1753 at unmap_page_range+0x10d5/0x1bc0, CPU#247: pr119692-1-4.ex/143761
I don't know the AMD driver well enough to comment definitively, but chances are
this warning is spurious. I have been meaning to put together a fix for it.
The problem is that migrate_vma_setup() etc. allow for migration of anonymous
folios, which is subtly different from only allowing migration of anonymous
VMAs.
Specifically, migrate_vma checks for folio_test_anon(), which returns true for
anonymous (COWed) folios in private file-backed VMAs, while the warning is based
on vma_is_anonymous(), which is false for such mappings. So it is possible for
the driver to migrate a page of a private file-backed mapping to GPU memory,
which will trigger this warning during teardown if the page wasn't migrated back.
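A minimal userspace sketch of how such a page can arise, assuming Linux and a
writable /tmp. The function name and scratch path are hypothetical and are not
taken from the reproducer or the driver. The sketch maps a file with MAP_PRIVATE
and writes to the mapping. The write is satisfied by a private copy of the page,
so the mapping sees the new bytes while the file keeps the old ones. That private
copy is the kind of anonymous page, inside a VMA that is not anonymous, that the
explanation above refers to.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map a temporary file MAP_PRIVATE and write to the mapping.
 * Returns 1 if the mapping saw the write while the file kept its
 * original contents, 0 if not, and -1 on a setup error.
 */
int private_file_mapping_cow(void)
{
    char path[] = "/tmp/cow-demo-XXXXXX";   /* hypothetical scratch file */
    char from_file[4] = {0};
    long page = sysconf(_SC_PAGESIZE);
    char *p;
    int ret = -1;
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);                            /* file lives only as long as fd */

    if (ftruncate(fd, page) != 0 || pwrite(fd, "file", 4, 0) != 4)
        goto out;

    /* A file-backed, private mapping: the VMA itself is not anonymous. */
    p = mmap(NULL, page, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED)
        goto out;

    /* The write triggers copy-on-write into a private page of this process. */
    memcpy(p, "anon", 4);

    /* Re-read the file through the fd: it must still hold the old bytes. */
    if (pread(fd, from_file, 4, 0) == 4)
        ret = memcmp(p, "anon", 4) == 0 &&
              memcmp(from_file, "file", 4) == 0;

    munmap(p, page);
out:
    close(fd);
    return ret;
}
```

A caller can invoke private_file_mapping_cow() from main() and treat a return
value of 1 as confirmation that the write landed in a private copy. Whether a
given driver actually migrates such a page to device memory is not something
this sketch shows.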
- Alistair
> [ 1171.959613] Modules linked in: xt_iprange xt_LOG nf_log_syslog xt_comment amdgpu amdxcp drm_ttm_helper ttm drm_exec drm_panel_backlight_quirks gpu_sched drm_suballoc_helper video drm_buddy drm_display_helper cec rc_core iptable_nat iptable_filter vhost_vsock vmw_vsock_virtio_transport_common vsock vhost vhost_iotlb nf_conntrack_netlink xt_nat veth vxlan ip6_udp_tunnel udp_tunnel xt_policy xt_mark xt_bpf xt_tcpudp br_netfilter xt_conntrack xt_MASQUERADE xfrm_user xfrm_algo xt_set ip_set nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_addrtype nft_compat nfsv3 nfs netfs overlay 8021q garp mrp bridge stp llc bonding tls nf_tables nfnetlink binfmt_misc nls_iso8859_1 intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd kvm_amd ipmi_ssif kvm irqbypass rapl wmi_bmof pcspkr ccp input_leds joydev mac_hid acpi_ipmi ptdma ipmi_si k10temp ipmi_devintf ipmi_msghandler nfsd auth_rpcgss nfs_acl lockd sch_fq_codel dm_multipath grace scsi_dh_rdac scsi_dh_emc scsi_dh_alua sunrpc msr efi_pstore ip_tables x_tables
> [ 1171.959847] autofs4 btrfs libblake2b raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 raid0 hid_generic usbmouse igb bnxt_en ghash_clmulni_intel usbhid ast rndis_host ahci cdc_ether libahci dca usbnet hid i2c_algo_bit mii i2c_piix4 i2c_smbus wmi aesni_intel
> [ 1171.959939] CPU: 247 UID: 0 PID: 143761 Comm: pr119692-1-4.ex Not tainted 7.0.2-instinct-arsen #3 PREEMPT(lazy)
> [ 1171.959947] Hardware name: Supermicro AS -4124GS-TNR/H12DSG-O-CPU, BIOS 2.8 01/26/2024
> [ 1171.959951] RIP: 0010:unmap_page_range+0x10d5/0x1bc0
> [ 1171.959959] Code: 2e 2e 2e 31 c0 4c 39 b5 50 ff ff ff 0f 85 72 f2 ff ff e9 b1 fd ff ff 48 8b 45 90 48 8b 53 18 48 83 78 48 00 0f 84 28 f9 ff ff <0f> 0b e9 21 f9 ff ff a9 ff 0f 00 00 0f 85 cb fb ff ff 48 8b 10 83
> [ 1171.959964] RSP: 0018:ffffce40ffc87920 EFLAGS: 00010286
> [ 1171.959969] RAX: ffff8e18cb2ee900 RBX: fffff3333ffb6a00 RCX: 0000000000000000
> [ 1171.959973] RDX: ffff8e18de1b18c9 RSI: 0000000000000005 RDI: 0000000000000000
> [ 1171.959976] RBP: ffffce40ffc87a30 R08: 0000000000000000 R09: 0000000000000000
> [ 1171.959979] R10: 0000000000000000 R11: 0000000000000000 R12: ffffce40ffc87b90
> [ 1171.959983] R13: fffff3333ffb6a00 R14: 0000000000000001 R15: ffff8e18ba912018
> [ 1171.959986] FS: 0000000000000000(0000) GS:ffff8e57ac3da000(0000) knlGS:0000000000000000
> [ 1171.959990] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1171.959994] CR2: 000070d717bfe920 CR3: 0000004169a48002 CR4: 0000000000f70ef0
> [ 1171.960000] PKRU: 55555554
> [ 1171.960004] Call Trace:
> [ 1171.960008] <TASK>
> [ 1171.960022] unmap_single_vma+0x96/0x110
> [ 1171.960031] unmap_vmas+0xa5/0x180
> [ 1171.960041] exit_mmap+0x13b/0x400
> [ 1171.960060] __mmput+0x45/0x170
> [ 1171.960068] mmput+0x31/0x40
> [ 1171.960074] do_exit+0x285/0xad0
> [ 1171.960083] do_group_exit+0x2d/0xb0
> [ 1171.960090] get_signal+0x86a/0x930
> [ 1171.960099] ? kfd_ioctl+0x4ad/0x5c0 [amdgpu]
> [ 1171.960563] ? srso_alias_return_thunk+0x5/0xfbef5
> [ 1171.960570] ? __x64_sys_ioctl+0xbd/0x100
> [ 1171.960580] arch_do_signal_or_restart+0x3a/0x250
> [ 1171.960608] exit_to_user_mode_loop+0x8f/0x500
> [ 1171.960618] do_syscall_64+0x2cd/0x14b0
> [ 1171.960626] ? srso_alias_return_thunk+0x5/0xfbef5
> [ 1171.960631] ? handle_mm_fault+0x1e8/0x2f0
> [ 1171.960640] ? srso_alias_return_thunk+0x5/0xfbef5
> [ 1171.960646] ? do_user_addr_fault+0x2ee/0x830
> [ 1171.960655] ? srso_alias_return_thunk+0x5/0xfbef5
> [ 1171.960660] ? irqentry_exit+0xa5/0x600
> [ 1171.960670] ? srso_alias_return_thunk+0x5/0xfbef5
> [ 1171.960676] ? exc_page_fault+0x94/0x1e0
> [ 1171.960682] ? ret_from_fork+0x1b2/0x3a0
> [ 1171.960691] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [ 1171.960697] RIP: 0033:0x70d718dab9cf
> [ 1171.960704] Code: Unable to access opcode bytes at 0x70d718dab9a5.
> [ 1171.960708] RSP: 002b:000070d717bfda90 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> [ 1171.960716] RAX: fffffffffffffffc RBX: 0000000000000003 RCX: 000070d718dab9cf
> [ 1171.960720] RDX: 000070d717bfdb60 RSI: 00000000c0184b0c RDI: 0000000000000003
> [ 1171.960725] RBP: 00000000c0184b0c R08: 0000000040000001 R09: 000070d708000dd0
> [ 1171.960728] R10: 000070d71902bc68 R11: 0000000000000246 R12: 000070d717bfdc10
> [ 1171.960732] R13: 000070d717bfdb60 R14: 0000000031050b60 R15: 000070d708000dd0
> [ 1171.960741] </TASK>
> [ 1171.960746] ---[ end trace 0000000000000000 ]---
>
> I'll try the other testcase we had (omptests t-unified-* all running in
> parallel) later also.
> --
> Arsen Arsenović