Linux-mm Archive on lore.kernel.org
From: "Arsen Arsenović" <aarsenovic@baylibre.com>
To: amd-gfx@lists.freedesktop.org, linux-mm@kvack.org
Cc: cs-tech-ext@baylibre.com
Subject: Re: [BUG] Frequent hangs or WARNINGs when using heterogeneous memory with an AMD MI210 GPU
Date: Wed, 29 Apr 2026 14:47:02 +0200
Message-ID: <86tssu0w8p.fsf@baylibre.com>
In-Reply-To: <86ecjz2hhr.fsf@baylibre.com>

Arsen Arsenović <aarsenovic@baylibre.com> writes:

> We get this by running the following OpenMP program built for offloading
> onto an AMD GPU:
>
>   https://gcc.gnu.org/cgit/gcc/tree/libgomp/testsuite/libgomp.c++/pr119692-1-4.C
>
> ... built by:
>
>   x86_64-none-linux-gnu-g++ pr119692-1-4.C -foffload=-march=gfx90a \
>     -Wl,-rpath,/opt/rocm/lib -fopenmp -O2 \
>     -DDEFAULT='defaultmap(firstprivate)' \
>     -lm -o ./pr119692-1-4.exe
>
> ... using trunk GCC configured for amdgcn-amdhsa offloading[1] and
> executed as:
>
>   timeout --verbose 10s env HSA_XNACK=1 LD_LIBRARY_PATH=. ./pr119692-1-4.exe
>
> ... when the timeout happens (i.e. the program gets stuck for 10 seconds
> and then, when 10 seconds pass, timeout sends a SIGTERM to a.out, and
> results in the crash above).

I've now confirmed that this specific issue also reproduces on bare
metal, and with kernel 7.0.2 and ROCm 7.2.2 (using the
rocm/dev-ubuntu-22.04:7.2.2 Docker image):

  [ 1171.959571] ------------[ cut here ]------------
  [ 1171.959577] WARNING: mm/memory.c:1753 at unmap_page_range+0x10d5/0x1bc0, CPU#247: pr119692-1-4.ex/143761
  [ 1171.959613] Modules linked in: xt_iprange xt_LOG nf_log_syslog xt_comment amdgpu amdxcp drm_ttm_helper ttm drm_exec drm_panel_backlight_quirks gpu_sched drm_suballoc_helper video drm_buddy drm_display_helper cec rc_core iptable_nat iptable_filter vhost_vsock vmw_vsock_virtio_transport_common vsock vhost vhost_iotlb nf_conntrack_netlink xt_nat veth vxlan ip6_udp_tunnel udp_tunnel xt_policy xt_mark xt_bpf xt_tcpudp br_netfilter xt_conntrack xt_MASQUERADE xfrm_user xfrm_algo xt_set ip_set nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_addrtype nft_compat nfsv3 nfs netfs overlay 8021q garp mrp bridge stp llc bonding tls nf_tables nfnetlink binfmt_misc nls_iso8859_1 intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd kvm_amd ipmi_ssif kvm irqbypass rapl wmi_bmof pcspkr ccp input_leds joydev mac_hid acpi_ipmi ptdma ipmi_si k10temp ipmi_devintf ipmi_msghandler nfsd auth_rpcgss nfs_acl lockd sch_fq_codel dm_multipath grace scsi_dh_rdac scsi_dh_emc scsi_dh_alua sunrpc msr efi_pstore ip_tables x_tables
  [ 1171.959847]  autofs4 btrfs libblake2b raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 raid0 hid_generic usbmouse igb bnxt_en ghash_clmulni_intel usbhid ast rndis_host ahci cdc_ether libahci dca usbnet hid i2c_algo_bit mii i2c_piix4 i2c_smbus wmi aesni_intel
  [ 1171.959939] CPU: 247 UID: 0 PID: 143761 Comm: pr119692-1-4.ex Not tainted 7.0.2-instinct-arsen #3 PREEMPT(lazy)
  [ 1171.959947] Hardware name: Supermicro AS -4124GS-TNR/H12DSG-O-CPU, BIOS 2.8 01/26/2024
  [ 1171.959951] RIP: 0010:unmap_page_range+0x10d5/0x1bc0
  [ 1171.959959] Code: 2e 2e 2e 31 c0 4c 39 b5 50 ff ff ff 0f 85 72 f2 ff ff e9 b1 fd ff ff 48 8b 45 90 48 8b 53 18 48 83 78 48 00 0f 84 28 f9 ff ff <0f> 0b e9 21 f9 ff ff a9 ff 0f 00 00 0f 85 cb fb ff ff 48 8b 10 83
  [ 1171.959964] RSP: 0018:ffffce40ffc87920 EFLAGS: 00010286
  [ 1171.959969] RAX: ffff8e18cb2ee900 RBX: fffff3333ffb6a00 RCX: 0000000000000000
  [ 1171.959973] RDX: ffff8e18de1b18c9 RSI: 0000000000000005 RDI: 0000000000000000
  [ 1171.959976] RBP: ffffce40ffc87a30 R08: 0000000000000000 R09: 0000000000000000
  [ 1171.959979] R10: 0000000000000000 R11: 0000000000000000 R12: ffffce40ffc87b90
  [ 1171.959983] R13: fffff3333ffb6a00 R14: 0000000000000001 R15: ffff8e18ba912018
  [ 1171.959986] FS:  0000000000000000(0000) GS:ffff8e57ac3da000(0000) knlGS:0000000000000000
  [ 1171.959990] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [ 1171.959994] CR2: 000070d717bfe920 CR3: 0000004169a48002 CR4: 0000000000f70ef0
  [ 1171.960000] PKRU: 55555554
  [ 1171.960004] Call Trace:
  [ 1171.960008]  <TASK>
  [ 1171.960022]  unmap_single_vma+0x96/0x110
  [ 1171.960031]  unmap_vmas+0xa5/0x180
  [ 1171.960041]  exit_mmap+0x13b/0x400
  [ 1171.960060]  __mmput+0x45/0x170
  [ 1171.960068]  mmput+0x31/0x40
  [ 1171.960074]  do_exit+0x285/0xad0
  [ 1171.960083]  do_group_exit+0x2d/0xb0
  [ 1171.960090]  get_signal+0x86a/0x930
  [ 1171.960099]  ? kfd_ioctl+0x4ad/0x5c0 [amdgpu]
  [ 1171.960563]  ? srso_alias_return_thunk+0x5/0xfbef5
  [ 1171.960570]  ? __x64_sys_ioctl+0xbd/0x100
  [ 1171.960580]  arch_do_signal_or_restart+0x3a/0x250
  [ 1171.960608]  exit_to_user_mode_loop+0x8f/0x500
  [ 1171.960618]  do_syscall_64+0x2cd/0x14b0
  [ 1171.960626]  ? srso_alias_return_thunk+0x5/0xfbef5
  [ 1171.960631]  ? handle_mm_fault+0x1e8/0x2f0
  [ 1171.960640]  ? srso_alias_return_thunk+0x5/0xfbef5
  [ 1171.960646]  ? do_user_addr_fault+0x2ee/0x830
  [ 1171.960655]  ? srso_alias_return_thunk+0x5/0xfbef5
  [ 1171.960660]  ? irqentry_exit+0xa5/0x600
  [ 1171.960670]  ? srso_alias_return_thunk+0x5/0xfbef5
  [ 1171.960676]  ? exc_page_fault+0x94/0x1e0
  [ 1171.960682]  ? ret_from_fork+0x1b2/0x3a0
  [ 1171.960691]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
  [ 1171.960697] RIP: 0033:0x70d718dab9cf
  [ 1171.960704] Code: Unable to access opcode bytes at 0x70d718dab9a5.
  [ 1171.960708] RSP: 002b:000070d717bfda90 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
  [ 1171.960716] RAX: fffffffffffffffc RBX: 0000000000000003 RCX: 000070d718dab9cf
  [ 1171.960720] RDX: 000070d717bfdb60 RSI: 00000000c0184b0c RDI: 0000000000000003
  [ 1171.960725] RBP: 00000000c0184b0c R08: 0000000040000001 R09: 000070d708000dd0
  [ 1171.960728] R10: 000070d71902bc68 R11: 0000000000000246 R12: 000070d717bfdc10
  [ 1171.960732] R13: 000070d717bfdb60 R14: 0000000031050b60 R15: 000070d708000dd0
  [ 1171.960741]  </TASK>
  [ 1171.960746] ---[ end trace 0000000000000000 ]---
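
For comparing traces across kernels or machines, the WARNING line above
can be picked apart mechanically. The following is only a minimal
sketch, assuming the line keeps the "file:line at func+off/size,
CPU#n: comm/pid" shape seen in this trace; WARN_RE and parse_warning
are hypothetical names and not part of any kernel tooling:

```python
import re

# Hypothetical helper: matches the WARNING line format seen in the trace
# above. Other kernel versions may print this line differently.
WARN_RE = re.compile(
    r"WARNING: (?P<file>\S+):(?P<line>\d+) at "
    r"(?P<func>[\w.]+)\+(?P<offset>0x[0-9a-f]+)/(?P<size>0x[0-9a-f]+), "
    r"CPU#(?P<cpu>\d+): (?P<comm>.+)/(?P<pid>\d+)\s*$"
)


def parse_warning(line):
    """Return the fields of a WARNING line as a dict, or None if no match."""
    m = WARN_RE.search(line)
    if m is None:
        return None
    fields = m.groupdict()
    # Source line, CPU number, and PID are decimal integers in the log.
    for key in ("line", "cpu", "pid"):
        fields[key] = int(fields[key])
    return fields


# The WARNING line copied from the trace above.
sample = (
    "[ 1171.959577] WARNING: mm/memory.c:1753 at "
    "unmap_page_range+0x10d5/0x1bc0, CPU#247: pr119692-1-4.ex/143761"
)
info = parse_warning(sample)
print(info)
```

Feeding each line of dmesg output through parse_warning and grouping on
(file, line, func) would show whether every run hits the same check.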

I'll also try the other testcase we had (omptests t-unified-* all
running in parallel) later.
-- 
Arsen Arsenović

