All of lore.kernel.org
 help / color / mirror / Atom feed
From: Sasha Levin <sashal@kernel.org>
To: patches@lists.linux.dev, stable@vger.kernel.org
Cc: Lijo Lazar <lijo.lazar@amd.com>,
	Mangesh Gadre <Mangesh.Gadre@amd.com>,
	Alex Deucher <alexander.deucher@amd.com>,
	Sasha Levin <sashal@kernel.org>,
	christian.koenig@amd.com, airlied@gmail.com, simona@ffwll.ch,
	amd-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
	linux-kernel@vger.kernel.org
Subject: [PATCH AUTOSEL 6.19-6.18] drm/amdgpu: Skip vcn poison irq release on VF
Date: Mon, 23 Feb 2026 07:37:14 -0500	[thread overview]
Message-ID: <20260223123738.1532940-9-sashal@kernel.org> (raw)
In-Reply-To: <20260223123738.1532940-1-sashal@kernel.org>

From: Lijo Lazar <lijo.lazar@amd.com>

[ Upstream commit 8980be03b3f9a4b58197ef95d3b37efa41a25331 ]

VF doesn't enable VCN poison irq in VCNv2.5. Skip releasing it and avoid
call trace during deinitialization.

[   71.913601] [drm] clean up the vf2pf work item
[   71.915088] ------------[ cut here ]------------
[   71.915092] WARNING: CPU: 3 PID: 1079 at /tmp/amd.aFkFvSQl/amd/amdgpu/amdgpu_irq.c:641 amdgpu_irq_put+0xc6/0xe0 [amdgpu]
[   71.915355] Modules linked in: amdgpu(OE-) amddrm_ttm_helper(OE) amdttm(OE) amddrm_buddy(OE) amdxcp(OE) amddrm_exec(OE) amd_sched(OE) amdkcl(OE) drm_suballoc_helper drm_display_helper cec rc_core i2c_algo_bit video wmi binfmt_misc nls_iso8859_1 intel_rapl_msr intel_rapl_common input_leds joydev serio_raw mac_hid qemu_fw_cfg sch_fq_codel dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 hid_generic crct10dif_pclmul crc32_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel usbhid 8139too sha256_ssse3 sha1_ssse3 hid psmouse bochs i2c_i801 ahci drm_vram_helper libahci i2c_smbus lpc_ich drm_ttm_helper 8139cp mii ttm aesni_intel crypto_simd cryptd
[   71.915484] CPU: 3 PID: 1079 Comm: rmmod Tainted: G           OE      6.8.0-87-generic #88~22.04.1-Ubuntu
[   71.915489] Hardware name: Red Hat KVM/RHEL, BIOS 1.16.3-2.el9_5.1 04/01/2014
[   71.915492] RIP: 0010:amdgpu_irq_put+0xc6/0xe0 [amdgpu]
[   71.915768] Code: 75 84 b8 ea ff ff ff eb d4 44 89 ea 48 89 de 4c 89 e7 e8 fd fc ff ff 5b 41 5c 41 5d 41 5e 5d 31 d2 31 f6 31 ff e9 55 30 3b c7 <0f> 0b eb d4 b8 fe ff ff ff eb a8 e9 b7 3b 8a 00 66 2e 0f 1f 84 00
[   71.915771] RSP: 0018:ffffcf0800eafa30 EFLAGS: 00010246
[   71.915775] RAX: 0000000000000000 RBX: ffff891bda4b0668 RCX: 0000000000000000
[   71.915777] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[   71.915779] RBP: ffffcf0800eafa50 R08: 0000000000000000 R09: 0000000000000000
[   71.915781] R10: 0000000000000000 R11: 0000000000000000 R12: ffff891bda480000
[   71.915782] R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000000
[   71.915792] FS:  000070cff87c4c40(0000) GS:ffff893abfb80000(0000) knlGS:0000000000000000
[   71.915795] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   71.915797] CR2: 00005fa13073e478 CR3: 000000010d634006 CR4: 0000000000770ef0
[   71.915800] PKRU: 55555554
[   71.915802] Call Trace:
[   71.915805]  <TASK>
[   71.915809]  vcn_v2_5_hw_fini+0x19e/0x1e0 [amdgpu]

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Mangesh Gadre <Mangesh.Gadre@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---

LLM Generated explanations, may be completely bogus:

Now I have a comprehensive understanding. Let me summarize the analysis.

## Analysis

### 1. Commit Message Analysis

The commit fixes a **WARNING/call trace during deinitialization** on AMD
GPU Virtual Functions (VF, i.e., SR-IOV guests). The commit message
includes the full stack trace showing the `WARN_ON` trigger at
`amdgpu_irq_put+0xc6/0xe0` called from `vcn_v2_5_hw_fini+0x19e/0x1e0`,
occurring during `rmmod` of the amdgpu driver.

### 2. Code Change Analysis

**The bug**: In `vcn_v2_5_hw_fini()`, the code unconditionally calls
`amdgpu_irq_put()` for `ras_poison_irq` whenever RAS is supported.
However, for SR-IOV VF (Virtual Function) environments,
`amdgpu_irq_get()` was **never called** on this IRQ source during
initialization. The IRQ enable path is in `amdgpu_vcn_ras_late_init()`
which runs through the RAS block late init — but for VFs, the RAS
interrupt operations are not fully initialized/enabled (as the comment
in the fix says: "VF doesn't enable interrupt operations for RAS").

When `amdgpu_irq_put()` is called on an IRQ that was never enabled
(refcount is 0), line 639 of `amdgpu_irq.c` triggers:
`WARN_ON(!amdgpu_irq_enabled(adev, src, type))`, causing the stack trace
shown in the commit message.

**The fix**: Adds `!amdgpu_sriov_vf(adev)` check before calling
`amdgpu_irq_put()`, so the IRQ release is skipped on VF — matching the
fact that it was never enabled on VF. This is a minimal 2-line change
(adding the VF check to the existing conditional).

### 3. Classification

This is a **bug fix** — it fixes a mismatched IRQ get/put that causes a
WARNING and call trace during driver deinitialization (rmmod) on SR-IOV
VF environments. The fix is:
- Obviously correct (symmetry between init/fini paths)
- Small and surgical (2-line change to an existing conditional)
- Fixes a real user-visible issue (WARNING + call trace during rmmod)
- No new features, no API changes

### 4. Scope and Risk

- **Very small change**: Only adds a VF check to an existing `if`
  condition
- **Low risk**: The change only affects SR-IOV VF environments, and it
  simply skips an operation that should never have run in that context
- **Single file**: Only `vcn_v2_5.c` is modified
- **Well-understood pattern**: Other VCN versions (e.g., vcn_v4_0.c)
  have similar structures, and the author (Lijo Lazar) is an AMD kernel
  developer familiar with the subsystem

### 5. User Impact

This affects AMD GPU users running in SR-IOV virtualized environments
(VMs using AMD GPU virtual functions). When they unload the amdgpu
driver (rmmod), they see a WARNING call trace in the kernel log. While
not a crash, this is a real operational annoyance and the WARN_ON could
trigger panic-on-warn systems.

### 6. Stability

- Reviewed by Mangesh Gadre (AMD engineer)
- Signed off by Alex Deucher (AMD GPU maintainer)
- The fix is logically straightforward — matching the fini path to what
  the init path actually does

### 7. Dependencies

None. The fix is self-contained and uses existing APIs
(`amdgpu_sriov_vf()`) that have been in the kernel for a long time. The
`vcn_v2_5.c` file and the relevant code paths exist in all supported
stable trees that include VCN 2.5 support.

## Verification

- Read `vcn_v2_5.c:506-529` (pre-fix code on current tree) confirming
  the `amdgpu_irq_put` call at line 524-525 lacks the VF check
- Read `amdgpu_irq.c:623-646` confirming the `WARN_ON` at line 639
  matches the stack trace
- Read `amdgpu_vcn.c:1286-1310` confirming `amdgpu_vcn_ras_late_init`
  calls `amdgpu_irq_get` for poison IRQ — but this may not run for VF
  environments
- Verified commit `8980be03b3f9a` exists in tree and matches the
  analyzed patch
- Searched for similar patterns (`sriov_vf` + `ras_poison_irq`) — no
  other instance found, confirming this is the only VCN version patched
  (VCN 2.5 specific)
- The `vcn_v4_0.c` hw_fini was noted by the explore agent as having the
  same pattern (no VF check before irq_put) — this is potentially a
  separate issue for other VCN versions
- Could NOT verify which stable trees contain VCN 2.5 support
  specifically — unverified, but VCN 2.5 has been in the kernel since
  ~5.4-5.5 era

## Conclusion

This is a small, surgical bug fix that resolves a WARNING/call trace
during driver deinitialization on SR-IOV VF environments. It fixes an
asymmetry between the init and fini paths (IRQ never enabled on VF, but
unconditionally released). The fix is obviously correct, low risk, well-
reviewed, and impacts real users of AMD GPU virtualization. It meets all
stable kernel criteria.

**YES**

 drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c b/drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c
index cebee453871c1..006a154511971 100644
--- a/drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c
+++ b/drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c
@@ -521,7 +521,9 @@ static int vcn_v2_5_hw_fini(struct amdgpu_ip_block *ip_block)
 		     RREG32_SOC15(VCN, i, mmUVD_STATUS)))
 			vinst->set_pg_state(vinst, AMD_PG_STATE_GATE);
 
-		if (amdgpu_ras_is_supported(adev, AMDGPU_RAS_BLOCK__VCN))
+		/* VF doesn't enable interrupt operations for RAS */
+		if (!amdgpu_sriov_vf(adev) &&
+		    amdgpu_ras_is_supported(adev, AMDGPU_RAS_BLOCK__VCN))
 			amdgpu_irq_put(adev, &vinst->ras_poison_irq, 0);
 	}
 
-- 
2.51.0


  parent reply	other threads:[~2026-02-23 12:37 UTC|newest]

Thread overview: 29+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-02-23 12:37 [PATCH AUTOSEL 6.19-6.1] drm/amd/display: Remove conditional for shaper 3DLUT power-on Sasha Levin
2026-02-23 12:37 ` [PATCH AUTOSEL 6.19-6.18] ASoC: rt721-sdca: Fix issue of fail to detect OMTP jack type Sasha Levin
2026-02-23 12:37 ` [PATCH AUTOSEL 6.19-6.18] ALSA: hda/tas2781: Ignore reset check for SPI device Sasha Levin
2026-02-23 12:37 ` [PATCH AUTOSEL 6.19-5.15] btrfs: replace BUG() with error handling in __btrfs_balance() Sasha Levin
2026-02-23 12:37 ` [PATCH AUTOSEL 6.19-5.15] ALSA: usb-audio: Add sanity check for OOB writes at silencing Sasha Levin
2026-02-23 12:37 ` [PATCH AUTOSEL 6.19-6.12] drm/amd/display: Fix system resume lag issue Sasha Levin
2026-02-23 12:37 ` [PATCH AUTOSEL 6.19-6.12] arm64: hugetlbpage: avoid unused-but-set-parameter warning (gcc-16) Sasha Levin
2026-02-23 12:37 ` [PATCH AUTOSEL 6.19-6.12] drm/amd/display: Fix writeback on DCN 3.2+ Sasha Levin
2026-02-23 12:37 ` Sasha Levin [this message]
2026-02-23 12:37 ` [PATCH AUTOSEL 6.19-6.18] drm/amdgpu: return when ras table checksum is error Sasha Levin
2026-02-23 12:37 ` [PATCH AUTOSEL 6.19-6.18] regulator: core: Remove regulator supply_name length limit Sasha Levin
2026-02-23 12:37 ` [PATCH AUTOSEL 6.19-5.10] ARM: 9467/1: mm: Don't use %pK through printk Sasha Levin
2026-02-23 12:37 ` [PATCH AUTOSEL 6.19-5.10] drm/radeon: Add HAINAN clock adjustment Sasha Levin
2026-02-23 12:37 ` [PATCH AUTOSEL 6.19-6.18] drm/amdgpu: avoid sdma ring reset in sriov Sasha Levin
2026-02-23 12:37 ` [PATCH AUTOSEL 6.19-6.12] spi: spidev: fix lock inversion between spi_lock and buf_lock Sasha Levin
2026-02-23 12:37 ` [PATCH AUTOSEL 6.19-5.15] drm/amdgpu: Adjust usleep_range in fence wait Sasha Levin
2026-02-23 12:37 ` [PATCH AUTOSEL 6.19] mshv: Ignore second stats page map result failure Sasha Levin
2026-02-23 12:37 ` [PATCH AUTOSEL 6.19] btrfs: do not ASSERT() when the fs flips RO inside btrfs_repair_io_failure() Sasha Levin
2026-02-23 12:37 ` [PATCH AUTOSEL 6.19-6.18] ALSA: hda/hdmi: Add quirk for TUXEDO IBS14G6 Sasha Levin
2026-02-23 12:37 ` [PATCH AUTOSEL 6.19] drm/amd/display: set enable_legacy_fast_update to false for DCN36 Sasha Levin
2026-02-23 12:37 ` [PATCH AUTOSEL 6.19] x86/hyperv: Move hv crash init after hypercall pg setup Sasha Levin
2026-02-23 12:37 ` [PATCH AUTOSEL 6.19-6.18] mshv: clear eventfd counter on irqfd shutdown Sasha Levin
2026-02-23 12:37 ` [PATCH AUTOSEL 6.19-5.10] drm/amd/display: Avoid updating surface with the same surface under MPO Sasha Levin
2026-02-23 12:37 ` [PATCH AUTOSEL 6.19-5.15] ALSA: usb-audio: Update the number of packets properly at receiving Sasha Levin
2026-02-23 12:37 ` [PATCH AUTOSEL 6.19-6.12] drm/amd/display: bypass post csc for additional color spaces in dal Sasha Levin
2026-02-23 12:37 ` [PATCH AUTOSEL 6.19-6.18] ASoC: amd: amd_sdw: add machine driver quirk for Lenovo models Sasha Levin
2026-02-23 12:37 ` [PATCH AUTOSEL 6.19-6.18] ALSA: hda/realtek: Fix headset mic on ASUS Zenbook 14 UX3405MA Sasha Levin
2026-02-23 12:37 ` [PATCH AUTOSEL 6.19] Drivers: hv: vmbus: Use kthread for vmbus interrupts on PREEMPT_RT Sasha Levin
2026-02-23 12:37 ` [PATCH AUTOSEL 6.19-5.10] drm/amdgpu: Add HAINAN clock adjustment Sasha Levin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260223123738.1532940-9-sashal@kernel.org \
    --to=sashal@kernel.org \
    --cc=Mangesh.Gadre@amd.com \
    --cc=airlied@gmail.com \
    --cc=alexander.deucher@amd.com \
    --cc=amd-gfx@lists.freedesktop.org \
    --cc=christian.koenig@amd.com \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=lijo.lazar@amd.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=patches@lists.linux.dev \
    --cc=simona@ffwll.ch \
    --cc=stable@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.