From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: stable@vger.kernel.org
Cc: "Greg Kroah-Hartman" <gregkh@linuxfoundation.org>,
patches@lists.linux.dev,
"Christian König" <christian.koenig@amd.com>,
"Guchun Chen" <guchun.chen@amd.com>,
"Luben Tuikov" <luben.tuikov@amd.com>,
"Mario Limonciello" <mario.limonciello@amd.com>,
"Guilherme G. Piccoli" <gpiccoli@igalia.com>,
"Alex Deucher" <alexander.deucher@amd.com>
Subject: [PATCH 5.15 64/67] drm/amdgpu/fence: Fix oops due to non-matching drm_sched init/fini
Date: Mon, 13 Feb 2023 15:49:45 +0100
Message-ID: <20230213144735.441231203@linuxfoundation.org>
In-Reply-To: <20230213144732.336342050@linuxfoundation.org>
From: Guilherme G. Piccoli <gpiccoli@igalia.com>
commit 5ad7bbf3dba5c4a684338df1f285080f2588b535 upstream.
Currently amdgpu calls drm_sched_fini() from the fence driver sw fini
routine. That function is expected to be called only after the
respective init function, drm_sched_init(), has executed successfully.
We recently hit a driver probe failure on the Steam Deck, and
drm_sched_fini() was called even though its counterpart had never
been called, causing the following oops:
amdgpu: probe of 0000:04:00.0 failed with error -110
BUG: kernel NULL pointer dereference, address: 0000000000000090
PGD 0 P4D 0
Oops: 0002 [#1] PREEMPT SMP NOPTI
CPU: 0 PID: 609 Comm: systemd-udevd Not tainted 6.2.0-rc3-gpiccoli #338
Hardware name: Valve Jupiter/Jupiter, BIOS F7A0113 11/04/2022
RIP: 0010:drm_sched_fini+0x84/0xa0 [gpu_sched]
[...]
Call Trace:
<TASK>
amdgpu_fence_driver_sw_fini+0xc8/0xd0 [amdgpu]
amdgpu_device_fini_sw+0x2b/0x3b0 [amdgpu]
amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
devm_drm_dev_init_release+0x49/0x70
[...]
To prevent that, check whether the drm_sched was properly initialized
for a given ring before calling its fini counterpart.
Ideally we'd use sched.ready for that, since that field is set as the
last thing in drm_sched_init(). But amdgpu seems to "override" the
meaning of that field: in the above oops, for example, it was a GFX ring
causing the crash, and sched.ready was set to true in the ring init
routine regardless of the state of the DRM scheduler. Hence, we ended up
using sched.ops as per Christian's suggestion [0], and also removed the
no_scheduler check [1].
[0] https://lore.kernel.org/amd-gfx/984ee981-2906-0eaf-ccec-9f80975cb136@amd.com/
[1] https://lore.kernel.org/amd-gfx/cd0e2994-f85f-d837-609f-7056d5fb7231@amd.com/
Fixes: 067f44c8b459 ("drm/amdgpu: avoid over-handle of fence driver fini in s3 test (v2)")
Suggested-by: Christian König <christian.koenig@amd.com>
Cc: Guchun Chen <guchun.chen@amd.com>
Cc: Luben Tuikov <luben.tuikov@amd.com>
Cc: Mario Limonciello <mario.limonciello@amd.com>
Reviewed-by: Luben Tuikov <luben.tuikov@amd.com>
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
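The guard described in the commit message can be illustrated outside the
kernel with a small user-space sketch. Everything named mock_* below is
hypothetical and exists only for this illustration; it is not the amdgpu
or gpu_sched code. The sketch assumes, as the commit message states, that
a ring init routine may set the ready flag on its own, while the ops
pointer is assumed to become non-NULL only when the scheduler init runs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the DRM scheduler types; not kernel code. */
struct mock_sched_ops { int unused; };

struct mock_sched {
	const struct mock_sched_ops *ops; /* set only by mock_sched_init() */
	bool ready;                       /* may also be set by the driver */
	int fini_calls;                   /* bookkeeping for this sketch */
};

struct mock_ring {
	struct mock_sched sched;
};

static const struct mock_sched_ops mock_ops = { 0 };

/* Models a successful scheduler init: ops becomes non-NULL. */
static int mock_sched_init(struct mock_sched *sched)
{
	sched->ops = &mock_ops;
	sched->ready = true;
	return 0;
}

/* Models fini: only valid after init. The assert stands in for the oops. */
static void mock_sched_fini(struct mock_sched *sched)
{
	assert(sched->ops != NULL);
	sched->fini_calls++;
	sched->ready = false;
}

/* Models a ring init that flags the ring ready without a scheduler init. */
static void mock_ring_init(struct mock_ring *ring)
{
	ring->sched.ready = true;
}

/* Teardown guarded on ops rather than ready, mirroring the patched check. */
static void mock_ring_teardown(struct mock_ring *ring)
{
	if (ring->sched.ops)
		mock_sched_fini(&ring->sched);
}
```

With a zero-initialized mock_ring that only went through mock_ring_init(),
ready is true but ops is still NULL, so mock_ring_teardown() skips the
fini call. A guard on ready would have reached mock_sched_fini() and
tripped the assert.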
drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -579,7 +579,13 @@ void amdgpu_fence_driver_sw_fini(struct
 		if (!ring || !ring->fence_drv.initialized)
 			continue;
 
-		if (!ring->no_scheduler)
+		/*
+		 * Notice we check for sched.ops since there's some
+		 * override on the meaning of sched.ready by amdgpu.
+		 * The natural check would be sched.ready, which is
+		 * set as drm_sched_init() finishes...
+		 */
+		if (ring->sched.ops)
 			drm_sched_fini(&ring->sched);
 
 		for (j = 0; j <= ring->fence_drv.num_fences_mask; ++j)