* [PATCH 0/3] drm/nouveau: Fix & improve nouveau_fence_done()
@ 2025-04-10 9:24 Philipp Stanner
2025-04-10 9:24 ` [PATCH 1/3] drm/nouveau: Prevent signaled fences in pending list Philipp Stanner
` (3 more replies)
0 siblings, 4 replies; 23+ messages in thread
From: Philipp Stanner @ 2025-04-10 9:24 UTC (permalink / raw)
To: Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Sabrina Dubroca, Sumit Semwal, Christian König
Cc: dri-devel, nouveau, linux-kernel, netdev, linux-media,
linaro-mm-sig, Philipp Stanner
Contains two patches improving nouveau_fence_done(), and one addressing
an actual bug (race):
[ 39.848463] WARNING: CPU: 21 PID: 1734 at drivers/gpu/drm/nouveau/nouveau_fence.c:509 nouveau_fence_no_signaling+0xac/0xd0 [nouveau]
[ 39.848551] Modules linked in: snd_seq_dummy snd_hrtimer nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 rfkill ip_set nf_tables qrtr sunrpc snd_sof_pci_intel_tgl snd_sof_pci_intel_cnl snd_sof_intel_hda_generic snd_sof_pci snd_sof_xtensa_dsp snd_sof_intel_hda_common snd_soc_hdac_hda snd_sof_intel_hda snd_sof snd_sof_utils snd_soc_acpi_intel_match snd_soc_acpi snd_soc_acpi_intel_sdca_quirks snd_sof_intel_hda_mlink snd_soc_sdca snd_soc_avs snd_ctl_led snd_soc_hda_codec intel_rapl_msr snd_hda_codec_realtek snd_hda_ext_core intel_rapl_common snd_hda_codec_generic snd_soc_core snd_hda_scodec_component intel_uncore_frequency intel_uncore_frequency_common snd_hda_codec_hdmi intel_ifs snd_compress i10nm_edac skx_edac_common nfit snd_hda_intel snd_intel_dspcfg libnvdimm snd_hda_codec binfmt_misc snd_hwdep snd_hda_core snd_seq snd_seq_device dell_wmi
[ 39.848575] dell_pc x86_pkg_temp_thermal spi_nor platform_profile sparse_keymap intel_powerclamp dax_hmem snd_pcm cxl_acpi coretemp cxl_port iTCO_wdt mtd rapl intel_pmc_bxt pmt_telemetry cxl_core dell_wmi_sysman pmt_class iTCO_vendor_support snd_timer isst_if_mmio vfat intel_cstate dell_smbios dcdbas fat dell_wmi_ddv dell_smm_hwmon dell_wmi_descriptor firmware_attributes_class wmi_bmof intel_uncore einj pcspkr isst_if_mbox_pci atlantic snd isst_if_common intel_vsec e1000e macsec mei_me i2c_i801 spi_intel_pci soundcore i2c_smbus spi_intel mei joydev loop nfnetlink zram nouveau drm_ttm_helper ttm polyval_clmulni iaa_crypto gpu_sched polyval_generic rtsx_pci_sdmmc ghash_clmulni_intel i2c_algo_bit mmc_core drm_gpuvm sha512_ssse3 nvme drm_exec drm_display_helper sha256_ssse3 idxd sha1_ssse3 cec nvme_core idxd_bus rtsx_pci nvme_auth pinctrl_alderlake ip6_tables ip_tables fuse
[ 39.848603] CPU: 21 UID: 42 PID: 1734 Comm: gnome-shell Tainted: G W 6.14.0-rc4+ #11
[ 39.848605] Tainted: [W]=WARN
[ 39.848606] Hardware name: Dell Inc. Precision 7960 Tower/01G0M6, BIOS 2.7.0 12/17/2024
[ 39.848607] RIP: 0010:nouveau_fence_no_signaling+0xac/0xd0 [nouveau]
[ 39.848688] Code: db 74 17 48 8d 7b 38 b8 ff ff ff ff f0 0f c1 43 38 83 f8 01 74 29 85 c0 7e 17 31 c0 5b 5d c3 cc cc cc cc e8 76 b2 c5 f0 eb 96 <0f> 0b e9 67 ff ff ff be 03 00 00 00 e8 83 76 33 f1 31 c0 eb dd e8
[ 39.848690] RSP: 0018:ff1cc1ffc5c039f0 EFLAGS: 00010046
[ 39.848691] RAX: 0000000000000001 RBX: ff175a3b504da980 RCX: ff175a3b4801e008
[ 39.848692] RDX: ff175a3b43e7bad0 RSI: ffffffffc09d3fda RDI: ff175a3b504da980
[ 39.848693] RBP: ff175a3b504da9c0 R08: ffffffffc09e39df R09: 0000000000000001
[ 39.848694] R10: 0000000000000001 R11: 0000000000000000 R12: ff175a3b6d97de00
[ 39.848695] R13: 0000000000000246 R14: ff1cc1ffc5c03c60 R15: 0000000000000001
[ 39.848696] FS: 00007fc5477846c0(0000) GS:ff175a5a50280000(0000) knlGS:0000000000000000
[ 39.848698] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 39.848699] CR2: 000055cb7613d1a8 CR3: 000000012e5ce004 CR4: 0000000000f71ef0
[ 39.848700] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 39.848701] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ 39.848702] PKRU: 55555554
[ 39.848703] Call Trace:
[ 39.848704] <TASK>
[ 39.848705] ? nouveau_fence_no_signaling+0xac/0xd0 [nouveau]
[ 39.848782] ? __warn.cold+0x93/0xfa
[ 39.848785] ? nouveau_fence_no_signaling+0xac/0xd0 [nouveau]
[ 39.848861] ? report_bug+0xff/0x140
[ 39.848863] ? handle_bug+0x58/0x90
[ 39.848865] ? exc_invalid_op+0x17/0x70
[ 39.848866] ? asm_exc_invalid_op+0x1a/0x20
[ 39.848870] ? nouveau_fence_no_signaling+0xac/0xd0 [nouveau]
[ 39.848943] nouveau_fence_enable_signaling+0x32/0x80 [nouveau]
[ 39.849016] ? __pfx_nouveau_fence_cleanup_cb+0x10/0x10 [nouveau]
[ 39.849088] __dma_fence_enable_signaling+0x33/0xc0
[ 39.849090] dma_fence_add_callback+0x4b/0xd0
[ 39.849093] nouveau_fence_emit+0xa3/0x260 [nouveau]
[ 39.849166] nouveau_fence_new+0x7d/0xf0 [nouveau]
[ 39.849242] nouveau_gem_ioctl_pushbuf+0xe8f/0x1300 [nouveau]
[ 39.849338] ? __pfx_nouveau_gem_ioctl_pushbuf+0x10/0x10 [nouveau]
[ 39.849431] drm_ioctl_kernel+0xad/0x100
[ 39.849433] drm_ioctl+0x288/0x550
[ 39.849435] ? __pfx_nouveau_gem_ioctl_pushbuf+0x10/0x10 [nouveau]
[ 39.849526] nouveau_drm_ioctl+0x57/0xb0 [nouveau]
[ 39.849620] __x64_sys_ioctl+0x94/0xc0
[ 39.849621] do_syscall_64+0x82/0x160
[ 39.849623] ? drm_ioctl+0x2b7/0x550
[ 39.849625] ? __pfx_nouveau_gem_ioctl_pushbuf+0x10/0x10 [nouveau]
[ 39.849719] ? ktime_get_mono_fast_ns+0x38/0xd0
[ 39.849721] ? __pm_runtime_suspend+0x69/0xc0
[ 39.849724] ? syscall_exit_to_user_mode_prepare+0x15e/0x1a0
[ 39.849726] ? syscall_exit_to_user_mode+0x10/0x200
[ 39.849729] ? do_syscall_64+0x8e/0x160
[ 39.849730] ? exc_page_fault+0x7e/0x1a0
[ 39.849733] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 39.849735] RIP: 0033:0x7fc5576fe0ad
[ 39.849736] Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10 c7 45 b0 10 00 00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25 28 00 00 00
[ 39.849737] RSP: 002b:00007ffc002688a0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 39.849739] RAX: ffffffffffffffda RBX: 000055cb74e316c0 RCX: 00007fc5576fe0ad
[ 39.849740] RDX: 00007ffc00268960 RSI: 00000000c0406481 RDI: 000000000000000e
[ 39.849741] RBP: 00007ffc002688f0 R08: 0000000000000000 R09: 000055cb74e35560
[ 39.849742] R10: 0000000000000014 R11: 0000000000000246 R12: 00007ffc00268960
[ 39.849744] R13: 00000000c0406481 R14: 000000000000000e R15: 000055cb74e3cd10
[ 39.849746] </TASK>
[ 39.849746] ---[ end trace 0000000000000000 ]---
[ 39.849776] ------------[ cut here ]------------
This is the first WARN_ON() in dma_fence_set_error(), called by
nouveau_fence_context_kill().
It's rare, but it is a bug, or rather: the archetype of a race, since
(as Christian pointed out) nouveau_fence_update() later at some point
will remove the signaled fence (by signaling it again).
P.
Philipp Stanner (3):
drm/nouveau: Prevent signaled fences in pending list
drm/nouveau: Remove surplus if-branch
drm/nouveau: Add helper to check base fence
drivers/gpu/drm/nouveau/nouveau_fence.c | 32 ++++++++++++++-----------
1 file changed, 18 insertions(+), 14 deletions(-)
--
2.48.1
^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH 1/3] drm/nouveau: Prevent signaled fences in pending list
2025-04-10 9:24 [PATCH 0/3] drm/nouveau: Fix & improve nouveau_fence_done() Philipp Stanner
@ 2025-04-10 9:24 ` Philipp Stanner
2025-04-10 12:13 ` Christian König
2025-04-10 12:58 ` Christian König
2025-04-10 9:24 ` [PATCH 2/3] drm/nouveau: Remove surplus if-branch Philipp Stanner
` (2 subsequent siblings)
3 siblings, 2 replies; 23+ messages in thread
From: Philipp Stanner @ 2025-04-10 9:24 UTC (permalink / raw)
To: Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Sabrina Dubroca, Sumit Semwal, Christian König
Cc: dri-devel, nouveau, linux-kernel, netdev, linux-media,
linaro-mm-sig, Philipp Stanner, stable
Nouveau currently relies on the assumption that dma_fences will only
ever get signaled through nouveau_fence_signal(), which takes care of
removing a signaled fence from the list nouveau_fence_chan.pending.
This self-imposed rule is violated in nouveau_fence_done(), where
dma_fence_is_signaled() (somewhat surprisingly, considering its name)
can signal the fence without removing it from the list. This enables
accesses to already signaled fences through the list, which is a bug.
In particular, it can race with nouveau_fence_context_kill(), which
would then attempt to set an error code on an already signaled fence,
which is illegal.
In nouveau_fence_done(), the call to nouveau_fence_update() already
ensures that all ready fences get signaled. Thus, the signaling potentially
performed by dma_fence_is_signaled() is actually not necessary.
Replace the call to dma_fence_is_signaled() with
nouveau_fence_base_is_signaled().
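
The race described above can be modeled in plain user-space C. This is only a minimal sketch under stated assumptions: every model_* name is hypothetical, the structs are not the kernel's, and locking and the real list handling are left out. It shows a query that signals as a side effect leaving a signaled fence on the pending list, so that a later kill pass sets an error on it, whereas a pure flag read does not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for a fence; not the kernel's struct. */
struct model_fence {
	bool signaled;   /* models DMA_FENCE_FLAG_SIGNALED_BIT */
	bool hw_done;    /* models "the hardware has finished this fence" */
	bool on_pending; /* models membership in nouveau_fence_chan.pending */
	int error;       /* models the fence's error code */
};

/* Counts events that would trip the WARN_ON() in dma_fence_set_error(). */
static int model_warnings;

/* Models nouveau_fence_signal(): signals and unlinks in one step. */
static void model_fence_signal(struct model_fence *f)
{
	f->signaled = true;
	f->on_pending = false;
}

/*
 * Models dma_fence_is_signaled(): a query that may signal as a side
 * effect, without knowing anything about the driver's pending list.
 */
static bool model_is_signaled(struct model_fence *f)
{
	if (!f->signaled && f->hw_done)
		f->signaled = true;
	return f->signaled;
}

/* Models the test_bit() check: a pure read with no side effect. */
static bool model_flag_is_set(const struct model_fence *f)
{
	return f->signaled;
}

/*
 * Simplified model of nouveau_fence_context_kill(): sets an error on
 * every fence still on the pending list, then signals it.
 */
static void model_context_kill(struct model_fence *fences, size_t n, int error)
{
	assert(error < 0); /* error codes are negative errno values */
	for (size_t i = 0; i < n; i++) {
		struct model_fence *f = &fences[i];

		if (!f->on_pending)
			continue;
		if (f->signaled)
			model_warnings++; /* error set on a signaled fence */
		else
			f->error = error;
		model_fence_signal(f);
	}
}
```

In this model, calling model_is_signaled() on a fence whose hardware work is done and then running model_context_kill() increments model_warnings, while using model_flag_is_set() for the same query leaves the counter untouched.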
Cc: <stable@vger.kernel.org> # 4.10+, precise commit not to be determined
Signed-off-by: Philipp Stanner <phasta@kernel.org>
---
drivers/gpu/drm/nouveau/nouveau_fence.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/nouveau/nouveau_fence.c b/drivers/gpu/drm/nouveau/nouveau_fence.c
index 7cc84472cece..33535987d8ed 100644
--- a/drivers/gpu/drm/nouveau/nouveau_fence.c
+++ b/drivers/gpu/drm/nouveau/nouveau_fence.c
@@ -274,7 +274,7 @@ nouveau_fence_done(struct nouveau_fence *fence)
nvif_event_block(&fctx->event);
spin_unlock_irqrestore(&fctx->lock, flags);
}
- return dma_fence_is_signaled(&fence->base);
+ return test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->base.flags);
}
static long
--
2.48.1
^ permalink raw reply related [flat|nested] 23+ messages in thread
* [PATCH 2/3] drm/nouveau: Remove surplus if-branch
2025-04-10 9:24 [PATCH 0/3] drm/nouveau: Fix & improve nouveau_fence_done() Philipp Stanner
2025-04-10 9:24 ` [PATCH 1/3] drm/nouveau: Prevent signaled fences in pending list Philipp Stanner
@ 2025-04-10 9:24 ` Philipp Stanner
2025-04-10 12:15 ` Christian König
2025-04-10 9:24 ` [PATCH 3/3] drm/nouveau: Add helper to check base fence Philipp Stanner
2025-04-10 9:51 ` [PATCH 0/3] drm/nouveau: Fix & improve nouveau_fence_done() Philipp Stanner
3 siblings, 1 reply; 23+ messages in thread
From: Philipp Stanner @ 2025-04-10 9:24 UTC (permalink / raw)
To: Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Sabrina Dubroca, Sumit Semwal, Christian König
Cc: dri-devel, nouveau, linux-kernel, netdev, linux-media,
linaro-mm-sig, Philipp Stanner
nouveau_fence_done() contains an if-branch which checks for the
existence of either of two fence backend ops. Those two are the only
backend ops existing in Nouveau, however; and at least one backend ops
must be in use for the entire driver to be able to work. The if branch
is, therefore, surplus.
Remove the if-branch.
Signed-off-by: Philipp Stanner <phasta@kernel.org>
---
drivers/gpu/drm/nouveau/nouveau_fence.c | 24 +++++++++++-------------
1 file changed, 11 insertions(+), 13 deletions(-)
diff --git a/drivers/gpu/drm/nouveau/nouveau_fence.c b/drivers/gpu/drm/nouveau/nouveau_fence.c
index 33535987d8ed..db6f4494405c 100644
--- a/drivers/gpu/drm/nouveau/nouveau_fence.c
+++ b/drivers/gpu/drm/nouveau/nouveau_fence.c
@@ -259,21 +259,19 @@ nouveau_fence_emit(struct nouveau_fence *fence)
bool
nouveau_fence_done(struct nouveau_fence *fence)
{
- if (fence->base.ops == &nouveau_fence_ops_legacy ||
- fence->base.ops == &nouveau_fence_ops_uevent) {
- struct nouveau_fence_chan *fctx = nouveau_fctx(fence);
- struct nouveau_channel *chan;
- unsigned long flags;
+ struct nouveau_fence_chan *fctx = nouveau_fctx(fence);
+ struct nouveau_channel *chan;
+ unsigned long flags;
- if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->base.flags))
- return true;
+ if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->base.flags))
+ return true;
+
+ spin_lock_irqsave(&fctx->lock, flags);
+ chan = rcu_dereference_protected(fence->channel, lockdep_is_held(&fctx->lock));
+ if (chan && nouveau_fence_update(chan, fctx))
+ nvif_event_block(&fctx->event);
+ spin_unlock_irqrestore(&fctx->lock, flags);
- spin_lock_irqsave(&fctx->lock, flags);
- chan = rcu_dereference_protected(fence->channel, lockdep_is_held(&fctx->lock));
- if (chan && nouveau_fence_update(chan, fctx))
- nvif_event_block(&fctx->event);
- spin_unlock_irqrestore(&fctx->lock, flags);
- }
return test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->base.flags);
}
--
2.48.1
^ permalink raw reply related [flat|nested] 23+ messages in thread
* [PATCH 3/3] drm/nouveau: Add helper to check base fence
2025-04-10 9:24 [PATCH 0/3] drm/nouveau: Fix & improve nouveau_fence_done() Philipp Stanner
2025-04-10 9:24 ` [PATCH 1/3] drm/nouveau: Prevent signaled fences in pending list Philipp Stanner
2025-04-10 9:24 ` [PATCH 2/3] drm/nouveau: Remove surplus if-branch Philipp Stanner
@ 2025-04-10 9:24 ` Philipp Stanner
2025-04-10 9:51 ` [PATCH 0/3] drm/nouveau: Fix & improve nouveau_fence_done() Philipp Stanner
3 siblings, 0 replies; 23+ messages in thread
From: Philipp Stanner @ 2025-04-10 9:24 UTC (permalink / raw)
To: Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Sabrina Dubroca, Sumit Semwal, Christian König
Cc: dri-devel, nouveau, linux-kernel, netdev, linux-media,
linaro-mm-sig, Philipp Stanner
Nouveau, unfortunately, checks whether a dma_fence is already signaled
at various different places with, at times, different methods. In
nouveau_fence_update() it generally signals all fences the hardware is
done with by evaluating the sequence number. That mechanism then has no
way to tell the caller nouveau_fence_done() whether a particular fence
is actually signaled, which is why the internal bits of the dma_fence
get checked.
This can be made more readable by providing a new wrapper, which can
then later be helpful to solve an unrelated bug.
Add nouveau_fence_base_is_signaled().
Signed-off-by: Philipp Stanner <phasta@kernel.org>
---
drivers/gpu/drm/nouveau/nouveau_fence.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/nouveau/nouveau_fence.c b/drivers/gpu/drm/nouveau/nouveau_fence.c
index db6f4494405c..0d58a81b3402 100644
--- a/drivers/gpu/drm/nouveau/nouveau_fence.c
+++ b/drivers/gpu/drm/nouveau/nouveau_fence.c
@@ -256,6 +256,12 @@ nouveau_fence_emit(struct nouveau_fence *fence)
return ret;
}
+static inline bool
+nouveau_fence_base_is_signaled(struct nouveau_fence *fence)
+{
+ return test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->base.flags);
+}
+
bool
nouveau_fence_done(struct nouveau_fence *fence)
{
@@ -263,7 +269,7 @@ nouveau_fence_done(struct nouveau_fence *fence)
struct nouveau_channel *chan;
unsigned long flags;
- if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->base.flags))
+ if (nouveau_fence_base_is_signaled(fence))
return true;
spin_lock_irqsave(&fctx->lock, flags);
@@ -272,7 +278,7 @@ nouveau_fence_done(struct nouveau_fence *fence)
nvif_event_block(&fctx->event);
spin_unlock_irqrestore(&fctx->lock, flags);
- return test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->base.flags);
+ return nouveau_fence_base_is_signaled(fence);
}
static long
--
2.48.1
^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [PATCH 0/3] drm/nouveau: Fix & improve nouveau_fence_done()
2025-04-10 9:24 [PATCH 0/3] drm/nouveau: Fix & improve nouveau_fence_done() Philipp Stanner
` (2 preceding siblings ...)
2025-04-10 9:24 ` [PATCH 3/3] drm/nouveau: Add helper to check base fence Philipp Stanner
@ 2025-04-10 9:51 ` Philipp Stanner
2025-04-10 12:18 ` Christian König
3 siblings, 1 reply; 23+ messages in thread
From: Philipp Stanner @ 2025-04-10 9:51 UTC (permalink / raw)
To: Philipp Stanner, Lyude Paul, Danilo Krummrich, David Airlie,
Simona Vetter, Sabrina Dubroca, Sumit Semwal,
Christian König
Cc: dri-devel, nouveau, linux-kernel, netdev, linux-media,
linaro-mm-sig
On Thu, 2025-04-10 at 11:24 +0200, Philipp Stanner wrote:
> Contains two patches improving nouveau_fence_done(), and one
> addressing
> an actual bug (race):
Oops, that's the wrong calltrace. Here we go:
[ 85.791794] Call Trace:
[ 85.791796] <TASK>
[ 85.791797] ? nouveau_fence_context_kill (/home/imperator/linux/./include/linux/dma-fence.h:587 (discriminator 9) /home/imperator/linux/drivers/gpu/drm/nouveau/nouveau_fence.c:94 (discriminator 9)) nouveau
[ 85.791874] ? __warn.cold (/home/imperator/linux/kernel/panic.c:748)
[ 85.791878] ? nouveau_fence_context_kill (/home/imperator/linux/./include/linux/dma-fence.h:587 (discriminator 9) /home/imperator/linux/drivers/gpu/drm/nouveau/nouveau_fence.c:94 (discriminator 9)) nouveau
[ 85.791950] ? report_bug (/home/imperator/linux/lib/bug.c:180 /home/imperator/linux/lib/bug.c:219)
[ 85.791953] ? handle_bug (/home/imperator/linux/arch/x86/kernel/traps.c:260)
[ 85.791956] ? exc_invalid_op (/home/imperator/linux/arch/x86/kernel/traps.c:309 (discriminator 1))
[ 85.791957] ? asm_exc_invalid_op (/home/imperator/linux/./arch/x86/include/asm/idtentry.h:621)
[ 85.791960] ? nouveau_fence_context_kill (/home/imperator/linux/./include/linux/dma-fence.h:587 (discriminator 9) /home/imperator/linux/drivers/gpu/drm/nouveau/nouveau_fence.c:94 (discriminator 9)) nouveau
[ 85.792028] drm_sched_fini.cold (/home/imperator/linux/./include/trace/../../drivers/gpu/drm/scheduler/gpu_scheduler_trace.h:72 (discriminator 1)) gpu_sched
[ 85.792033] ? drm_sched_entity_kill.part.0 (/home/imperator/linux/drivers/gpu/drm/scheduler/sched_entity.c:243 (discriminator 2)) gpu_sched
[ 85.792037] nouveau_sched_destroy (/home/imperator/linux/drivers/gpu/drm/nouveau/nouveau_sched.c:509 /home/imperator/linux/drivers/gpu/drm/nouveau/nouveau_sched.c:518) nouveau
[ 85.792122] nouveau_abi16_chan_fini.isra.0 (/home/imperator/linux/drivers/gpu/drm/nouveau/nouveau_abi16.c:188) nouveau
[ 85.792191] nouveau_abi16_fini (/home/imperator/linux/drivers/gpu/drm/nouveau/nouveau_abi16.c:224 (discriminator 3)) nouveau
[ 85.792263] nouveau_drm_postclose (/home/imperator/linux/drivers/gpu/drm/nouveau/nouveau_drm.c:1240) nouveau
[ 85.792349] drm_file_free (/home/imperator/linux/drivers/gpu/drm/drm_file.c:255)
[ 85.792353] drm_release (/home/imperator/linux/./arch/x86/include/asm/atomic.h:67 (discriminator 1) /home/imperator/linux/./include/linux/atomic/atomic-arch-fallback.h:2278 (discriminator 1) /home/imperator/linux/./include/linux/atomic/atomic-instrumented.h:1384 (discriminator 1) /home/imperator/linux/drivers/gpu/drm/drm_file.c:428 (discriminator 1))
[ 85.792355] __fput (/home/imperator/linux/fs/file_table.c:464)
[ 85.792357] task_work_run (/home/imperator/linux/kernel/task_work.c:227)
[ 85.792360] do_exit (/home/imperator/linux/kernel/exit.c:939)
[ 85.792362] do_group_exit (/home/imperator/linux/kernel/exit.c:1069)
[ 85.792364] get_signal (/home/imperator/linux/kernel/signal.c:3036)
[ 85.792366] arch_do_signal_or_restart (/home/imperator/linux/./arch/x86/include/asm/syscall.h:38 /home/imperator/linux/arch/x86/kernel/signal.c:264 /home/imperator/linux/arch/x86/kernel/signal.c:339)
[ 85.792369] syscall_exit_to_user_mode (/home/imperator/linux/kernel/entry/common.c:113 /home/imperator/linux/./include/linux/entry-common.h:329 /home/imperator/linux/kernel/entry/common.c:207 /home/imperator/linux/kernel/entry/common.c:218)
[ 85.792372] do_syscall_64 (/home/imperator/linux/./arch/x86/include/asm/cpufeature.h:172 /home/imperator/linux/arch/x86/entry/common.c:98)
[ 85.792373] ? syscall_exit_to_user_mode_prepare (/home/imperator/linux/./include/linux/audit.h:357 /home/imperator/linux/kernel/entry/common.c:166 /home/imperator/linux/kernel/entry/common.c:200)
[ 85.792376] ? syscall_exit_to_user_mode (/home/imperator/linux/./arch/x86/include/asm/paravirt.h:686 /home/imperator/linux/./include/linux/entry-common.h:232 /home/imperator/linux/kernel/entry/common.c:206 /home/imperator/linux/kernel/entry/common.c:218)
[ 85.792377] ? do_syscall_64 (/home/imperator/linux/./arch/x86/include/asm/cpufeature.h:172 /home/imperator/linux/arch/x86/entry/common.c:98)
[ 85.792378] entry_SYSCALL_64_after_hwframe (/home/imperator/linux/arch/x86/entry/entry_64.S:130)
[ 85.792381] RIP: 0033:0x7ff950b6af70
[ 85.792383] Code: Unable to access opcode bytes at 0x7ff950b6af46.
objdump: '/tmp/tmp.sfPRl5k2te.o': No such file
Code starting with the faulting instruction
===========================================
[ 85.792383] RSP: 002b:00007ff93cdfb6f0 EFLAGS: 00000293 ORIG_RAX: 000000000000010f
[ 85.792385] RAX: fffffffffffffdfe RBX: 000055d386d61870 RCX: 00007ff950b6af70
[ 85.792386] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 00007ff928000b90
[ 85.792387] RBP: 00007ff93cdfb740 R08: 0000000000000008 R09: 0000000000000000
[ 85.792388] R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000001
[ 85.792388] R13: 0000000000000000 R14: 0000000000000000 R15: 00007ff951b10b40
[ 85.792390] </TASK>
[ 85.792391] ---[ end trace 0000000000000000 ]---
By the way, for reference:
I did try incorporating nouveau_fence_signal() into
nouveau_fence_update() and nouveau_fence_done().
This, however, would then cause a race with the list_del() in
nouveau_fence_no_signaling(), WARNing because of the list poison.
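
The list-poison effect mentioned here can be sketched with a small user-space list. This is an illustration under assumptions, not the kernel's list implementation: the model_* names and the two sentinel objects used as poison values are hypothetical, and the kernel reports the condition with a warning rather than a return value. The idea is that deleting an entry overwrites its pointers with poison, so a second delete of the same entry by a racing path can be detected.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical doubly linked list node, modeled on a circular list. */
struct model_list_head {
	struct model_list_head *next, *prev;
};

/* Sentinel objects whose addresses serve as poison values in this model. */
static struct model_list_head model_poison1, model_poison2;

static void model_list_init(struct model_list_head *head)
{
	head->next = head;
	head->prev = head;
}

/* Inserts entry right after head. */
static void model_list_add(struct model_list_head *entry,
			   struct model_list_head *head)
{
	assert(entry != NULL && head != NULL);
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

/*
 * Unlinks entry and poisons its pointers. Returns false, without touching
 * the list, if the entry was already deleted (its pointers are poisoned).
 */
static bool model_list_del(struct model_list_head *entry)
{
	if (entry->next == &model_poison1 || entry->prev == &model_poison2)
		return false;

	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	entry->next = &model_poison1;
	entry->prev = &model_poison2;
	return true;
}
```

If two code paths both believe they own the removal of the same fence, the second call to model_list_del() returns false here; that corresponds to the WARN described above.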
So the "solution" space is:
* A cleanup callback on the dma_fence.
* Keeping the current race or
* replacing it with another race with another function.
* Just preventing nouveau_fence_done() from signaling fences other
than through nouveau_fence_update/signal
The latter clearly seems like the cleanest solution to me. The alternative
would be a work-intensive rework of all the design flaws in
nouveau_fence.c
P.
>
> [ 39.848463] WARNING: CPU: 21 PID: 1734 at
> drivers/gpu/drm/nouveau/nouveau_fence.c:509
> nouveau_fence_no_signaling+0xac/0xd0 [nouveau]
> [ 39.848551] Modules linked in: snd_seq_dummy snd_hrtimer
> nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet
> nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_ine
> t nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat
> nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 rfkill ip_set
> nf_tables qrtr sunrpc snd_sof_pci_intel_
> tgl snd_sof_pci_intel_cnl snd_sof_intel_hda_generic snd_sof_pci
> snd_sof_xtensa_dsp snd_sof_intel_hda_common snd_soc_hdac_hda
> snd_sof_intel_hda snd_sof snd_sof_utils snd
> _soc_acpi_intel_match snd_soc_acpi snd_soc_acpi_intel_sdca_quirks
> snd_sof_intel_hda_mlink snd_soc_sdca snd_soc_avs snd_ctl_led
> snd_soc_hda_codec intel_rapl_msr snd_hda_
> codec_realtek snd_hda_ext_core intel_rapl_common
> snd_hda_codec_generic snd_soc_core snd_hda_scodec_component
> intel_uncore_frequency intel_uncore_frequency_common snd_hd
> a_codec_hdmi intel_ifs snd_compress i10nm_edac skx_edac_common nfit
> snd_hda_intel snd_intel_dspcfg libnvdimm snd_hda_codec binfmt_misc
> snd_hwdep snd_hda_core snd_seq sn
> d_seq_device dell_wmi
> [ 39.848575] dell_pc x86_pkg_temp_thermal spi_nor platform_profile
> sparse_keymap intel_powerclamp dax_hmem snd_pcm cxl_acpi coretemp
> cxl_port iTCO_wdt mtd rapl intel
> _pmc_bxt pmt_telemetry cxl_core dell_wmi_sysman pmt_class
> iTCO_vendor_support snd_timer isst_if_mmio vfat intel_cstate
> dell_smbios dcdbas fat dell_wmi_ddv dell_smm_hwmo
> n dell_wmi_descriptor firmware_attributes_class wmi_bmof intel_uncore
> einj pcspkr isst_if_mbox_pci atlantic snd isst_if_common intel_vsec
> e1000e macsec mei_me i2c_i801
> spi_intel_pci soundcore i2c_smbus spi_intel mei joydev loop nfnetlink
> zram nouveau drm_ttm_helper ttm polyval_clmulni iaa_crypto gpu_sched
> polyval_generic rtsx_pci_sdmm
> c ghash_clmulni_intel i2c_algo_bit mmc_core drm_gpuvm sha512_ssse3
> nvme drm_exec drm_display_helper sha256_ssse3 idxd sha1_ssse3 cec
> nvme_core idxd_bus rtsx_pci nvme_au
> th pinctrl_alderlake ip6_tables ip_tables fuse
> [ 39.848603] CPU: 21 UID: 42 PID: 1734 Comm: gnome-shell Tainted:
> G W 6.14.0-rc4+ #11
> [ 39.848605] Tainted: [W]=WARN
> [ 39.848606] Hardware name: Dell Inc. Precision 7960 Tower/01G0M6,
> BIOS 2.7.0 12/17/2024
> [ 39.848607] RIP: 0010:nouveau_fence_no_signaling+0xac/0xd0
> [nouveau]
> [ 39.848688] Code: db 74 17 48 8d 7b 38 b8 ff ff ff ff f0 0f c1 43
> 38 83 f8 01 74 29 85 c0 7e 17 31 c0 5b 5d c3 cc cc cc cc e8 76 b2 c5
> f0 eb 96 <0f> 0b e9 67 ff ff f
> f be 03 00 00 00 e8 83 76 33 f1 31 c0 eb dd e8
> [ 39.848690] RSP: 0018:ff1cc1ffc5c039f0 EFLAGS: 00010046
> [ 39.848691] RAX: 0000000000000001 RBX: ff175a3b504da980 RCX:
> ff175a3b4801e008
> [ 39.848692] RDX: ff175a3b43e7bad0 RSI: ffffffffc09d3fda RDI:
> ff175a3b504da980
> [ 39.848693] RBP: ff175a3b504da9c0 R08: ffffffffc09e39df R09:
> 0000000000000001
> [ 39.848694] R10: 0000000000000001 R11: 0000000000000000 R12:
> ff175a3b6d97de00
> [ 39.848695] R13: 0000000000000246 R14: ff1cc1ffc5c03c60 R15:
> 0000000000000001
> [ 39.848696] FS: 00007fc5477846c0(0000) GS:ff175a5a50280000(0000)
> knlGS:0000000000000000
> [ 39.848698] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 39.848699] CR2: 000055cb7613d1a8 CR3: 000000012e5ce004 CR4:
> 0000000000f71ef0
> [ 39.848700] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [ 39.848701] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7:
> 0000000000000400
> [ 39.848702] PKRU: 55555554
> [ 39.848703] Call Trace:
> [ 39.848704] <TASK>
> [ 39.848705] ? nouveau_fence_no_signaling+0xac/0xd0 [nouveau]
> [ 39.848782] ? __warn.cold+0x93/0xfa
> [ 39.848785] ? nouveau_fence_no_signaling+0xac/0xd0 [nouveau]
> [ 39.848861] ? report_bug+0xff/0x140
> [ 39.848863] ? handle_bug+0x58/0x90
> [ 39.848865] ? exc_invalid_op+0x17/0x70
> [ 39.848866] ? asm_exc_invalid_op+0x1a/0x20
> [ 39.848870] ? nouveau_fence_no_signaling+0xac/0xd0 [nouveau]
> [ 39.848943] nouveau_fence_enable_signaling+0x32/0x80 [nouveau]
> [ 39.849016] ? __pfx_nouveau_fence_cleanup_cb+0x10/0x10 [nouveau]
> [ 39.849088] __dma_fence_enable_signaling+0x33/0xc0
> [ 39.849090] dma_fence_add_callback+0x4b/0xd0
> [ 39.849093] nouveau_fence_emit+0xa3/0x260 [nouveau]
> [ 39.849166] nouveau_fence_new+0x7d/0xf0 [nouveau]
> [ 39.849242] nouveau_gem_ioctl_pushbuf+0xe8f/0x1300 [nouveau]
> [ 39.849338] ? __pfx_nouveau_gem_ioctl_pushbuf+0x10/0x10 [nouveau]
> [ 39.849431] drm_ioctl_kernel+0xad/0x100
> [ 39.849433] drm_ioctl+0x288/0x550
> [ 39.849435] ? __pfx_nouveau_gem_ioctl_pushbuf+0x10/0x10 [nouveau]
> [ 39.849526] nouveau_drm_ioctl+0x57/0xb0 [nouveau]
> [ 39.849620] __x64_sys_ioctl+0x94/0xc0
> [ 39.849621] do_syscall_64+0x82/0x160
> [ 39.849623] ? drm_ioctl+0x2b7/0x550
> [ 39.849625] ? __pfx_nouveau_gem_ioctl_pushbuf+0x10/0x10 [nouveau]
> [ 39.849719] ? ktime_get_mono_fast_ns+0x38/0xd0
> [ 39.849721] ? __pm_runtime_suspend+0x69/0xc0
> [ 39.849724] ? syscall_exit_to_user_mode_prepare+0x15e/0x1a0
> [ 39.849726] ? syscall_exit_to_user_mode+0x10/0x200
> [ 39.849729] ? do_syscall_64+0x8e/0x160
> [ 39.849730] ? exc_page_fault+0x7e/0x1a0
> [ 39.849733] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [ 39.849735] RIP: 0033:0x7fc5576fe0ad
> [ 39.849736] Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10
> c7 45 b0 10 00 00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00
> 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25 28
> 00 00 00
> [ 39.849737] RSP: 002b:00007ffc002688a0 EFLAGS: 00000246 ORIG_RAX:
> 0000000000000010
> [ 39.849739] RAX: ffffffffffffffda RBX: 000055cb74e316c0 RCX:
> 00007fc5576fe0ad
> [ 39.849740] RDX: 00007ffc00268960 RSI: 00000000c0406481 RDI:
> 000000000000000e
> [ 39.849741] RBP: 00007ffc002688f0 R08: 0000000000000000 R09:
> 000055cb74e35560
> [ 39.849742] R10: 0000000000000014 R11: 0000000000000246 R12:
> 00007ffc00268960
> [ 39.849744] R13: 00000000c0406481 R14: 000000000000000e R15:
> 000055cb74e3cd10
> [ 39.849746] </TASK>
> [ 39.849746] ---[ end trace 0000000000000000 ]---
> [ 39.849776] ------------[ cut here ]------------
>
>
> This is the first WARN_ON() in dma_fence_set_error(), called by
> nouveau_fence_context_kill().
>
> It's rare, but it is a bug, or rather: the archetype of a race, since
> (as Christian pointed out) nouveau_fence_update() later at some point
> will remove the signaled fence (by signaling it again).
>
>
> P.
>
>
> Philipp Stanner (3):
> drm/nouveau: Prevent signaled fences in pending list
> drm/nouveau: Remove surplus if-branch
> drm/nouveau: Add helper to check base fence
>
> drivers/gpu/drm/nouveau/nouveau_fence.c | 32 ++++++++++++++---------
> --
> 1 file changed, 18 insertions(+), 14 deletions(-)
>
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 1/3] drm/nouveau: Prevent signaled fences in pending list
2025-04-10 9:24 ` [PATCH 1/3] drm/nouveau: Prevent signaled fences in pending list Philipp Stanner
@ 2025-04-10 12:13 ` Christian König
2025-04-10 12:21 ` Danilo Krummrich
2025-04-10 12:58 ` Christian König
1 sibling, 1 reply; 23+ messages in thread
From: Christian König @ 2025-04-10 12:13 UTC (permalink / raw)
To: Philipp Stanner, Lyude Paul, Danilo Krummrich, David Airlie,
Simona Vetter, Sabrina Dubroca, Sumit Semwal
Cc: dri-devel, nouveau, linux-kernel, netdev, linux-media,
linaro-mm-sig, stable
On 10.04.25 at 11:24, Philipp Stanner wrote:
> Nouveau currently relies on the assumption that dma_fences will only
> ever get signaled through nouveau_fence_signal(), which takes care of
> removing a signaled fence from the list nouveau_fence_chan.pending.
>
> This self-imposed rule is violated in nouveau_fence_done(), where
> dma_fence_is_signaled() (somewhat surprisingly, considering its name)
> can signal the fence without removing it from the list. This enables
> accesses to already signaled fences through the list, which is a bug.
>
> In particular, it can race with nouveau_fence_context_kill(), which
> would then attempt to set an error code on an already signaled fence,
> which is illegal.
>
> In nouveau_fence_done(), the call to nouveau_fence_update() already
> ensures to signal all ready fences. Thus, the signaling potentially
> performed by dma_fence_is_signaled() is actually not necessary.
>
> Replace the call to dma_fence_is_signaled() with
> nouveau_fence_base_is_signaled().
>
> Cc: <stable@vger.kernel.org> # 4.10+, precise commit not to be determined
> Signed-off-by: Philipp Stanner <phasta@kernel.org>
> ---
> drivers/gpu/drm/nouveau/nouveau_fence.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/nouveau/nouveau_fence.c b/drivers/gpu/drm/nouveau/nouveau_fence.c
> index 7cc84472cece..33535987d8ed 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_fence.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_fence.c
> @@ -274,7 +274,7 @@ nouveau_fence_done(struct nouveau_fence *fence)
> nvif_event_block(&fctx->event);
> spin_unlock_irqrestore(&fctx->lock, flags);
> }
> - return dma_fence_is_signaled(&fence->base);
> + return test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->base.flags);
See the code above that:
if (fence->base.ops == &nouveau_fence_ops_legacy ||
fence->base.ops == &nouveau_fence_ops_uevent) {
....
Nouveau first tests whether it's one of its own fences and, if so, does some special handling, e.g. checking the fence status bits.
So this dma_fence_is_signaled() is for all non-nouveau fences, and for those, not touching the internal flags is perfectly correct as far as I can see.
Regards,
Christian.
> }
>
> static long
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 2/3] drm/nouveau: Remove surplus if-branch
2025-04-10 9:24 ` [PATCH 2/3] drm/nouveau: Remove surplus if-branch Philipp Stanner
@ 2025-04-10 12:15 ` Christian König
0 siblings, 0 replies; 23+ messages in thread
From: Christian König @ 2025-04-10 12:15 UTC (permalink / raw)
To: Philipp Stanner, Lyude Paul, Danilo Krummrich, David Airlie,
Simona Vetter, Sabrina Dubroca, Sumit Semwal
Cc: dri-devel, nouveau, linux-kernel, netdev, linux-media,
linaro-mm-sig
On 10.04.25 at 11:24, Philipp Stanner wrote:
> nouveau_fence_done() contains an if-branch which checks for the
> existence of either of two fence backend ops. Those two are the only
> backend ops existing in Nouveau, however; and at least one backend ops
> must be in use for the entire driver to be able to work. The if branch
> is, therefore, surplus.
>
> Remove the if-branch.
What happens here is that nouveau checks if the fence comes from itself or some external source.
So when you remove that check, you potentially call nouveau_fctx() illegally on a non-nouveau fence.
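
As a rough illustration of why such an ops check matters before touching driver-private data, here is a user-space sketch. All model_* names are hypothetical and the structs are not the kernel's; how nouveau_fctx() actually derives its context is not shown in this thread, so this only demonstrates the general pattern of guarding a container-style downcast with an ops comparison.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical ops table; its address identifies who created a fence. */
struct model_fence_ops {
	const char *name;
};

/* Hypothetical generic fence, shared between drivers. */
struct model_fence {
	const struct model_fence_ops *ops;
	bool signaled;
};

/* The two backends this model driver owns (names are illustrative). */
static const struct model_fence_ops model_ops_legacy = { "legacy" };
static const struct model_fence_ops model_ops_uevent = { "uevent" };

/* Driver-private fence that embeds the generic one. */
struct model_driver_fence {
	int channel_id;
	struct model_fence base;
};

/* True only for fences created by this model driver. */
static bool model_is_own_fence(const struct model_fence *f)
{
	assert(f != NULL);
	return f->ops == &model_ops_legacy || f->ops == &model_ops_uevent;
}

/*
 * Downcast from the embedded generic fence to the driver fence. Without
 * the ops check, a foreign fence would be reinterpreted as driver data.
 */
static struct model_driver_fence *model_to_driver_fence(struct model_fence *f)
{
	if (!model_is_own_fence(f))
		return NULL;
	return (struct model_driver_fence *)
		((char *)f - offsetof(struct model_driver_fence, base));
}
```

A caller can then fall back to a generic query for any fence where model_to_driver_fence() returns NULL, and use the driver-specific path only for its own fences.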
Regards,
Christian.
>
> Signed-off-by: Philipp Stanner <phasta@kernel.org>
> ---
> drivers/gpu/drm/nouveau/nouveau_fence.c | 24 +++++++++++-------------
> 1 file changed, 11 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/gpu/drm/nouveau/nouveau_fence.c b/drivers/gpu/drm/nouveau/nouveau_fence.c
> index 33535987d8ed..db6f4494405c 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_fence.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_fence.c
> @@ -259,21 +259,19 @@ nouveau_fence_emit(struct nouveau_fence *fence)
> bool
> nouveau_fence_done(struct nouveau_fence *fence)
> {
> - if (fence->base.ops == &nouveau_fence_ops_legacy ||
> - fence->base.ops == &nouveau_fence_ops_uevent) {
> - struct nouveau_fence_chan *fctx = nouveau_fctx(fence);
> - struct nouveau_channel *chan;
> - unsigned long flags;
> + struct nouveau_fence_chan *fctx = nouveau_fctx(fence);
> + struct nouveau_channel *chan;
> + unsigned long flags;
>
> - if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->base.flags))
> - return true;
> + if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->base.flags))
> + return true;
> +
> + spin_lock_irqsave(&fctx->lock, flags);
> + chan = rcu_dereference_protected(fence->channel, lockdep_is_held(&fctx->lock));
> + if (chan && nouveau_fence_update(chan, fctx))
> + nvif_event_block(&fctx->event);
> + spin_unlock_irqrestore(&fctx->lock, flags);
>
> - spin_lock_irqsave(&fctx->lock, flags);
> - chan = rcu_dereference_protected(fence->channel, lockdep_is_held(&fctx->lock));
> - if (chan && nouveau_fence_update(chan, fctx))
> - nvif_event_block(&fctx->event);
> - spin_unlock_irqrestore(&fctx->lock, flags);
> - }
> return test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->base.flags);
> }
>
* Re: [PATCH 0/3] drm/nouveau: Fix & improve nouveau_fence_done()
2025-04-10 9:51 ` [PATCH 0/3] drm/nouveau: Fix & improve nouveau_fence_done() Philipp Stanner
@ 2025-04-10 12:18 ` Christian König
2025-04-10 13:18 ` Philipp Stanner
0 siblings, 1 reply; 23+ messages in thread
From: Christian König @ 2025-04-10 12:18 UTC (permalink / raw)
To: phasta, Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Sabrina Dubroca, Sumit Semwal
Cc: dri-devel, nouveau, linux-kernel, netdev, linux-media,
linaro-mm-sig
Am 10.04.25 um 11:51 schrieb Philipp Stanner:
> On Thu, 2025-04-10 at 11:24 +0200, Philipp Stanner wrote:
>> Contains two patches improving nouveau_fence_done(), and one
>> addressing
>> an actual bug (race):
> Oops, that's the wrong calltrace. Here we go:
>
> [ 85.791794] Call Trace: [ 85.791796] <TASK> [ 85.791797] ? nouveau_fence_context_kill (/home/imperator/linux/./include/linux/dma-fence.h:587 (discriminator 9) /home/imperator/linux/drivers/gpu/drm/nouveau/nouveau_fence.c:94 (discriminator 9)) nouveau [ 85.791874] ? __warn.cold (/home/imperator/linux/kernel/panic.c:748) [ 85.791878] ? nouveau_fence_context_kill (/home/imperator/linux/./include/linux/dma-fence.h:587 (discriminator 9) /home/imperator/linux/drivers/gpu/drm/nouveau/nouveau_fence.c:94 (discriminator 9)) nouveau [ 85.791950] ? report_bug (/home/imperator/linux/lib/bug.c:180 /home/imperator/linux/lib/bug.c:219) [ 85.791953] ? handle_bug (/home/imperator/linux/arch/x86/kernel/traps.c:260) [ 85.791956] ? exc_invalid_op (/home/imperator/linux/arch/x86/kernel/traps.c:309 (discriminator 1)) [ 85.791957] ? asm_exc_invalid_op (/home/imperator/linux/./arch/x86/include/asm/idtentry.h:621) [ 85.791960] ? nouveau_fence_context_kill (/home/imperator/linux/./include/linux/dma-fence.h:587 (discriminator 9) /home/imperator/linux/drivers/gpu/drm/nouveau/nouveau_fence.c:94 (discriminator 9)) nouveau [ 85.792028] drm_sched_fini.cold (/home/imperator/linux/./include/trace/../../drivers/gpu/drm/scheduler/gpu_scheduler_trace.h:72 (discriminator 1)) gpu_sched [ 85.792033] ? 
drm_sched_entity_kill.part.0 (/home/imperator/linux/drivers/gpu/drm/scheduler/sched_entity.c:243 (discriminator 2)) gpu_sched [ 85.792037] nouveau_sched_destroy (/home/imperator/linux/drivers/gpu/drm/nouveau/nouveau_sched.c:509 /home/imperator/linux/drivers/gpu/drm/nouveau/nouveau_sched.c:518) nouveau [ 85.792122] nouveau_abi16_chan_fini.isra.0 (/home/imperator/linux/drivers/gpu/drm/nouveau/nouveau_abi16.c:188) nouveau [ 85.792191] nouveau_abi16_fini (/home/imperator/linux/drivers/gpu/drm/nouveau/nouveau_abi16.c:224 (discriminator 3)) nouveau [ 85.792263] nouveau_drm_postclose (/home/imperator/linux/drivers/gpu/drm/nouveau/nouveau_drm.c:1240) nouveau [ 85.792349] drm_file_free (/home/imperator/linux/drivers/gpu/drm/drm_file.c:255) [ 85.792353] drm_release (/home/imperator/linux/./arch/x86/include/asm/atomic.h:67 (discriminator 1) /home/imperator/linux/./include/linux/atomic/atomic-arch-fallback.h:2278 (discriminator 1) /home/imperator/linux/./include/linux/atomic/atomic-instrumented.h:1384 (discriminator 1) /home/imperator/linux/drivers/gpu/drm/drm_file.c:428 (discriminator 1)) [ 85.792355] __fput (/home/imperator/linux/fs/file_table.c:464) [ 85.792357] task_work_run (/home/imperator/linux/kernel/task_work.c:227) [ 85.792360] do_exit (/home/imperator/linux/kernel/exit.c:939) [ 85.792362] do_group_exit (/home/imperator/linux/kernel/exit.c:1069) [ 85.792364] get_signal (/home/imperator/linux/kernel/signal.c:3036) [ 85.792366] arch_do_signal_or_restart (/home/imperator/linux/./arch/x86/include/asm/syscall.h:38 /home/imperator/linux/arch/x86/kernel/signal.c:264 /home/imperator/linux/arch/x86/kernel/signal.c:339) [ 85.792369] syscall_exit_to_user_mode (/home/imperator/linux/kernel/entry/common.c:113 /home/imperator/linux/./include/linux/entry-common.h:329 /home/imperator/linux/kernel/entry/common.c:207 /home/imperator/linux/kernel/entry/common.c:218) [ 85.792372] do_syscall_64 (/home/imperator/linux/./arch/x86/include/asm/cpufeature.h:172 
/home/imperator/linux/arch/x86/entry/common.c:98) [ 85.792373] ? syscall_exit_to_user_mode_prepare (/home/imperator/linux/./include/linux/audit.h:357 /home/imperator/linux/kernel/entry/common.c:166 /home/imperator/linux/kernel/entry/common.c:200) [ 85.792376] ? syscall_exit_to_user_mode (/home/imperator/linux/./arch/x86/include/asm/paravirt.h:686 /home/imperator/linux/./include/linux/entry-common.h:232 /home/imperator/linux/kernel/entry/common.c:206 /home/imperator/linux/kernel/entry/common.c:218) [ 85.792377] ? do_syscall_64 (/home/imperator/linux/./arch/x86/include/asm/cpufeature.h:172 /home/imperator/linux/arch/x86/entry/common.c:98) [ 85.792378] entry_SYSCALL_64_after_hwframe (/home/imperator/linux/arch/x86/entry/entry_64.S:130) [ 85.792381] RIP: 0033:0x7ff950b6af70 [ 85.792383] Code: Unable to access opcode bytes at 0x7ff950b6af46. objdump: '/tmp/tmp.sfPRl5k2te.o': No such file Code starting with the faulting instruction =========================================== [ 85.792383] RSP: 002b:00007ff93cdfb6f0 EFLAGS: 00000293 ORIG_RAX: 000000000000010f [ 85.792385] RAX: fffffffffffffdfe RBX: 000055d386d61870 RCX: 00007ff950b6af70 [ 85.792386] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 00007ff928000b90 [ 85.792387] RBP: 00007ff93cdfb740 R08: 0000000000000008 R09: 0000000000000000 [ 85.792388] R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000001 [ 85.792388] R13: 0000000000000000 R14: 0000000000000000 R15: 00007ff951b10b40 [ 85.792390] </TASK> [ 85.792391] ---[ end trace 0000000000000000 ]---
I think I understand the problem now as well, but that backtrace is completely mangled in the mail.
It would be nice if you could send that out again.
Thanks,
Christian.
>
> By the way, for reference:
> I did try whether it could be done to have nouveau_fence_signal()
> incorporated into nouveau_fence_update() and nouveau_fence_done().
> This, however, would then cause a race with the list_del() in
> nouveau_fence_no_signaling(), WARNing because of the list poison.
>
> So the "solution" space is:
> * A cleanup callback on the dma_fence.
> * Keeping the current race or
> * replacing it with another race with another function.
> * Just preventing nouveau_fence_done() from signaling fences other
> than through nouveau_fence_update/signal
>
> The latter seems clearly like the cleanest solution to me. The alternative
> would be a work-intensive rework of all the misdesigns broken in
> nouveau_fence.c
>
>
> P.
>
>> [ 39.848463] WARNING: CPU: 21 PID: 1734 at
>> drivers/gpu/drm/nouveau/nouveau_fence.c:509
>> nouveau_fence_no_signaling+0xac/0xd0 [nouveau]
>> [ 39.848551] Modules linked in: snd_seq_dummy snd_hrtimer
>> nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet
>> nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_ine
>> t nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat
>> nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 rfkill ip_set
>> nf_tables qrtr sunrpc snd_sof_pci_intel_
>> tgl snd_sof_pci_intel_cnl snd_sof_intel_hda_generic snd_sof_pci
>> snd_sof_xtensa_dsp snd_sof_intel_hda_common snd_soc_hdac_hda
>> snd_sof_intel_hda snd_sof snd_sof_utils snd
>> _soc_acpi_intel_match snd_soc_acpi snd_soc_acpi_intel_sdca_quirks
>> snd_sof_intel_hda_mlink snd_soc_sdca snd_soc_avs snd_ctl_led
>> snd_soc_hda_codec intel_rapl_msr snd_hda_
>> codec_realtek snd_hda_ext_core intel_rapl_common
>> snd_hda_codec_generic snd_soc_core snd_hda_scodec_component
>> intel_uncore_frequency intel_uncore_frequency_common snd_hd
>> a_codec_hdmi intel_ifs snd_compress i10nm_edac skx_edac_common nfit
>> snd_hda_intel snd_intel_dspcfg libnvdimm snd_hda_codec binfmt_misc
>> snd_hwdep snd_hda_core snd_seq sn
>> d_seq_device dell_wmi
>> [ 39.848575] dell_pc x86_pkg_temp_thermal spi_nor platform_profile
>> sparse_keymap intel_powerclamp dax_hmem snd_pcm cxl_acpi coretemp
>> cxl_port iTCO_wdt mtd rapl intel
>> _pmc_bxt pmt_telemetry cxl_core dell_wmi_sysman pmt_class
>> iTCO_vendor_support snd_timer isst_if_mmio vfat intel_cstate
>> dell_smbios dcdbas fat dell_wmi_ddv dell_smm_hwmo
>> n dell_wmi_descriptor firmware_attributes_class wmi_bmof intel_uncore
>> einj pcspkr isst_if_mbox_pci atlantic snd isst_if_common intel_vsec
>> e1000e macsec mei_me i2c_i801
>> spi_intel_pci soundcore i2c_smbus spi_intel mei joydev loop nfnetlink
>> zram nouveau drm_ttm_helper ttm polyval_clmulni iaa_crypto gpu_sched
>> polyval_generic rtsx_pci_sdmm
>> c ghash_clmulni_intel i2c_algo_bit mmc_core drm_gpuvm sha512_ssse3
>> nvme drm_exec drm_display_helper sha256_ssse3 idxd sha1_ssse3 cec
>> nvme_core idxd_bus rtsx_pci nvme_au
>> th pinctrl_alderlake ip6_tables ip_tables fuse
>> [ 39.848603] CPU: 21 UID: 42 PID: 1734 Comm: gnome-shell Tainted:
>> G W 6.14.0-rc4+ #11
>> [ 39.848605] Tainted: [W]=WARN
>> [ 39.848606] Hardware name: Dell Inc. Precision 7960 Tower/01G0M6,
>> BIOS 2.7.0 12/17/2024
>> [ 39.848607] RIP: 0010:nouveau_fence_no_signaling+0xac/0xd0
>> [nouveau]
>> [ 39.848688] Code: db 74 17 48 8d 7b 38 b8 ff ff ff ff f0 0f c1 43
>> 38 83 f8 01 74 29 85 c0 7e 17 31 c0 5b 5d c3 cc cc cc cc e8 76 b2 c5
>> f0 eb 96 <0f> 0b e9 67 ff ff f
>> f be 03 00 00 00 e8 83 76 33 f1 31 c0 eb dd e8
>> [ 39.848690] RSP: 0018:ff1cc1ffc5c039f0 EFLAGS: 00010046
>> [ 39.848691] RAX: 0000000000000001 RBX: ff175a3b504da980 RCX:
>> ff175a3b4801e008
>> [ 39.848692] RDX: ff175a3b43e7bad0 RSI: ffffffffc09d3fda RDI:
>> ff175a3b504da980
>> [ 39.848693] RBP: ff175a3b504da9c0 R08: ffffffffc09e39df R09:
>> 0000000000000001
>> [ 39.848694] R10: 0000000000000001 R11: 0000000000000000 R12:
>> ff175a3b6d97de00
>> [ 39.848695] R13: 0000000000000246 R14: ff1cc1ffc5c03c60 R15:
>> 0000000000000001
>> [ 39.848696] FS: 00007fc5477846c0(0000) GS:ff175a5a50280000(0000)
>> knlGS:0000000000000000
>> [ 39.848698] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 39.848699] CR2: 000055cb7613d1a8 CR3: 000000012e5ce004 CR4:
>> 0000000000f71ef0
>> [ 39.848700] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
>> 0000000000000000
>> [ 39.848701] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7:
>> 0000000000000400
>> [ 39.848702] PKRU: 55555554
>> [ 39.848703] Call Trace:
>> [ 39.848704] <TASK>
>> [ 39.848705] ? nouveau_fence_no_signaling+0xac/0xd0 [nouveau]
>> [ 39.848782] ? __warn.cold+0x93/0xfa
>> [ 39.848785] ? nouveau_fence_no_signaling+0xac/0xd0 [nouveau]
>> [ 39.848861] ? report_bug+0xff/0x140
>> [ 39.848863] ? handle_bug+0x58/0x90
>> [ 39.848865] ? exc_invalid_op+0x17/0x70
>> [ 39.848866] ? asm_exc_invalid_op+0x1a/0x20
>> [ 39.848870] ? nouveau_fence_no_signaling+0xac/0xd0 [nouveau]
>> [ 39.848943] nouveau_fence_enable_signaling+0x32/0x80 [nouveau]
>> [ 39.849016] ? __pfx_nouveau_fence_cleanup_cb+0x10/0x10 [nouveau]
>> [ 39.849088] __dma_fence_enable_signaling+0x33/0xc0
>> [ 39.849090] dma_fence_add_callback+0x4b/0xd0
>> [ 39.849093] nouveau_fence_emit+0xa3/0x260 [nouveau]
>> [ 39.849166] nouveau_fence_new+0x7d/0xf0 [nouveau]
>> [ 39.849242] nouveau_gem_ioctl_pushbuf+0xe8f/0x1300 [nouveau]
>> [ 39.849338] ? __pfx_nouveau_gem_ioctl_pushbuf+0x10/0x10 [nouveau]
>> [ 39.849431] drm_ioctl_kernel+0xad/0x100
>> [ 39.849433] drm_ioctl+0x288/0x550
>> [ 39.849435] ? __pfx_nouveau_gem_ioctl_pushbuf+0x10/0x10 [nouveau]
>> [ 39.849526] nouveau_drm_ioctl+0x57/0xb0 [nouveau]
>> [ 39.849620] __x64_sys_ioctl+0x94/0xc0
>> [ 39.849621] do_syscall_64+0x82/0x160
>> [ 39.849623] ? drm_ioctl+0x2b7/0x550
>> [ 39.849625] ? __pfx_nouveau_gem_ioctl_pushbuf+0x10/0x10 [nouveau]
>> [ 39.849719] ? ktime_get_mono_fast_ns+0x38/0xd0
>> [ 39.849721] ? __pm_runtime_suspend+0x69/0xc0
>> [ 39.849724] ? syscall_exit_to_user_mode_prepare+0x15e/0x1a0
>> [ 39.849726] ? syscall_exit_to_user_mode+0x10/0x200
>> [ 39.849729] ? do_syscall_64+0x8e/0x160
>> [ 39.849730] ? exc_page_fault+0x7e/0x1a0
>> [ 39.849733] entry_SYSCALL_64_after_hwframe+0x76/0x7e
>> [ 39.849735] RIP: 0033:0x7fc5576fe0ad
>> [ 39.849736] Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10
>> c7 45 b0 10 00 00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00
>> 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25 28
>> 00 00 00
>> [ 39.849737] RSP: 002b:00007ffc002688a0 EFLAGS: 00000246 ORIG_RAX:
>> 0000000000000010
>> [ 39.849739] RAX: ffffffffffffffda RBX: 000055cb74e316c0 RCX:
>> 00007fc5576fe0ad
>> [ 39.849740] RDX: 00007ffc00268960 RSI: 00000000c0406481 RDI:
>> 000000000000000e
>> [ 39.849741] RBP: 00007ffc002688f0 R08: 0000000000000000 R09:
>> 000055cb74e35560
>> [ 39.849742] R10: 0000000000000014 R11: 0000000000000246 R12:
>> 00007ffc00268960
>> [ 39.849744] R13: 00000000c0406481 R14: 000000000000000e R15:
>> 000055cb74e3cd10
>> [ 39.849746] </TASK>
>> [ 39.849746] ---[ end trace 0000000000000000 ]---
>> [ 39.849776] ------------[ cut here ]------------
>>
>>
>> This is the first WARN_ON() in dma_fence_set_error(), called by
>> nouveau_fence_context_kill().
>>
>> It's rare, but it is a bug, or rather: the archetype of a race, since
>> (as Christian pointed out) nouveau_fence_update() later at some point
>> will remove the signaled fence (by signaling it again).
>>
>>
>> P.
>>
>>
>> Philipp Stanner (3):
>> drm/nouveau: Prevent signaled fences in pending list
>> drm/nouveau: Remove surplus if-branch
>> drm/nouveau: Add helper to check base fence
>>
>> drivers/gpu/drm/nouveau/nouveau_fence.c | 32 ++++++++++++++---------
>> --
>> 1 file changed, 18 insertions(+), 14 deletions(-)
>>
* Re: [PATCH 1/3] drm/nouveau: Prevent signaled fences in pending list
2025-04-10 12:13 ` Christian König
@ 2025-04-10 12:21 ` Danilo Krummrich
2025-04-10 12:42 ` Christian König
0 siblings, 1 reply; 23+ messages in thread
From: Danilo Krummrich @ 2025-04-10 12:21 UTC (permalink / raw)
To: Christian König
Cc: Philipp Stanner, Lyude Paul, David Airlie, Simona Vetter,
Sabrina Dubroca, Sumit Semwal, dri-devel, nouveau, linux-kernel,
netdev, linux-media, linaro-mm-sig, stable
On Thu, Apr 10, 2025 at 02:13:34PM +0200, Christian König wrote:
> Am 10.04.25 um 11:24 schrieb Philipp Stanner:
> > Nouveau currently relies on the assumption that dma_fences will only
> > ever get signaled through nouveau_fence_signal(), which takes care of
> > removing a signaled fence from the list nouveau_fence_chan.pending.
> >
> > This self-imposed rule is violated in nouveau_fence_done(), where
> > dma_fence_is_signaled() (somewhat surprisingly, considering its name)
> > can signal the fence without removing it from the list. This enables
> > accesses to already signaled fences through the list, which is a bug.
> >
> > In particular, it can race with nouveau_fence_context_kill(), which
> > would then attempt to set an error code on an already signaled fence,
> > which is illegal.
> >
> > In nouveau_fence_done(), the call to nouveau_fence_update() already
> > ensures to signal all ready fences. Thus, the signaling potentially
> > performed by dma_fence_is_signaled() is actually not necessary.
> >
> > Replace the call to dma_fence_is_signaled() with
> > nouveau_fence_base_is_signaled().
> >
> > Cc: <stable@vger.kernel.org> # 4.10+, precise commit not to be determined
> > Signed-off-by: Philipp Stanner <phasta@kernel.org>
> > ---
> > drivers/gpu/drm/nouveau/nouveau_fence.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/drivers/gpu/drm/nouveau/nouveau_fence.c b/drivers/gpu/drm/nouveau/nouveau_fence.c
> > index 7cc84472cece..33535987d8ed 100644
> > --- a/drivers/gpu/drm/nouveau/nouveau_fence.c
> > +++ b/drivers/gpu/drm/nouveau/nouveau_fence.c
> > @@ -274,7 +274,7 @@ nouveau_fence_done(struct nouveau_fence *fence)
> > nvif_event_block(&fctx->event);
> > spin_unlock_irqrestore(&fctx->lock, flags);
> > }
> > - return dma_fence_is_signaled(&fence->base);
> > + return test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->base.flags);
>
> See the code above that:
>
> if (fence->base.ops == &nouveau_fence_ops_legacy ||
> fence->base.ops == &nouveau_fence_ops_uevent) {
I think this check is a bit pointless given that fence is already a struct
nouveau_fence. :)
* Re: [PATCH 1/3] drm/nouveau: Prevent signaled fences in pending list
2025-04-10 12:21 ` Danilo Krummrich
@ 2025-04-10 12:42 ` Christian König
0 siblings, 0 replies; 23+ messages in thread
From: Christian König @ 2025-04-10 12:42 UTC (permalink / raw)
To: Danilo Krummrich
Cc: Philipp Stanner, Lyude Paul, David Airlie, Simona Vetter,
Sabrina Dubroca, Sumit Semwal, dri-devel, nouveau, linux-kernel,
netdev, linux-media, linaro-mm-sig, stable
Am 10.04.25 um 14:21 schrieb Danilo Krummrich:
> On Thu, Apr 10, 2025 at 02:13:34PM +0200, Christian König wrote:
>> Am 10.04.25 um 11:24 schrieb Philipp Stanner:
>>> Nouveau currently relies on the assumption that dma_fences will only
>>> ever get signaled through nouveau_fence_signal(), which takes care of
>>> removing a signaled fence from the list nouveau_fence_chan.pending.
>>>
>>> This self-imposed rule is violated in nouveau_fence_done(), where
>>> dma_fence_is_signaled() (somewhat surprisingly, considering its name)
>>> can signal the fence without removing it from the list. This enables
>>> accesses to already signaled fences through the list, which is a bug.
>>>
>>> In particular, it can race with nouveau_fence_context_kill(), which
>>> would then attempt to set an error code on an already signaled fence,
>>> which is illegal.
>>>
>>> In nouveau_fence_done(), the call to nouveau_fence_update() already
>>> ensures to signal all ready fences. Thus, the signaling potentially
>>> performed by dma_fence_is_signaled() is actually not necessary.
>>>
>>> Replace the call to dma_fence_is_signaled() with
>>> nouveau_fence_base_is_signaled().
>>>
>>> Cc: <stable@vger.kernel.org> # 4.10+, precise commit not to be determined
>>> Signed-off-by: Philipp Stanner <phasta@kernel.org>
>>> ---
>>> drivers/gpu/drm/nouveau/nouveau_fence.c | 2 +-
>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/nouveau/nouveau_fence.c b/drivers/gpu/drm/nouveau/nouveau_fence.c
>>> index 7cc84472cece..33535987d8ed 100644
>>> --- a/drivers/gpu/drm/nouveau/nouveau_fence.c
>>> +++ b/drivers/gpu/drm/nouveau/nouveau_fence.c
>>> @@ -274,7 +274,7 @@ nouveau_fence_done(struct nouveau_fence *fence)
>>> nvif_event_block(&fctx->event);
>>> spin_unlock_irqrestore(&fctx->lock, flags);
>>> }
>>> - return dma_fence_is_signaled(&fence->base);
>>> + return test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->base.flags);
>> See the code above that:
>>
>> if (fence->base.ops == &nouveau_fence_ops_legacy ||
>> fence->base.ops == &nouveau_fence_ops_uevent) {
> I think this check is a bit pointless given that fence is already a struct
> nouveau_fence. :)
Oh, good point. I totally missed that.
In this case that indeed doesn't make any sense at all.
(Unless somebody just blindly upcasted the structure, but I really hope that this isn't the case here).
Regards,
Christian.
* Re: [PATCH 1/3] drm/nouveau: Prevent signaled fences in pending list
2025-04-10 9:24 ` [PATCH 1/3] drm/nouveau: Prevent signaled fences in pending list Philipp Stanner
2025-04-10 12:13 ` Christian König
@ 2025-04-10 12:58 ` Christian König
2025-04-10 13:09 ` Philipp Stanner
1 sibling, 1 reply; 23+ messages in thread
From: Christian König @ 2025-04-10 12:58 UTC (permalink / raw)
To: Philipp Stanner, Lyude Paul, Danilo Krummrich, David Airlie,
Simona Vetter, Sabrina Dubroca, Sumit Semwal
Cc: dri-devel, nouveau, linux-kernel, netdev, linux-media,
linaro-mm-sig, stable
Am 10.04.25 um 11:24 schrieb Philipp Stanner:
> Nouveau currently relies on the assumption that dma_fences will only
> ever get signaled through nouveau_fence_signal(), which takes care of
> removing a signaled fence from the list nouveau_fence_chan.pending.
>
> This self-imposed rule is violated in nouveau_fence_done(), where
> dma_fence_is_signaled() (somewhat surprisingly, considering its name)
> can signal the fence without removing it from the list. This enables
> accesses to already signaled fences through the list, which is a bug.
>
> In particular, it can race with nouveau_fence_context_kill(), which
> would then attempt to set an error code on an already signaled fence,
> which is illegal.
>
> In nouveau_fence_done(), the call to nouveau_fence_update() already
> ensures to signal all ready fences. Thus, the signaling potentially
> performed by dma_fence_is_signaled() is actually not necessary.
Ah, I now got what you are trying to do here! But that won't help.
The problem is it is perfectly valid for somebody external (e.g. other driver, TTM etc...) to call dma_fence_is_signaled() on a nouveau fence.
This will then in turn still signal the fence and leave it on the pending list, creating the problem you have.
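
As a rough illustration of that point, here is a userspace model of the two ways to query a fence. It is a sketch under assumptions: generic_is_signaled() mimics the documented behavior of dma_fence_is_signaled() (test the flag, otherwise ask the .signaled hook and signal on success), while the struct layout, hw_seqno, and the boolean on_pending_list are invented for the example and do not exist in nouveau.

```c
#include <stdbool.h>
#include <stddef.h>

struct fence;

struct fence_ops {
	bool (*signaled)(struct fence *f);
};

struct fence {
	const struct fence_ops *ops;
	bool signaled_bit;     /* models DMA_FENCE_FLAG_SIGNALED_BIT */
	bool on_pending_list;  /* models membership in the driver's pending list */
	unsigned int seqno;
};

/* Models the sequence number the hardware has reached so far. */
static unsigned int hw_seqno;

/* Driver's .signaled hook: polls the hardware, does no list bookkeeping. */
static bool drv_hw_signaled(struct fence *f)
{
	return (int)(hw_seqno - f->seqno) >= 0;
}

static const struct fence_ops drv_ops = { .signaled = drv_hw_signaled };

/* Driver's own signaling path: sets the flag AND unlinks the fence. */
static void drv_fence_signal(struct fence *f)
{
	f->signaled_bit = true;
	f->on_pending_list = false;
}

/* Models the generic helper that any external caller may use. */
static bool generic_is_signaled(struct fence *f)
{
	if (f->signaled_bit)
		return true;
	if (f->ops->signaled && f->ops->signaled(f)) {
		f->signaled_bit = true; /* generic signal: driver list untouched */
		return true;
	}
	return false;
}

/* Pure flag test, as used in the proposed patch: never signals anything. */
static bool flag_is_signaled(const struct fence *f)
{
	return f->signaled_bit;
}
```

In this model, switching the driver's own caller to flag_is_signaled() does not help once some other component calls generic_is_signaled(): the fence ends up with signaled_bit set while on_pending_list is still true.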
Regards,
Christian.
>
> Replace the call to dma_fence_is_signaled() with
> nouveau_fence_base_is_signaled().
>
> Cc: <stable@vger.kernel.org> # 4.10+, precise commit not to be determined
> Signed-off-by: Philipp Stanner <phasta@kernel.org>
> ---
> drivers/gpu/drm/nouveau/nouveau_fence.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/nouveau/nouveau_fence.c b/drivers/gpu/drm/nouveau/nouveau_fence.c
> index 7cc84472cece..33535987d8ed 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_fence.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_fence.c
> @@ -274,7 +274,7 @@ nouveau_fence_done(struct nouveau_fence *fence)
> nvif_event_block(&fctx->event);
> spin_unlock_irqrestore(&fctx->lock, flags);
> }
> - return dma_fence_is_signaled(&fence->base);
> + return test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->base.flags);
> }
>
> static long
* Re: [PATCH 1/3] drm/nouveau: Prevent signaled fences in pending list
2025-04-10 12:58 ` Christian König
@ 2025-04-10 13:09 ` Philipp Stanner
2025-04-10 13:16 ` Christian König
0 siblings, 1 reply; 23+ messages in thread
From: Philipp Stanner @ 2025-04-10 13:09 UTC (permalink / raw)
To: Christian König, Philipp Stanner, Lyude Paul,
Danilo Krummrich, David Airlie, Simona Vetter, Sabrina Dubroca,
Sumit Semwal
Cc: dri-devel, nouveau, linux-kernel, netdev, linux-media,
linaro-mm-sig, stable
On Thu, 2025-04-10 at 14:58 +0200, Christian König wrote:
> Am 10.04.25 um 11:24 schrieb Philipp Stanner:
> > Nouveau currently relies on the assumption that dma_fences will
> > only
> > ever get signaled through nouveau_fence_signal(), which takes care
> > of
> > removing a signaled fence from the list nouveau_fence_chan.pending.
> >
> > This self-imposed rule is violated in nouveau_fence_done(), where
> > dma_fence_is_signaled() (somewhat surprisingly, considering its
> > name)
> > can signal the fence without removing it from the list. This
> > enables
> > accesses to already signaled fences through the list, which is a
> > bug.
> >
> > In particular, it can race with nouveau_fence_context_kill(), which
> > would then attempt to set an error code on an already signaled
> > fence,
> > which is illegal.
> >
> > In nouveau_fence_done(), the call to nouveau_fence_update() already
> > ensures to signal all ready fences. Thus, the signaling potentially
> > performed by dma_fence_is_signaled() is actually not necessary.
>
> Ah, I now got what you are trying to do here! But that won't help.
>
> The problem is it is perfectly valid for somebody external (e.g.
> other driver, TTM etc...) to call dma_fence_is_signaled() on a
> nouveau fence.
>
> This will then in turn still signal the fence and leave it on the
> pending list and creating the problem you have.
Good to hear – precisely that then is the use case for a dma_fence
callback! ^_^ It guarantees that, no matter who signals a fence, no
matter at what place, a certain action will always be performed.
I can't think of any other mechanism which could guarantee that a
signaled fence immediately gets removed from nouveau's pending list,
other than the callbacks.
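
To make the idea concrete, a minimal userspace sketch of such a cleanup callback follows. It is not the dma_fence API: struct fence, fence_signal(), drv_fence_emit() and the single cleanup_cb slot are hypothetical simplifications, and the pending list is reduced to a boolean.

```c
#include <stdbool.h>
#include <stddef.h>

struct fence;
typedef void (*fence_cb_fn)(struct fence *f);

struct fence {
	bool signaled;
	bool on_pending_list;    /* models membership in the pending list */
	fence_cb_fn cleanup_cb;  /* single callback slot, enough for the sketch */
};

/* Cleanup callback: drop the fence from the (modeled) pending list. */
static void drv_cleanup_cb(struct fence *f)
{
	f->on_pending_list = false;
}

/*
 * Generic signal path: set the flag once, then run the registered callback.
 * Returns false if the fence had already been signaled.
 */
static bool fence_signal(struct fence *f)
{
	if (f->signaled)
		return false;
	f->signaled = true;
	if (f->cleanup_cb)
		f->cleanup_cb(f);
	return true;
}

/* Driver emits a fence: put it on the list and register the cleanup. */
static void drv_fence_emit(struct fence *f)
{
	f->signaled = false;
	f->on_pending_list = true;
	f->cleanup_cb = drv_cleanup_cb;
}
```

Because the list removal hangs off the signal path itself, it runs regardless of which caller triggered the signaling, which is the property argued for above.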
But seriously, I don't think that anyone does this currently, nor do I
think that anyone could get away with doing it without the entire
computer burning down.
P.
>
> Regards,
> Christian.
>
> >
> > Replace the call to dma_fence_is_signaled() with
> > nouveau_fence_base_is_signaled().
> >
> > Cc: <stable@vger.kernel.org> # 4.10+, precise commit not to be
> > determined
> > Signed-off-by: Philipp Stanner <phasta@kernel.org>
> > ---
> > drivers/gpu/drm/nouveau/nouveau_fence.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/drivers/gpu/drm/nouveau/nouveau_fence.c
> > b/drivers/gpu/drm/nouveau/nouveau_fence.c
> > index 7cc84472cece..33535987d8ed 100644
> > --- a/drivers/gpu/drm/nouveau/nouveau_fence.c
> > +++ b/drivers/gpu/drm/nouveau/nouveau_fence.c
> > @@ -274,7 +274,7 @@ nouveau_fence_done(struct nouveau_fence *fence)
> > nvif_event_block(&fctx->event);
> > spin_unlock_irqrestore(&fctx->lock, flags);
> > }
> > - return dma_fence_is_signaled(&fence->base);
> > > + return test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->base.flags);
> > }
> >
> > static long
>
* Re: [PATCH 1/3] drm/nouveau: Prevent signaled fences in pending list
2025-04-10 13:09 ` Philipp Stanner
@ 2025-04-10 13:16 ` Christian König
2025-04-10 15:36 ` Philipp Stanner
0 siblings, 1 reply; 23+ messages in thread
From: Christian König @ 2025-04-10 13:16 UTC (permalink / raw)
To: phasta, Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Sabrina Dubroca, Sumit Semwal
Cc: dri-devel, nouveau, linux-kernel, netdev, linux-media,
linaro-mm-sig, stable
[-- Attachment #1: Type: text/plain, Size: 3323 bytes --]
Am 10.04.25 um 15:09 schrieb Philipp Stanner:
> On Thu, 2025-04-10 at 14:58 +0200, Christian König wrote:
>> Am 10.04.25 um 11:24 schrieb Philipp Stanner:
>>> Nouveau currently relies on the assumption that dma_fences will
>>> only
>>> ever get signaled through nouveau_fence_signal(), which takes care
>>> of
>>> removing a signaled fence from the list nouveau_fence_chan.pending.
>>>
>>> This self-imposed rule is violated in nouveau_fence_done(), where
>>> dma_fence_is_signaled() (somewhat surprisingly, considering its
>>> name)
>>> can signal the fence without removing it from the list. This
>>> enables
>>> accesses to already signaled fences through the list, which is a
>>> bug.
>>>
>>> In particular, it can race with nouveau_fence_context_kill(), which
>>> would then attempt to set an error code on an already signaled
>>> fence,
>>> which is illegal.
>>>
>>> In nouveau_fence_done(), the call to nouveau_fence_update() already
>>> ensures to signal all ready fences. Thus, the signaling potentially
>>> performed by dma_fence_is_signaled() is actually not necessary.
>> Ah, I now got what you are trying to do here! But that won't help.
>>
>> The problem is it is perfectly valid for somebody external (e.g.
>> other driver, TTM etc...) to call dma_fence_is_signaled() on a
>> nouveau fence.
>>
>> This will then in turn still signal the fence and leave it on the
>> pending list and creating the problem you have.
> Good to hear – precisely that then is the use case for a dma_fence
> callback! ^_^ It guarantees that, no matter who signals a fence, no
> matter at what place, a certain action will always be performed.
>
> I can't think of any other mechanism which could guarantee that a
> signaled fence immediately gets removed from nouveau's pending list,
> other than the callbacks.
>
> But seriously, I don't think that anyone does this currently, nor do I
> think that anyone could get away with doing it without the entire
> computer burning down.
Yeah, I don't think that this is possible at the moment.
When you do stuff like that from the provider side, you will always run into lifetime issues, because in the signaling-from-interrupt case you then drop the last reference before the signaling is completed.
How about the attached (not even compile tested) patch? I think it should fix the issue.
Regards,
Christian.
>
> P.
>
>
>
>> Regards,
>> Christian.
>>
>>> Replace the call to dma_fence_is_signaled() with
>>> nouveau_fence_base_is_signaled().
>>>
>>> Cc: <stable@vger.kernel.org> # 4.10+, precise commit not to be
>>> determined
>>> Signed-off-by: Philipp Stanner <phasta@kernel.org>
>>> ---
>>> drivers/gpu/drm/nouveau/nouveau_fence.c | 2 +-
>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/nouveau/nouveau_fence.c
>>> b/drivers/gpu/drm/nouveau/nouveau_fence.c
>>> index 7cc84472cece..33535987d8ed 100644
>>> --- a/drivers/gpu/drm/nouveau/nouveau_fence.c
>>> +++ b/drivers/gpu/drm/nouveau/nouveau_fence.c
>>> @@ -274,7 +274,7 @@ nouveau_fence_done(struct nouveau_fence *fence)
>>> nvif_event_block(&fctx->event);
>>> spin_unlock_irqrestore(&fctx->lock, flags);
>>> }
>>> - return dma_fence_is_signaled(&fence->base);
>>> + return test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->base.flags);
>>> }
>>>
>>> static long
[-- Attachment #2: 0001-drm-nouveau-fix-and-cleanup-fence-handling.patch --]
[-- Type: text/x-patch, Size: 2670 bytes --]
From 165df36b603b37f6f1785ce359f7cd184db62196 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Christian=20K=C3=B6nig?= <christian.koenig@amd.com>
Date: Thu, 10 Apr 2025 10:18:29 +0200
Subject: [PATCH] drm/nouveau: fix and cleanup fence handling
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
The fence was not removed from the pending list when signaled from the
.signaled callback. Fix that and also remove the superfluous
.enable_signaling callback.
Signed-off-by: Christian König <christian.koenig@amd.com>
---
drivers/gpu/drm/nouveau/nouveau_fence.c | 31 +++++++------------------
1 file changed, 8 insertions(+), 23 deletions(-)
diff --git a/drivers/gpu/drm/nouveau/nouveau_fence.c b/drivers/gpu/drm/nouveau/nouveau_fence.c
index 7cc84472cece..53c70ddef964 100644
--- a/drivers/gpu/drm/nouveau/nouveau_fence.c
+++ b/drivers/gpu/drm/nouveau/nouveau_fence.c
@@ -485,32 +485,18 @@ static bool nouveau_fence_is_signaled(struct dma_fence *f)
ret = (int)(fctx->read(chan) - fence->base.seqno) >= 0;
rcu_read_unlock();
- return ret;
-}
-
-static bool nouveau_fence_no_signaling(struct dma_fence *f)
-{
- struct nouveau_fence *fence = from_fence(f);
-
- /*
- * caller should have a reference on the fence,
- * else fence could get freed here
- */
- WARN_ON(kref_read(&fence->base.refcount) <= 1);
+ if (ret) {
+ /*
+ * caller should have a reference on the fence,
+ * else fence could get freed here
+ */
+ WARN_ON(kref_read(&fence->base.refcount) <= 1);
- /*
- * This needs uevents to work correctly, but dma_fence_add_callback relies on
- * being able to enable signaling. It will still get signaled eventually,
- * just not right away.
- */
- if (nouveau_fence_is_signaled(f)) {
list_del(&fence->head);
-
dma_fence_put(&fence->base);
- return false;
}
- return true;
+ return ret;
}
static void nouveau_fence_release(struct dma_fence *f)
@@ -525,7 +511,6 @@ static void nouveau_fence_release(struct dma_fence *f)
static const struct dma_fence_ops nouveau_fence_ops_legacy = {
.get_driver_name = nouveau_fence_get_get_driver_name,
.get_timeline_name = nouveau_fence_get_timeline_name,
- .enable_signaling = nouveau_fence_no_signaling,
.signaled = nouveau_fence_is_signaled,
.wait = nouveau_fence_wait_legacy,
.release = nouveau_fence_release
@@ -540,7 +525,7 @@ static bool nouveau_fence_enable_signaling(struct dma_fence *f)
if (!fctx->notify_ref++)
nvif_event_allow(&fctx->event);
- ret = nouveau_fence_no_signaling(f);
+ ret = nouveau_fence_is_signaled(f);
if (ret)
set_bit(DMA_FENCE_FLAG_USER_BITS, &fence->base.flags);
else if (!--fctx->notify_ref)
--
2.34.1
* Re: [PATCH 0/3] drm/nouveau: Fix & improve nouveau_fence_done()
2025-04-10 12:18 ` Christian König
@ 2025-04-10 13:18 ` Philipp Stanner
0 siblings, 0 replies; 23+ messages in thread
From: Philipp Stanner @ 2025-04-10 13:18 UTC (permalink / raw)
To: Christian König, phasta, Lyude Paul, Danilo Krummrich,
David Airlie, Simona Vetter, Sabrina Dubroca, Sumit Semwal
Cc: dri-devel, nouveau, linux-kernel, netdev, linux-media,
linaro-mm-sig
On Thu, 2025-04-10 at 14:18 +0200, Christian König wrote:
> Am 10.04.25 um 11:51 schrieb Philipp Stanner:
> > On Thu, 2025-04-10 at 11:24 +0200, Philipp Stanner wrote:
> > > Contains two patches improving nouveau_fence_done(), and one
> > > addressing
> > > an actual bug (race):
> > Oops, that's the wrong calltrace. Here we go:
> >
> > [ 85.791794] Call Trace: [ 85.791796] <TASK> [ 85.791797] ?
> > nouveau_fence_context_kill
> > (/home/imperator/linux/./include/linux/dma-fence.h:587
> > (discriminator 9)
> > /home/imperator/linux/drivers/gpu/drm/nouveau/nouveau_fence.c:94
> > (discriminator 9)) nouveau [ 85.791874] ? __warn.cold
> > (/home/imperator/linux/kernel/panic.c:748) [ 85.791878] ?
> > nouveau_fence_context_kill
> > (/home/imperator/linux/./include/linux/dma-fence.h:587
> > (discriminator 9)
> > /home/imperator/linux/drivers/gpu/drm/nouveau/nouveau_fence.c:94
> > (discriminator 9)) nouveau [ 85.791950] ? report_bug
> > (/home/imperator/linux/lib/bug.c:180
> > /home/imperator/linux/lib/bug.c:219) [ 85.791953] ? handle_bug
> > (/home/imperator/linux/arch/x86/kernel/traps.c:260) [ 85.791956] ?
> > exc_invalid_op (/home/imperator/linux/arch/x86/kernel/traps.c:309
> > (discriminator 1)) [ 85.791957] ? asm_exc_invalid_op
> > (/home/imperator/linux/./arch/x86/include/asm/idtentry.h:621) [
> > 85.791960] ? nouveau_fence_context_kill
> > (/home/imperator/linux/./include/linux/dma-fence.h:587
> > (discriminator 9)
> > /home/imperator/linux/drivers/gpu/drm/nouveau/nouveau_fence.c:94
> > (discriminator 9)) nouveau [ 85.792028] drm_sched_fini.cold
> > (/home/imperator/linux/./include/trace/../../drivers/gpu/drm/schedu
> > ler/gpu_scheduler_trace.h:72 (discriminator 1)) gpu_sched [
> > 85.792033] ? drm_sched_entity_kill.part.0
> > (/home/imperator/linux/drivers/gpu/drm/scheduler/sched_entity.c:243
> > (discriminator 2)) gpu_sched [ 85.792037] nouveau_sched_destroy
> > (/home/imperator/linux/drivers/gpu/drm/nouveau/nouveau_sched.c:509
> > /home/imperator/linux/drivers/gpu/drm/nouveau/nouveau_sched.c:518)
> > nouveau [ 85.792122] nouveau_abi16_chan_fini.isra.0
> > (/home/imperator/linux/drivers/gpu/drm/nouveau/nouveau_abi16.c:188)
> > nouveau [ 85.792191] nouveau_abi16_fini
> > (/home/imperator/linux/drivers/gpu/drm/nouveau/nouveau_abi16.c:224
> > (discriminator 3)) nouveau [ 85.792263] nouveau_drm_postclose
> > (/home/imperator/linux/drivers/gpu/drm/nouveau/nouveau_drm.c:1240)
> > nouveau [ 85.792349] drm_file_free
> > (/home/imperator/linux/drivers/gpu/drm/drm_file.c:255) [ 85.792353]
> > drm_release
> > (/home/imperator/linux/./arch/x86/include/asm/atomic.h:67
> > (discriminator 1)
> > /home/imperator/linux/./include/linux/atomic/atomic-arch-
> > fallback.h:2278 (discriminator 1)
> > /home/imperator/linux/./include/linux/atomic/atomic-
> > instrumented.h:1384 (discriminator 1)
> > /home/imperator/linux/drivers/gpu/drm/drm_file.c:428 (discriminator
> > 1)) [ 85.792355] __fput (/home/imperator/linux/fs/file_table.c:464)
> > [ 85.792357] task_work_run
> > (/home/imperator/linux/kernel/task_work.c:227) [ 85.792360] do_exit
> > (/home/imperator/linux/kernel/exit.c:939) [ 85.792362]
> > do_group_exit (/home/imperator/linux/kernel/exit.c:1069) [
> > 85.792364] get_signal (/home/imperator/linux/kernel/signal.c:3036)
> > [ 85.792366] arch_do_signal_or_restart
> > (/home/imperator/linux/./arch/x86/include/asm/syscall.h:38
> > /home/imperator/linux/arch/x86/kernel/signal.c:264
> > /home/imperator/linux/arch/x86/kernel/signal.c:339) [ 85.792369]
> > syscall_exit_to_user_mode
> > (/home/imperator/linux/kernel/entry/common.c:113
> > /home/imperator/linux/./include/linux/entry-common.h:329
> > /home/imperator/linux/kernel/entry/common.c:207
> > /home/imperator/linux/kernel/entry/common.c:218) [ 85.792372]
> > do_syscall_64
> > (/home/imperator/linux/./arch/x86/include/asm/cpufeature.h:172
> > /home/imperator/linux/arch/x86/entry/common.c:98) [ 85.792373] ?
> > syscall_exit_to_user_mode_prepare
> > (/home/imperator/linux/./include/linux/audit.h:357
> > /home/imperator/linux/kernel/entry/common.c:166
> > /home/imperator/linux/kernel/entry/common.c:200) [ 85.792376] ?
> > syscall_exit_to_user_mode
> > (/home/imperator/linux/./arch/x86/include/asm/paravirt.h:686
> > /home/imperator/linux/./include/linux/entry-common.h:232
> > /home/imperator/linux/kernel/entry/common.c:206
> > /home/imperator/linux/kernel/entry/common.c:218) [ 85.792377] ?
> > do_syscall_64
> > (/home/imperator/linux/./arch/x86/include/asm/cpufeature.h:172
> > /home/imperator/linux/arch/x86/entry/common.c:98) [ 85.792378]
> > entry_SYSCALL_64_after_hwframe
> > (/home/imperator/linux/arch/x86/entry/entry_64.S:130) [ 85.792381]
> > RIP: 0033:0x7ff950b6af70 [ 85.792383] Code: Unable to access opcode
> > bytes at 0x7ff950b6af46. objdump: '/tmp/tmp.sfPRl5k2te.o': No such
> > file Code starting with the faulting instruction
> > =========================================== [ 85.792383] RSP:
> > 002b:00007ff93cdfb6f0 EFLAGS: 00000293 ORIG_RAX: 000000000000010f [
> > 85.792385] RAX: fffffffffffffdfe RBX: 000055d386d61870 RCX:
> > 00007ff950b6af70 [ 85.792386] RDX: 0000000000000000 RSI:
> > 0000000000000001 RDI: 00007ff928000b90 [ 85.792387] RBP:
> > 00007ff93cdfb740 R08: 0000000000000008 R09: 0000000000000000 [
> > 85.792388] R10: 0000000000000000 R11: 0000000000000293 R12:
> > 0000000000000001 [ 85.792388] R13: 0000000000000000 R14:
> > 0000000000000000 R15: 00007ff951b10b40 [ 85.792390] </TASK> [
> > 85.792391] ---[ end trace 0000000000000000 ]---
>
> I think I understand the problem now as well, but that backtrace is
> completely mangled in the mail.
>
> It would be nice if you could send that out again.
I really need to install Mutt soon..
Let's try it this way:
https://paste.debian.net/1368679/
P.
>
> Thanks,
> Christian.
>
> >
> > By the way, for reference:
> > I did try whether it could be done to have nouveau_fence_signal()
> > incorporated into nouveau_fence_update() and nouveau_fence_done().
> > This, however, would then cause a race with the list_del() in
> > nouveau_fence_no_signaling(), WARNing because of the list poison.
> >
> > So the "solution" space is:
> > * A cleanup callback on the dma_fence.
> > * Keeping the current race or
> > * replacing it with another race with another function.
> > * Just preventing nouveau_fence_done() from signaling fences other
> > than through nouveau_fence_update/signal
> >
> > The latter seems clearly like the cleanest solution to me.
> > Alternative
> > would be a work-intensive rework of all the misdesigns broken in
> > nouveau_fence.c
> >
> >
> > P.
> >
> > > [ 39.848463] WARNING: CPU: 21 PID: 1734 at
> > > drivers/gpu/drm/nouveau/nouveau_fence.c:509
> > > nouveau_fence_no_signaling+0xac/0xd0 [nouveau]
> > > [ 39.848551] Modules linked in: snd_seq_dummy snd_hrtimer
> > > nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet
> > > nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_ine
> > > t nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat
> > > nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 rfkill ip_set
> > > nf_tables qrtr sunrpc snd_sof_pci_intel_
> > > tgl snd_sof_pci_intel_cnl snd_sof_intel_hda_generic snd_sof_pci
> > > snd_sof_xtensa_dsp snd_sof_intel_hda_common snd_soc_hdac_hda
> > > snd_sof_intel_hda snd_sof snd_sof_utils snd
> > > _soc_acpi_intel_match snd_soc_acpi snd_soc_acpi_intel_sdca_quirks
> > > snd_sof_intel_hda_mlink snd_soc_sdca snd_soc_avs snd_ctl_led
> > > snd_soc_hda_codec intel_rapl_msr snd_hda_
> > > codec_realtek snd_hda_ext_core intel_rapl_common
> > > snd_hda_codec_generic snd_soc_core snd_hda_scodec_component
> > > intel_uncore_frequency intel_uncore_frequency_common snd_hd
> > > a_codec_hdmi intel_ifs snd_compress i10nm_edac skx_edac_common
> > > nfit
> > > snd_hda_intel snd_intel_dspcfg libnvdimm snd_hda_codec
> > > binfmt_misc
> > > snd_hwdep snd_hda_core snd_seq sn
> > > d_seq_device dell_wmi
> > > [ 39.848575] dell_pc x86_pkg_temp_thermal spi_nor
> > > platform_profile
> > > sparse_keymap intel_powerclamp dax_hmem snd_pcm cxl_acpi coretemp
> > > cxl_port iTCO_wdt mtd rapl intel
> > > _pmc_bxt pmt_telemetry cxl_core dell_wmi_sysman pmt_class
> > > iTCO_vendor_support snd_timer isst_if_mmio vfat intel_cstate
> > > dell_smbios dcdbas fat dell_wmi_ddv dell_smm_hwmo
> > > n dell_wmi_descriptor firmware_attributes_class wmi_bmof
> > > intel_uncore
> > > einj pcspkr isst_if_mbox_pci atlantic snd isst_if_common
> > > intel_vsec
> > > e1000e macsec mei_me i2c_i801
> > > spi_intel_pci soundcore i2c_smbus spi_intel mei joydev loop
> > > nfnetlink
> > > zram nouveau drm_ttm_helper ttm polyval_clmulni iaa_crypto
> > > gpu_sched
> > > polyval_generic rtsx_pci_sdmm
> > > c ghash_clmulni_intel i2c_algo_bit mmc_core drm_gpuvm
> > > sha512_ssse3
> > > nvme drm_exec drm_display_helper sha256_ssse3 idxd sha1_ssse3 cec
> > > nvme_core idxd_bus rtsx_pci nvme_au
> > > th pinctrl_alderlake ip6_tables ip_tables fuse
> > > [ 39.848603] CPU: 21 UID: 42 PID: 1734 Comm: gnome-shell
> > > Tainted:
> > > G W 6.14.0-rc4+ #11
> > > [ 39.848605] Tainted: [W]=WARN
> > > [ 39.848606] Hardware name: Dell Inc. Precision 7960
> > > Tower/01G0M6,
> > > BIOS 2.7.0 12/17/2024
> > > [ 39.848607] RIP: 0010:nouveau_fence_no_signaling+0xac/0xd0
> > > [nouveau]
> > > [ 39.848688] Code: db 74 17 48 8d 7b 38 b8 ff ff ff ff f0 0f c1
> > > 43
> > > 38 83 f8 01 74 29 85 c0 7e 17 31 c0 5b 5d c3 cc cc cc cc e8 76 b2
> > > c5
> > > f0 eb 96 <0f> 0b e9 67 ff ff f
> > > f be 03 00 00 00 e8 83 76 33 f1 31 c0 eb dd e8
> > > [ 39.848690] RSP: 0018:ff1cc1ffc5c039f0 EFLAGS: 00010046
> > > [ 39.848691] RAX: 0000000000000001 RBX: ff175a3b504da980 RCX:
> > > ff175a3b4801e008
> > > [ 39.848692] RDX: ff175a3b43e7bad0 RSI: ffffffffc09d3fda RDI:
> > > ff175a3b504da980
> > > [ 39.848693] RBP: ff175a3b504da9c0 R08: ffffffffc09e39df R09:
> > > 0000000000000001
> > > [ 39.848694] R10: 0000000000000001 R11: 0000000000000000 R12:
> > > ff175a3b6d97de00
> > > [ 39.848695] R13: 0000000000000246 R14: ff1cc1ffc5c03c60 R15:
> > > 0000000000000001
> > > [ 39.848696] FS: 00007fc5477846c0(0000)
> > > GS:ff175a5a50280000(0000)
> > > knlGS:0000000000000000
> > > [ 39.848698] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > [ 39.848699] CR2: 000055cb7613d1a8 CR3: 000000012e5ce004 CR4:
> > > 0000000000f71ef0
> > > [ 39.848700] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> > > 0000000000000000
> > > [ 39.848701] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7:
> > > 0000000000000400
> > > [ 39.848702] PKRU: 55555554
> > > [ 39.848703] Call Trace:
> > > [ 39.848704] <TASK>
> > > [ 39.848705] ? nouveau_fence_no_signaling+0xac/0xd0 [nouveau]
> > > [ 39.848782] ? __warn.cold+0x93/0xfa
> > > [ 39.848785] ? nouveau_fence_no_signaling+0xac/0xd0 [nouveau]
> > > [ 39.848861] ? report_bug+0xff/0x140
> > > [ 39.848863] ? handle_bug+0x58/0x90
> > > [ 39.848865] ? exc_invalid_op+0x17/0x70
> > > [ 39.848866] ? asm_exc_invalid_op+0x1a/0x20
> > > [ 39.848870] ? nouveau_fence_no_signaling+0xac/0xd0 [nouveau]
> > > [ 39.848943] nouveau_fence_enable_signaling+0x32/0x80
> > > [nouveau]
> > > [ 39.849016] ? __pfx_nouveau_fence_cleanup_cb+0x10/0x10
> > > [nouveau]
> > > [ 39.849088] __dma_fence_enable_signaling+0x33/0xc0
> > > [ 39.849090] dma_fence_add_callback+0x4b/0xd0
> > > [ 39.849093] nouveau_fence_emit+0xa3/0x260 [nouveau]
> > > [ 39.849166] nouveau_fence_new+0x7d/0xf0 [nouveau]
> > > [ 39.849242] nouveau_gem_ioctl_pushbuf+0xe8f/0x1300 [nouveau]
> > > [ 39.849338] ? __pfx_nouveau_gem_ioctl_pushbuf+0x10/0x10
> > > [nouveau]
> > > [ 39.849431] drm_ioctl_kernel+0xad/0x100
> > > [ 39.849433] drm_ioctl+0x288/0x550
> > > [ 39.849435] ? __pfx_nouveau_gem_ioctl_pushbuf+0x10/0x10
> > > [nouveau]
> > > [ 39.849526] nouveau_drm_ioctl+0x57/0xb0 [nouveau]
> > > [ 39.849620] __x64_sys_ioctl+0x94/0xc0
> > > [ 39.849621] do_syscall_64+0x82/0x160
> > > [ 39.849623] ? drm_ioctl+0x2b7/0x550
> > > [ 39.849625] ? __pfx_nouveau_gem_ioctl_pushbuf+0x10/0x10
> > > [nouveau]
> > > [ 39.849719] ? ktime_get_mono_fast_ns+0x38/0xd0
> > > [ 39.849721] ? __pm_runtime_suspend+0x69/0xc0
> > > [ 39.849724] ? syscall_exit_to_user_mode_prepare+0x15e/0x1a0
> > > [ 39.849726] ? syscall_exit_to_user_mode+0x10/0x200
> > > [ 39.849729] ? do_syscall_64+0x8e/0x160
> > > [ 39.849730] ? exc_page_fault+0x7e/0x1a0
> > > [ 39.849733] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > > [ 39.849735] RIP: 0033:0x7fc5576fe0ad
> > > [ 39.849736] Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45
> > > 10
> > > c7 45 b0 10 00 00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00
> > > 00
> > > 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25
> > > 28
> > > 00 00 00
> > > [ 39.849737] RSP: 002b:00007ffc002688a0 EFLAGS: 00000246
> > > ORIG_RAX:
> > > 0000000000000010
> > > [ 39.849739] RAX: ffffffffffffffda RBX: 000055cb74e316c0 RCX:
> > > 00007fc5576fe0ad
> > > [ 39.849740] RDX: 00007ffc00268960 RSI: 00000000c0406481 RDI:
> > > 000000000000000e
> > > [ 39.849741] RBP: 00007ffc002688f0 R08: 0000000000000000 R09:
> > > 000055cb74e35560
> > > [ 39.849742] R10: 0000000000000014 R11: 0000000000000246 R12:
> > > 00007ffc00268960
> > > [ 39.849744] R13: 00000000c0406481 R14: 000000000000000e R15:
> > > 000055cb74e3cd10
> > > [ 39.849746] </TASK>
> > > [ 39.849746] ---[ end trace 0000000000000000 ]---
> > > [ 39.849776] ------------[ cut here ]------------
> > >
> > >
> > > This is the first WARN_ON() in dma_fence_set_error(), called by
> > > nouveau_fence_context_kill().
> > >
> > > It's rare, but it is a bug, or rather: the archetype of a race,
> > > since
> > > (as Christian pointed out) nouveau_fence_update() later at some
> > > point
> > > will remove the signaled fence (by signaling it again).
> > >
> > >
> > > P.
> > >
> > >
> > > Philipp Stanner (3):
> > > drm/nouveau: Prevent signaled fences in pending list
> > > drm/nouveau: Remove surplus if-branch
> > > drm/nouveau: Add helper to check base fence
> > >
> > > drivers/gpu/drm/nouveau/nouveau_fence.c | 32 ++++++++++++++-----
> > > ----
> > > --
> > > 1 file changed, 18 insertions(+), 14 deletions(-)
> > >
>
* Re: [PATCH 1/3] drm/nouveau: Prevent signaled fences in pending list
2025-04-10 13:16 ` Christian König
@ 2025-04-10 15:36 ` Philipp Stanner
2025-04-11 9:29 ` Philipp Stanner
0 siblings, 1 reply; 23+ messages in thread
From: Philipp Stanner @ 2025-04-10 15:36 UTC (permalink / raw)
To: Christian König, phasta, Lyude Paul, Danilo Krummrich,
David Airlie, Simona Vetter, Sabrina Dubroca, Sumit Semwal
Cc: dri-devel, nouveau, linux-kernel, netdev, linux-media,
linaro-mm-sig, stable
On Thu, 2025-04-10 at 15:16 +0200, Christian König wrote:
> Am 10.04.25 um 15:09 schrieb Philipp Stanner:
> > On Thu, 2025-04-10 at 14:58 +0200, Christian König wrote:
> > > Am 10.04.25 um 11:24 schrieb Philipp Stanner:
> > > > Nouveau currently relies on the assumption that dma_fences will
> > > > only
> > > > ever get signaled through nouveau_fence_signal(), which takes
> > > > care
> > > > of
> > > > removing a signaled fence from the list
> > > > nouveau_fence_chan.pending.
> > > >
> > > > This self-imposed rule is violated in nouveau_fence_done(),
> > > > where
> > > > dma_fence_is_signaled() (somewhat surprisingly, considering its
> > > > name)
> > > > can signal the fence without removing it from the list. This
> > > > enables
> > > > accesses to already signaled fences through the list, which is
> > > > a
> > > > bug.
> > > >
> > > > In particular, it can race with nouveau_fence_context_kill(),
> > > > which
> > > > would then attempt to set an error code on an already signaled
> > > > fence,
> > > > which is illegal.
> > > >
> > > > In nouveau_fence_done(), the call to nouveau_fence_update()
> > > > already
> > > > ensures to signal all ready fences. Thus, the signaling
> > > > potentially
> > > > performed by dma_fence_is_signaled() is actually not necessary.
> > > Ah, I now got what you are trying to do here! But that won't
> > > help.
> > >
> > > The problem is it is perfectly valid for somebody external (e.g.
> > > other driver, TTM etc...) to call dma_fence_is_signaled() on a
> > > nouveau fence.
> > >
> > > This will then in turn still signal the fence and leave it on the
> > > pending list, creating the problem you have.
> > Good to hear – precisely that then is the use case for a dma_fence
> > callback! ^_^ It guarantees that, no matter who signals a fence, no
> > matter at what place, a certain action will always be performed.
> >
> > I can't think of any other mechanism which could guarantee that a
> > signaled fence immediately gets removed from nouveau's pending
> > list,
> > other than the callbacks.
> >
> > But seriously, I don't think that anyone does this currently, nor
> > do I
> > think that anyone could get away with doing it without the entire
> > computer burning down.
>
> Yeah, I don't think that this is possible at the moment.
>
> When you do stuff like that from the provider side you will always
> run into lifetime issues because in the signaling from interrupt case
> you then drop the last reference before the signaling is completed.
>
> How about the attached (not even compile tested) patch? I think it
> should fix the issue.
This patch looked correct enough for me to try it out on top of my
memleak fix series [1] (which seems to reveal all those problems
through races appearing due to the removal of the waitqueue in
nouveau_sched_fini()).
The code looked correct to me, but it still makes boom-boom, again
because two parties get their fingers onto list_del():
[paste in case my editor explodes again:
https://paste.debian.net/1368705/ ]
[ 41.681698] list_del corruption, ff31ae696cdc86a0->next is
LIST_POISON1 (dead000000000100)
[ 41.681720] ------------[ cut here ]------------
[ 41.681722] kernel BUG at lib/list_debug.c:56!
[ 41.681729] Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[ 41.681732] CPU: 22 UID: 42 PID: 1733 Comm: gnome-shell Not tainted
6.14.0-rc4+ #11
[ 41.681735] Hardware name: Dell Inc. Precision 7960 Tower/01G0M6,
BIOS 2.7.0 12/17/2024
[ 41.681737] RIP: 0010:__list_del_entry_valid_or_report+0x76/0xf0
[ 41.681743] Code: 75 66 5b b8 01 00 00 00 5d 41 5c c3 cc cc cc cc 48
89 ef e8 4c e7 b0 ff 48 89 ea 48 89 de 48 c7 c7 38 fb b5 a0 e8 3a 6d 6b
ff <0f> 0b 4c 89 e7 e8 30 e7 b0 ff 4c 89 e2 48 89 de 48 c7 c7 70 fb b5
[ 41.681745] RSP: 0018:ff4fe30cc0f83b30 EFLAGS: 00010246
[ 41.681748] RAX: 000000000000004e RBX: ff31ae696cdc86a0 RCX:
0000000000000027
[ 41.681749] RDX: 0000000000000000 RSI: 0000000000000001 RDI:
ff31ae8850321900
[ 41.681751] RBP: dead000000000100 R08: 0000000000000000 R09:
0000000000000000
[ 41.681752] R10: 7572726f63206c65 R11: 6c65645f7473696c R12:
dead000000000122
[ 41.681753] R13: ff31ae696cdc8662 R14: ff4fe30cc0f83cb8 R15:
00007f68b7f9a000
[ 41.681754] FS: 00007f68bd0396c0(0000) GS:ff31ae8850300000(0000)
knlGS:0000000000000000
[ 41.681756] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 41.681757] CR2: 00005577caaad68c CR3: 000000010407c003 CR4:
0000000000f71ef0
[ 41.681758] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[ 41.681759] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7:
0000000000000400
[ 41.681760] PKRU: 55555554
[ 41.681761] Call Trace:
[ 41.681763] <TASK>
[ 41.681764] ? __die_body.cold+0x19/0x27
[ 41.681768] ? die+0x2e/0x50
[ 41.681772] ? do_trap+0xca/0x110
[ 41.681775] ? do_error_trap+0x6a/0x90
[ 41.681776] ? __list_del_entry_valid_or_report+0x76/0xf0
[ 41.681778] ? exc_invalid_op+0x50/0x70
[ 41.681781] ? __list_del_entry_valid_or_report+0x76/0xf0
[ 41.681782] ? asm_exc_invalid_op+0x1a/0x20
[ 41.681788] ? __list_del_entry_valid_or_report+0x76/0xf0
[ 41.681789] nouveau_fence_is_signaled+0x47/0xc0 [nouveau]
[ 41.681961] dma_resv_iter_walk_unlocked.part.0+0xbd/0x170
[ 41.681966] dma_resv_test_signaled+0x53/0x100
[ 41.681969] ttm_bo_release+0x12d/0x2f0 [ttm]
[ 41.681979] nouveau_gem_object_del+0x54/0x80 [nouveau]
[ 41.682090] ttm_bo_vm_close+0x41/0x60 [ttm]
[ 41.682097] remove_vma+0x2c/0x70
[ 41.682100] vms_complete_munmap_vmas+0xd8/0x180
[ 41.682102] do_vmi_align_munmap+0x1d7/0x250
[ 41.682106] do_vmi_munmap+0xd0/0x170
[ 41.682109] __vm_munmap+0xb1/0x180
[ 41.682112] __x64_sys_munmap+0x1b/0x30
[ 41.682115] do_syscall_64+0x82/0x160
[ 41.682117] ? do_user_addr_fault+0x55a/0x7b0
[ 41.682121] ? exc_page_fault+0x7e/0x1a0
[ 41.682124] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 41.682127] RIP: 0033:0x7f68cceff02b
[ 41.682130] Code: 73 01 c3 48 8b 0d e5 6d 0f 00 f7 d8 64 89 01 48 83
c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 0b 00 00 00 0f
05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d b5 6d 0f 00 f7 d8 64 89 01 48
[ 41.682131] RSP: 002b:00007ffed8d00c08 EFLAGS: 00000206 ORIG_RAX:
000000000000000b
[ 41.682134] RAX: ffffffffffffffda RBX: 00005577ca99ef50 RCX:
00007f68cceff02b
[ 41.682135] RDX: 0000000000000000 RSI: 0000000000001000 RDI:
00007f68b7f9a000
[ 41.682136] RBP: 00007ffed8d00c50 R08: 00005577cacc4160 R09:
00005577caccf930
[ 41.682137] R10: 000199999996d999 R11: 0000000000000206 R12:
0000000000000000
[ 41.682138] R13: 00007ffed8d00c60 R14: 00005577caf6c550 R15:
0000000000000035
[ 41.682141] </TASK>
[ 41.682141] Modules linked in: nf_conntrack_netbios_ns
nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib
nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct
nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 rfkill
ip_set nf_tables qrtr sunrpc snd_sof_pci_intel_tgl
snd_sof_pci_intel_cnl snd_sof_intel_hda_generic snd_sof_pci
snd_sof_xtensa_dsp snd_sof_intel_hda_common snd_soc_hdac_hda
snd_sof_intel_hda snd_sof snd_sof_utils snd_soc_acpi_intel_match
snd_soc_acpi snd_soc_acpi_intel_sdca_quirks snd_sof_intel_hda_mlink
snd_soc_sdca snd_soc_avs snd_ctl_led intel_rapl_msr snd_soc_hda_codec
snd_hda_ext_core intel_rapl_common snd_hda_codec_realtek snd_soc_core
intel_uncore_frequency snd_hda_codec_generic
intel_uncore_frequency_common intel_ifs snd_hda_scodec_component
snd_hda_codec_hdmi i10nm_edac snd_compress skx_edac_common binfmt_misc
nfit snd_hda_intel snd_intel_dspcfg snd_hda_codec libnvdimm snd_hwdep
snd_hda_core snd_seq snd_seq_device x86_pkg_temp_thermal dell_pc
dell_wmi
[ 41.682195] dax_hmem platform_profile intel_powerclamp
sparse_keymap cxl_acpi snd_pcm cxl_port coretemp iTCO_wdt cxl_core
spi_nor intel_pmc_bxt dell_wmi_sysman rapl pmt_telemetry dell_smbios
iTCO_vendor_support pmt_class intel_cstate snd_timer vfat dcdbas
isst_if_mmio mtd dell_smm_hwmon dell_wmi_ddv dell_wmi_descriptor
intel_uncore firmware_attributes_class wmi_bmof atlantic fat einj
pcspkr isst_if_mbox_pci snd isst_if_common intel_vsec i2c_i801 mei_me
e1000e spi_intel_pci macsec soundcore i2c_smbus spi_intel mei joydev
loop nfnetlink zram nouveau drm_ttm_helper ttm iaa_crypto
polyval_clmulni rtsx_pci_sdmmc polyval_generic mmc_core gpu_sched
ghash_clmulni_intel i2c_algo_bit nvme sha512_ssse3 drm_gpuvm drm_exec
sha256_ssse3 idxd nvme_core sha1_ssse3 drm_display_helper rtsx_pci cec
nvme_auth idxd_bus pinctrl_alderlake ip6_tables ip_tables fuse
[ 41.682269] ---[ end trace 0000000000000000 ]---
[ 41.969442] RIP: 0010:__list_del_entry_valid_or_report+0x76/0xf0
[ 41.969458] Code: 75 66 5b b8 01 00 00 00 5d 41 5c c3 cc cc cc cc 48
89 ef e8 4c e7 b0 ff 48 89 ea 48 89 de 48 c7 c7 38 fb b5 a0 e8 3a 6d 6b
ff <0f> 0b 4c 89 e7 e8 30 e7 b0 ff 4c 89 e2 48 89 de 48 c7 c7 70 fb b5
[ 41.969461] RSP: 0018:ff4fe30cc0f83b30 EFLAGS: 00010246
[ 41.969464] RAX: 000000000000004e RBX: ff31ae696cdc86a0 RCX:
0000000000000027
[ 41.969466] RDX: 0000000000000000 RSI: 0000000000000001 RDI:
ff31ae8850321900
[ 41.969468] RBP: dead000000000100 R08: 0000000000000000 R09:
0000000000000000
[ 41.969469] R10: 7572726f63206c65 R11: 6c65645f7473696c R12:
dead000000000122
[ 41.969470] R13: ff31ae696cdc8662 R14: ff4fe30cc0f83cb8 R15:
00007f68b7f9a000
[ 41.969471] FS: 00007f68bd0396c0(0000) GS:ff31ae8850300000(0000)
knlGS:0000000000000000
[ 41.969473] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 41.969474] CR2: 00005577caaad68c CR3: 000000010407c003 CR4:
0000000000f71ef0
[ 41.969476] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[ 41.969477] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7:
0000000000000400
[ 41.969478] PKRU: 55555554
I fail to see why exactly right now, but am also quite tired. Might
take another look in the next few days.
Although I'm not convinced that my solution is bad either. It's
Nouveau, after all. On this ranch a cowboy has to defend himself with
the pitchfork instead of the colt at times.
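For readers less familiar with the list debugging code, the report above
is what a second list_del() on the same entry looks like: the first
deletion leaves poison pointers in the node, and the second one trips
over them. Below is a minimal user-space sketch of that mechanism, not
the kernel implementation. All names (struct list_node,
list_del_checked(), the poison_next/poison_prev sentinels,
double_delete_demo()) are made up for illustration, and the checked
delete returns false where the kernel would BUG().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's struct list_head. */
struct list_node {
	struct list_node *next, *prev;
};

/* Hypothetical sentinels playing the role of LIST_POISON1 / LIST_POISON2. */
static struct list_node poison_next, poison_prev;

static void list_init(struct list_node *head)
{
	head->next = head;
	head->prev = head;
}

static void list_add_tail(struct list_node *n, struct list_node *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

static bool list_is_empty(const struct list_node *head)
{
	return head->next == head;
}

/*
 * Unlink n and poison its pointers. Returns false (instead of crashing)
 * if n was already unlinked, i.e. if its pointers are still poisoned.
 */
static bool list_del_checked(struct list_node *n)
{
	if (n->next == &poison_next || n->prev == &poison_prev)
		return false;

	n->next->prev = n->prev;
	n->prev->next = n->next;
	n->next = &poison_next;
	n->prev = &poison_prev;
	return true;
}

/* Two parties delete the same entry; returns how many deletions succeeded. */
static int double_delete_demo(void)
{
	struct list_node pending, fence;
	int deleted = 0;

	list_init(&pending);
	list_add_tail(&fence, &pending);

	if (list_del_checked(&fence))	/* first party unlinks the entry */
		deleted++;
	if (list_del_checked(&fence))	/* second party finds the poison */
		deleted++;

	assert(list_is_empty(&pending));
	return deleted;
}
```

double_delete_demo() reports a single successful deletion: the first
caller unlinks the entry, the second one finds the poison. Which two
code paths play those roles in the crash above is not established in
this mail.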
[1] https://lore.kernel.org/all/20250407152239.34429-2-phasta@kernel.org/
P.
>
> Regards,
> Christian.
>
> >
> > P.
> >
> >
> >
> > > Regards,
> > > Christian.
> > >
> > > > Replace the call to dma_fence_is_signaled() with
> > > > nouveau_fence_base_is_signaled().
> > > >
> > > > Cc: <stable@vger.kernel.org> # 4.10+, precise commit not to be
> > > > determined
> > > > Signed-off-by: Philipp Stanner <phasta@kernel.org>
> > > > ---
> > > > drivers/gpu/drm/nouveau/nouveau_fence.c | 2 +-
> > > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > > >
> > > > > diff --git a/drivers/gpu/drm/nouveau/nouveau_fence.c b/drivers/gpu/drm/nouveau/nouveau_fence.c
> > > > > index 7cc84472cece..33535987d8ed 100644
> > > > > --- a/drivers/gpu/drm/nouveau/nouveau_fence.c
> > > > > +++ b/drivers/gpu/drm/nouveau/nouveau_fence.c
> > > > > @@ -274,7 +274,7 @@ nouveau_fence_done(struct nouveau_fence *fence)
> > > > nvif_event_block(&fctx->event);
> > > > spin_unlock_irqrestore(&fctx->lock, flags);
> > > > }
> > > > - return dma_fence_is_signaled(&fence->base);
> > > > > + return test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->base.flags);
> > > > }
> > > >
> > > > static long
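The difference between the two return statements in the hunk quoted
above can be modelled in a few lines of user-space C. This is a toy
model under stated assumptions, not driver code: struct toy_fence and
all toy_*() functions are hypothetical stand-ins, with one bool
replacing DMA_FENCE_FLAG_SIGNALED_BIT and another replacing membership
in the pending list.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy fence; none of these fields exist in the kernel. */
struct toy_fence {
	bool signaled;		/* stands in for DMA_FENCE_FLAG_SIGNALED_BIT */
	bool on_pending;	/* stands in for being linked into the pending list */
	bool hw_done;		/* what the driver's .signaled callback would report */
};

/*
 * Models dma_fence_is_signaled(): if the flag is not set yet but the
 * driver callback says the work is done, the fence gets signaled as a
 * side effect. The pending list is not touched.
 */
static bool toy_is_signaled(struct toy_fence *f)
{
	if (f->signaled)
		return true;
	if (f->hw_done) {
		f->signaled = true;
		return true;
	}
	return false;
}

/* Models test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, ...): a pure read. */
static bool toy_test_signaled(const struct toy_fence *f)
{
	return f->signaled;
}

/* Models the driver's own signaling path: signal and unlink together. */
static void toy_driver_signal(struct toy_fence *f)
{
	f->signaled = true;
	f->on_pending = false;
}

/* The rule described in the commit message: no signaled fence on the list. */
static bool toy_invariant_holds(const struct toy_fence *f)
{
	return !(f->signaled && f->on_pending);
}
```

Calling toy_is_signaled() on a fence with hw_done set leaves it signaled
and still on the list, which breaks the invariant; toy_test_signaled()
cannot do that. As noted earlier in the thread, this only covers the
driver's own call site: any external caller can still take the
toy_is_signaled() path.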
* Re: [PATCH 1/3] drm/nouveau: Prevent signaled fences in pending list
2025-04-10 15:36 ` Philipp Stanner
@ 2025-04-11 9:29 ` Philipp Stanner
[not found] ` <81a70ba6-94b1-4bb3-a0b2-9e8890f90b33@amd.com>
0 siblings, 1 reply; 23+ messages in thread
From: Philipp Stanner @ 2025-04-11 9:29 UTC (permalink / raw)
To: phasta, Christian König, Lyude Paul, Danilo Krummrich,
David Airlie, Simona Vetter, Sabrina Dubroca, Sumit Semwal
Cc: dri-devel, nouveau, linux-kernel, netdev, linux-media,
linaro-mm-sig, stable
On Thu, 2025-04-10 at 17:36 +0200, Philipp Stanner wrote:
> On Thu, 2025-04-10 at 15:16 +0200, Christian König wrote:
> > Am 10.04.25 um 15:09 schrieb Philipp Stanner:
> > > On Thu, 2025-04-10 at 14:58 +0200, Christian König wrote:
> > > > Am 10.04.25 um 11:24 schrieb Philipp Stanner:
> > > > > Nouveau currently relies on the assumption that dma_fences
> > > > > will
> > > > > only
> > > > > ever get signaled through nouveau_fence_signal(), which takes
> > > > > care
> > > > > of
> > > > > removing a signaled fence from the list
> > > > > nouveau_fence_chan.pending.
> > > > >
> > > > > This self-imposed rule is violated in nouveau_fence_done(),
> > > > > where
> > > > > dma_fence_is_signaled() (somewhat surprisingly, considering
> > > > > its
> > > > > name)
> > > > > can signal the fence without removing it from the list. This
> > > > > enables
> > > > > accesses to already signaled fences through the list, which
> > > > > is
> > > > > a
> > > > > bug.
> > > > >
> > > > > In particular, it can race with nouveau_fence_context_kill(),
> > > > > which
> > > > > would then attempt to set an error code on an already
> > > > > signaled
> > > > > fence,
> > > > > which is illegal.
> > > > >
> > > > > In nouveau_fence_done(), the call to nouveau_fence_update()
> > > > > already
> > > > > ensures to signal all ready fences. Thus, the signaling
> > > > > potentially
> > > > > performed by dma_fence_is_signaled() is actually not
> > > > > necessary.
> > > > Ah, I now got what you are trying to do here! But that won't
> > > > help.
> > > >
> > > > The problem is it is perfectly valid for somebody external
> > > > (e.g.
> > > > other driver, TTM etc...) to call dma_fence_is_signaled() on a
> > > > nouveau fence.
> > > >
> > > > This will then in turn still signal the fence and leave it on
> > > > the
> > > > pending list, creating the problem you have.
> > > Good to hear – precisely that then is the use case for a
> > > dma_fence
> > > callback! ^_^ It guarantees that, no matter who signals a fence,
> > > no
> > > matter at what place, a certain action will always be performed.
> > >
> > > I can't think of any other mechanism which could guarantee that a
> > > signaled fence immediately gets removed from nouveau's pending
> > > list,
> > > other than the callbacks.
> > >
> > > But seriously, I don't think that anyone does this currently, nor
> > > do I
> > > think that anyone could get away with doing it without the entire
> > > computer burning down.
> >
> > Yeah, I don't think that this is possible at the moment.
> >
> > When you do stuff like that from the provider side you will always
> > run into lifetime issues because in the signaling from interrupt
> > case
> > you then drop the last reference before the signaling is completed.
> >
> > How about the attached (not even compile tested) patch? I think it
> > should fix the issue.
>
> This patch looked correct enough for me to try it out on top of my
> memleak fix series [1] (which seems to reveal all those problems
> through races appearing due to the removal of the waitqueue in
> nouveau_sched_fini()).
>
> The code looked correct to me, but it still makes boom-boom, again
> because two parties get their fingers onto list_del():
>
> [paste in case my editor explodes again:
> https://paste.debian.net/1368705/ ]
>
> [ 41.681698] list_del corruption, ff31ae696cdc86a0->next is
> LIST_POISON1 (dead000000000100)
> [ 41.681720] ------------[ cut here ]------------
> [ 41.681722] kernel BUG at lib/list_debug.c:56!
> [ 41.681729] Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
> [ 41.681732] CPU: 22 UID: 42 PID: 1733 Comm: gnome-shell Not
> tainted
> 6.14.0-rc4+ #11
> [ 41.681735] Hardware name: Dell Inc. Precision 7960 Tower/01G0M6,
> BIOS 2.7.0 12/17/2024
> [ 41.681737] RIP: 0010:__list_del_entry_valid_or_report+0x76/0xf0
> [ 41.681743] Code: 75 66 5b b8 01 00 00 00 5d 41 5c c3 cc cc cc cc
> 48
> 89 ef e8 4c e7 b0 ff 48 89 ea 48 89 de 48 c7 c7 38 fb b5 a0 e8 3a 6d
> 6b
> ff <0f> 0b 4c 89 e7 e8 30 e7 b0 ff 4c 89 e2 48 89 de 48 c7 c7 70 fb
> b5
> [ 41.681745] RSP: 0018:ff4fe30cc0f83b30 EFLAGS: 00010246
> [ 41.681748] RAX: 000000000000004e RBX: ff31ae696cdc86a0 RCX:
> 0000000000000027
> [ 41.681749] RDX: 0000000000000000 RSI: 0000000000000001 RDI:
> ff31ae8850321900
> [ 41.681751] RBP: dead000000000100 R08: 0000000000000000 R09:
> 0000000000000000
> [ 41.681752] R10: 7572726f63206c65 R11: 6c65645f7473696c R12:
> dead000000000122
> [ 41.681753] R13: ff31ae696cdc8662 R14: ff4fe30cc0f83cb8 R15:
> 00007f68b7f9a000
> [ 41.681754] FS: 00007f68bd0396c0(0000) GS:ff31ae8850300000(0000)
> knlGS:0000000000000000
> [ 41.681756] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 41.681757] CR2: 00005577caaad68c CR3: 000000010407c003 CR4:
> 0000000000f71ef0
> [ 41.681758] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [ 41.681759] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7:
> 0000000000000400
> [ 41.681760] PKRU: 55555554
> [ 41.681761] Call Trace:
> [ 41.681763] <TASK>
> [ 41.681764] ? __die_body.cold+0x19/0x27
> [ 41.681768] ? die+0x2e/0x50
> [ 41.681772] ? do_trap+0xca/0x110
> [ 41.681775] ? do_error_trap+0x6a/0x90
> [ 41.681776] ? __list_del_entry_valid_or_report+0x76/0xf0
> [ 41.681778] ? exc_invalid_op+0x50/0x70
> [ 41.681781] ? __list_del_entry_valid_or_report+0x76/0xf0
> [ 41.681782] ? asm_exc_invalid_op+0x1a/0x20
> [ 41.681788] ? __list_del_entry_valid_or_report+0x76/0xf0
> [ 41.681789] nouveau_fence_is_signaled+0x47/0xc0 [nouveau]
> [ 41.681961] dma_resv_iter_walk_unlocked.part.0+0xbd/0x170
> [ 41.681966] dma_resv_test_signaled+0x53/0x100
> [ 41.681969] ttm_bo_release+0x12d/0x2f0 [ttm]
> [ 41.681979] nouveau_gem_object_del+0x54/0x80 [nouveau]
> [ 41.682090] ttm_bo_vm_close+0x41/0x60 [ttm]
> [ 41.682097] remove_vma+0x2c/0x70
> [ 41.682100] vms_complete_munmap_vmas+0xd8/0x180
> [ 41.682102] do_vmi_align_munmap+0x1d7/0x250
> [ 41.682106] do_vmi_munmap+0xd0/0x170
> [ 41.682109] __vm_munmap+0xb1/0x180
> [ 41.682112] __x64_sys_munmap+0x1b/0x30
> [ 41.682115] do_syscall_64+0x82/0x160
> [ 41.682117] ? do_user_addr_fault+0x55a/0x7b0
> [ 41.682121] ? exc_page_fault+0x7e/0x1a0
> [ 41.682124] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [ 41.682127] RIP: 0033:0x7f68cceff02b
> [ 41.682130] Code: 73 01 c3 48 8b 0d e5 6d 0f 00 f7 d8 64 89 01 48
> 83
> c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 0b 00 00 00
> 0f
> 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d b5 6d 0f 00 f7 d8 64 89 01
> 48
> [ 41.682131] RSP: 002b:00007ffed8d00c08 EFLAGS: 00000206 ORIG_RAX:
> 000000000000000b
> [ 41.682134] RAX: ffffffffffffffda RBX: 00005577ca99ef50 RCX:
> 00007f68cceff02b
> [ 41.682135] RDX: 0000000000000000 RSI: 0000000000001000 RDI:
> 00007f68b7f9a000
> [ 41.682136] RBP: 00007ffed8d00c50 R08: 00005577cacc4160 R09:
> 00005577caccf930
> [ 41.682137] R10: 000199999996d999 R11: 0000000000000206 R12:
> 0000000000000000
> [ 41.682138] R13: 00007ffed8d00c60 R14: 00005577caf6c550 R15:
> 0000000000000035
> [ 41.682141] </TASK>
> [ 41.682141] Modules linked in: nf_conntrack_netbios_ns
> nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib
> nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct
> nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4
> rfkill
> ip_set nf_tables qrtr sunrpc snd_sof_pci_intel_tgl
> snd_sof_pci_intel_cnl snd_sof_intel_hda_generic snd_sof_pci
> snd_sof_xtensa_dsp snd_sof_intel_hda_common snd_soc_hdac_hda
> snd_sof_intel_hda snd_sof snd_sof_utils snd_soc_acpi_intel_match
> snd_soc_acpi snd_soc_acpi_intel_sdca_quirks snd_sof_intel_hda_mlink
> snd_soc_sdca snd_soc_avs snd_ctl_led intel_rapl_msr snd_soc_hda_codec
> snd_hda_ext_core intel_rapl_common snd_hda_codec_realtek snd_soc_core
> intel_uncore_frequency snd_hda_codec_generic
> intel_uncore_frequency_common intel_ifs snd_hda_scodec_component
> snd_hda_codec_hdmi i10nm_edac snd_compress skx_edac_common
> binfmt_misc
> nfit snd_hda_intel snd_intel_dspcfg snd_hda_codec libnvdimm snd_hwdep
> snd_hda_core snd_seq snd_seq_device x86_pkg_temp_thermal dell_pc
> dell_wmi
> [ 41.682195] dax_hmem platform_profile intel_powerclamp
> sparse_keymap cxl_acpi snd_pcm cxl_port coretemp iTCO_wdt cxl_core
> spi_nor intel_pmc_bxt dell_wmi_sysman rapl pmt_telemetry dell_smbios
> iTCO_vendor_support pmt_class intel_cstate snd_timer vfat dcdbas
> isst_if_mmio mtd dell_smm_hwmon dell_wmi_ddv dell_wmi_descriptor
> intel_uncore firmware_attributes_class wmi_bmof atlantic fat einj
> pcspkr isst_if_mbox_pci snd isst_if_common intel_vsec i2c_i801 mei_me
> e1000e spi_intel_pci macsec soundcore i2c_smbus spi_intel mei joydev
> loop nfnetlink zram nouveau drm_ttm_helper ttm iaa_crypto
> polyval_clmulni rtsx_pci_sdmmc polyval_generic mmc_core gpu_sched
> ghash_clmulni_intel i2c_algo_bit nvme sha512_ssse3 drm_gpuvm drm_exec
> sha256_ssse3 idxd nvme_core sha1_ssse3 drm_display_helper rtsx_pci
> cec
> nvme_auth idxd_bus pinctrl_alderlake ip6_tables ip_tables fuse
> [ 41.682269] ---[ end trace 0000000000000000 ]---
> [ 41.969442] RIP: 0010:__list_del_entry_valid_or_report+0x76/0xf0
> [ 41.969458] Code: 75 66 5b b8 01 00 00 00 5d 41 5c c3 cc cc cc cc
> 48
> 89 ef e8 4c e7 b0 ff 48 89 ea 48 89 de 48 c7 c7 38 fb b5 a0 e8 3a 6d
> 6b
> ff <0f> 0b 4c 89 e7 e8 30 e7 b0 ff 4c 89 e2 48 89 de 48 c7 c7 70 fb
> b5
> [ 41.969461] RSP: 0018:ff4fe30cc0f83b30 EFLAGS: 00010246
> [ 41.969464] RAX: 000000000000004e RBX: ff31ae696cdc86a0 RCX:
> 0000000000000027
> [ 41.969466] RDX: 0000000000000000 RSI: 0000000000000001 RDI:
> ff31ae8850321900
> [ 41.969468] RBP: dead000000000100 R08: 0000000000000000 R09:
> 0000000000000000
> [ 41.969469] R10: 7572726f63206c65 R11: 6c65645f7473696c R12:
> dead000000000122
> [ 41.969470] R13: ff31ae696cdc8662 R14: ff4fe30cc0f83cb8 R15:
> 00007f68b7f9a000
> [ 41.969471] FS: 00007f68bd0396c0(0000) GS:ff31ae8850300000(0000)
> knlGS:0000000000000000
> [ 41.969473] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 41.969474] CR2: 00005577caaad68c CR3: 000000010407c003 CR4:
> 0000000000f71ef0
> [ 41.969476] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [ 41.969477] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7:
> 0000000000000400
> [ 41.969478] PKRU: 55555554
>
>
> I fail to see why exactly right now, but am also quite tired. Might
> take another look the next days.
>
> Although I'm not convinced that my solution is bad either. It's
> Nouveau, after all. On this ranch a cowboy has to defend himself with
> the pitchfork instead of the colt at times.
>
>
> [1]
> https://lore.kernel.org/all/20250407152239.34429-2-phasta@kernel.org/
>
I think I see the issue now. Let's look at your code, Christian:

/*
 * In an ideal world, read would not assume the channel context is still alive.
 * This function may be called from another device, running into free memory as a
 * result. The drm node should still be there, so we can derive the index from
 * the fence context.
 */
static bool nouveau_fence_is_signaled(struct dma_fence *f)
{
	struct nouveau_fence *fence = from_fence(f);
	struct nouveau_fence_chan *fctx = nouveau_fctx(fence);
	struct nouveau_channel *chan;
	bool ret = false;

	rcu_read_lock();
	chan = rcu_dereference(fence->channel);
	if (chan)
		ret = (int)(fctx->read(chan) - fence->base.seqno) >= 0;
	rcu_read_unlock();

	if (ret) {
		/*
		 * caller should have a reference on the fence,
		 * else fence could get freed here
		 */
		WARN_ON(kref_read(&fence->base.refcount) <= 1);

		list_del(&fence->head);
		dma_fence_put(&fence->base);
	}

	return ret;
}

[snip]

static const struct dma_fence_ops nouveau_fence_ops_uevent = {
	.get_driver_name = nouveau_fence_get_get_driver_name,
	.get_timeline_name = nouveau_fence_get_timeline_name,
	.enable_signaling = nouveau_fence_enable_signaling,
	.signaled = nouveau_fence_is_signaled,
	.release = nouveau_fence_release
};

So nouveau_fence_done() will run into nouveau_fence_is_signaled().
This will remove the list entry without any locking, because
dma_fence_is_signaled() expects its callback to take the lock itself:

bool
nouveau_fence_done(struct nouveau_fence *fence)
{
	if (fence->base.ops == &nouveau_fence_ops_legacy ||
	    fence->base.ops == &nouveau_fence_ops_uevent) {
		struct nouveau_fence_chan *fctx = nouveau_fctx(fence);
		struct nouveau_channel *chan;
		unsigned long flags;

		if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->base.flags))
			return true;

		spin_lock_irqsave(&fctx->lock, flags);
		chan = rcu_dereference_protected(fence->channel,
						 lockdep_is_held(&fctx->lock));
		if (chan && nouveau_fence_update(chan, fctx))
			nvif_event_block(&fctx->event);
		spin_unlock_irqrestore(&fctx->lock, flags);
	}
	return dma_fence_is_signaled(&fence->base);
}

It could be, however, that at the same moment nouveau_fence_signal() is
removing that entry, holding the appropriate lock.
So we have a race. Again.
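To make the failure mode concrete, here is a minimal userspace sketch of what happens when two parties both unlink the same entry. It is not nouveau or kernel code: `struct node`, `checked_list_del()` and `second_remover_corrupts()` are hypothetical names, the poison constants are illustrative stand-ins for the kernel's list poison values, and the two removals run sequentially, so it models the outcome of the race rather than its timing.

```c
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-ins for the kernel's list poison values (illustrative only). */
#define POISON_NEXT ((struct node *)0x100)
#define POISON_PREV ((struct node *)0x122)

struct node {
	struct node *next, *prev;
};

static void list_init(struct node *head)
{
	head->next = head;
	head->prev = head;
}

static void list_add_tail(struct node *n, struct node *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/*
 * Unlink n and poison its pointers. Returns false instead of crashing when
 * the entry was already unlinked, which is the condition that list debugging
 * reports as "list_del corruption".
 */
static bool checked_list_del(struct node *n)
{
	if (n->next == POISON_NEXT || n->prev == POISON_PREV)
		return false;
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->next = POISON_NEXT;
	n->prev = POISON_PREV;
	return true;
}

/* Hypothetical helper: two code paths that both believe they own the removal. */
static bool second_remover_corrupts(void)
{
	struct node pending, fence;

	list_init(&pending);
	list_add_tail(&fence, &pending);

	bool first = checked_list_del(&fence);  /* e.g. the locked signal path */
	bool second = checked_list_del(&fence); /* e.g. the unlocked fast path */

	return first && !second;
}
```

The first removal succeeds and the second one finds poisoned pointers, which matches the "->next is LIST_POISON1" report in the trace above. Whichever remedy is chosen, the sketch suggests that exactly one party may own the removal, or that both must serialize on the same lock.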

You see, fixing things in Nouveau is difficult :)
It gets more difficult if you want to clean it up "properly", so that it
conforms to rules such as those from dma_fence.

I have now provided two fixes that both work, but that you are not
satisfied with from the dma_fence maintainer's perspective. I understand
that, but please also understand that working on Nouveau is actually not
my primary task. I just have to fix this bug to move on with my
scheduler work.

So if you have another idea, feel free to share it. But I'd like to
know how we can go on here.

I'm running out of ideas. What I'm wondering is whether we couldn't just
remove hacky performance fast-path functions such as
nouveau_fence_is_signaled() completely. It seems redundant to me.

Or we might add locking to it, but I don't know what was achieved with
RCU here.

In any case it's definitely bad that Nouveau has so many redundant and
half-redundant mechanisms.

P.
>
>
> P.
>
> >
> > Regards,
> > Christian.
> >
> > >
> > > P.
> > >
> > >
> > >
> > > > Regards,
> > > > Christian.
> > > >
> > > > > Replace the call to dma_fence_is_signaled() with
> > > > > nouveau_fence_base_is_signaled().
> > > > >
> > > > > Cc: <stable@vger.kernel.org> # 4.10+, precise commit not to
> > > > > be
> > > > > determined
> > > > > Signed-off-by: Philipp Stanner <phasta@kernel.org>
> > > > > ---
> > > > > drivers/gpu/drm/nouveau/nouveau_fence.c | 2 +-
> > > > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/drivers/gpu/drm/nouveau/nouveau_fence.c
> > > > > b/drivers/gpu/drm/nouveau/nouveau_fence.c
> > > > > index 7cc84472cece..33535987d8ed 100644
> > > > > --- a/drivers/gpu/drm/nouveau/nouveau_fence.c
> > > > > +++ b/drivers/gpu/drm/nouveau/nouveau_fence.c
> > > > > @@ -274,7 +274,7 @@ nouveau_fence_done(struct nouveau_fence
> > > > > *fence)
> > > > > nvif_event_block(&fctx->event);
> > > > > spin_unlock_irqrestore(&fctx->lock, flags);
> > > > > }
> > > > > - return dma_fence_is_signaled(&fence->base);
> > > > > + return test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence-
> > > > > > base.flags);
> > > > > }
> > > > >
> > > > > static long
>
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 1/3] drm/nouveau: Prevent signaled fences in pending list
[not found] ` <81a70ba6-94b1-4bb3-a0b2-9e8890f90b33@amd.com>
@ 2025-04-11 12:44 ` Philipp Stanner
2025-04-11 13:06 ` Christian König
0 siblings, 1 reply; 23+ messages in thread
From: Philipp Stanner @ 2025-04-11 12:44 UTC (permalink / raw)
To: Christian König, phasta, Lyude Paul, Danilo Krummrich,
David Airlie, Simona Vetter, Sabrina Dubroca, Sumit Semwal
Cc: dri-devel, nouveau, linux-kernel, netdev, linux-media,
linaro-mm-sig, stable
On Fri, 2025-04-11 at 13:05 +0200, Christian König wrote:
> Am 11.04.25 um 11:29 schrieb Philipp Stanner:
>
> > [SNIP]
> >
> > It could be, however, that at the same moment
> > nouveau_fence_signal() is
> > removing that entry, holding the appropriate lock.
> >
> > So we have a race. Again.
> >
>
> Ah, yes of course. If signaled is called with or without the lock is
> actually undetermined.
>
>
> >
> > You see, fixing things in Nouveau is difficult :)
> > It gets more difficult if you want to clean it up "properly", so it
> > conforms to rules such as those from dma_fence.
> >
> > I have now provided two fixes that both work, but you are not
> > satisfied
> > with from the dma_fence-maintainer's perspective. I understand
> > that,
> > but please also understand that it's actually not my primary task
> > to
> > work on Nouveau. I just have to fix this bug to move on with my
> > scheduler work.
> >
>
> Well I'm happy with whatever solution as long as it works, but as
> far as I can see the approach with the callback simply doesn't.
>
> You just can't drop the fence reference for the list from the
> callback.
>
>
> >
> > So if you have another idea, feel free to share it. But I'd like to
> > know how we can go on here.
> >
>
> Well the fence code actually works, doesn't it? The problem is
> rather that setting the error throws a warning because it doesn't
> expect signaled fences on the pending list.
>
> Maybe we should fix that instead.
The fence code works as the author intended, but I would be happy if it
were more explicitly documented.

Regarding the WARN_ON: It occurs in dma_fence_set_error() because there
is an attempt to set an error code on a signaled fence. I don't think
that should be "fixed"; it works as intended: You must not set an error
code on a fence that has already signaled.

The reason seems to be that once a fence is signaled, a third party
might evaluate the error code.

But I think this wasn't what you meant by "fix".

In any case, there must not be signaled fences in nouveau's pending
list. They must be removed immediately once they signal, and this must
not race.
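As a rough illustration of that rule, the following userspace sketch models a fence whose error code may only be set before it signals. `struct model_fence` and the `model_*()` helpers are hypothetical and much simpler than the real dma_fence API; the boolean return value stands in for the WARN_ON.

```c
#include <stdbool.h>

/* Hypothetical, simplified fence: not the kernel's struct dma_fence. */
struct model_fence {
	bool signaled;
	int error; /* 0 or a negative errno-style code */
};

/* Returns false (the "WARN" case) when the fence has already signaled. */
static bool model_set_error(struct model_fence *f, int error)
{
	if (f->signaled)
		return false;
	f->error = error;
	return true;
}

static void model_signal(struct model_fence *f)
{
	f->signaled = true;
}

/* A waiter samples the status once the fence has signaled, and never again. */
static int model_wait_status(const struct model_fence *f)
{
	return f->signaled ? f->error : 1; /* 1: still pending */
}
```

A waiter that has already read the status would never see an error written afterwards, which is presumably why the late write is rejected rather than silently accepted.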
>
>
> >
> > I'm running out of ideas. What I'm wondering if we couldn't just
> > remove
> > performance hacky fastpath functions such as
> > nouveau_fence_is_signaled() completely. It seems redundant to me.
> >
>
> That would work for me as well.
I'll test this approach. Seems a bit like the nuclear approach, but if
it works we'd at least clean up a lot of this mess.
P.
>
>
> >
> >
> > Or we might add locking to it, but IDK what was achieved with RCU
> > here.
> > In any case it's definitely bad that Nouveau has so many redundant
> > and
> > half-redundant mechanisms.
> >
>
> Yeah, agree messing with the locks even more won't help us here.
>
> Regards,
> Christian.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 1/3] drm/nouveau: Prevent signaled fences in pending list
2025-04-11 12:44 ` Philipp Stanner
@ 2025-04-11 13:06 ` Christian König
2025-04-11 14:10 ` Philipp Stanner
0 siblings, 1 reply; 23+ messages in thread
From: Christian König @ 2025-04-11 13:06 UTC (permalink / raw)
To: phasta, Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
Sabrina Dubroca, Sumit Semwal
Cc: dri-devel, nouveau, linux-kernel, netdev, linux-media,
linaro-mm-sig, stable
Am 11.04.25 um 14:44 schrieb Philipp Stanner:
> On Fri, 2025-04-11 at 13:05 +0200, Christian König wrote:
>> Am 11.04.25 um 11:29 schrieb Philipp Stanner:
>>
>>> [SNIP]
>>>
>>> It could be, however, that at the same moment
>>> nouveau_fence_signal() is
>>> removing that entry, holding the appropriate lock.
>>>
>>> So we have a race. Again.
>>>
>>
>> Ah, yes of course. If signaled is called with or without the lock is
>> actually undetermined.
>>
>>
>>>
>>> You see, fixing things in Nouveau is difficult :)
>>> It gets more difficult if you want to clean it up "properly", so it
>>> conforms to rules such as those from dma_fence.
>>>
>>> I have now provided two fixes that both work, but you are not
>>> satisfied
>>> with from the dma_fence-maintainer's perspective. I understand
>>> that,
>>> but please also understand that it's actually not my primary task
>>> to
>>> work on Nouveau. I just have to fix this bug to move on with my
>>> scheduler work.
>>>
>>
>> Well I'm happy with whatever solution as long as it works, but as
>> far as I can see the approach with the callback simply doesn't.
>>
>> You just can't drop the fence reference for the list from the
>> callback.
>>
>>
>>>
>>> So if you have another idea, feel free to share it. But I'd like to
>>> know how we can go on here.
>>>
>>
>> Well the fence code actually works, doesn't it? The problem is
>> rather that setting the error throws a warning because it doesn't
>> expect signaled fences on the pending list.
>>
>> Maybe we should fix that instead.
> The fence code works as the author intended, but I would be happy if it
> were more explicitly documented.
>
> Regarding the WARN_ON: It occurs in dma_fence_set_error() because there
> is an attempt to set an error code on a signaled fence. I don't think
> that should be "fixed", it works as intended: You must not set an error
> code of a fence that was already signaled.
>
> The reason seems to be that once a fence is signaled, a third party
> might evaluate the error code.
Yeah, more or less correct. The idea is that you can't declare an operation as having failed after the operation has already completed.

That is because everyone just waits for the completion and nobody checks the status again after that.
>
> But I think this wasn't wat you meant with "fix".
The idea was to avoid calling dma_fence_set_error() on already signaled fences. Something like this:

@@ -90,7 +90,7 @@ nouveau_fence_context_kill(struct nouveau_fence_chan *fctx, int error)
 	while (!list_empty(&fctx->pending)) {
 		fence = list_entry(fctx->pending.next, typeof(*fence), head);

-		if (error)
+		if (error && !dma_fence_is_signaled_locked(&fence->base))
 			dma_fence_set_error(&fence->base, error);

 		if (nouveau_fence_signal(fence))

That would also improve the handling quite a bit, since we would no longer set errors on fences which have already completed, even if we haven't yet noticed that they completed.

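The guard in that snippet is meant as a logical AND. Below is a small userspace sketch of the same idea, with hypothetical names (`struct model_fence`, `should_set_error()`, `model_context_kill()`) and a plain array instead of the pending list. It also shows why a bitwise `&` between an errno-style value and a 0/1 value would behave differently: it only tests the lowest bit of the error code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified fence: not the kernel's struct dma_fence. */
struct model_fence {
	bool signaled;
	int error;
};

/* Logical AND: true for any non-zero error on an unsignaled fence. */
static bool should_set_error(int error, bool signaled)
{
	return error && !signaled;
}

/* Bitwise AND against a 0/1 value only tests the lowest bit of error. */
static bool should_set_error_bitwise(int error, bool signaled)
{
	return (error & (int)!signaled) != 0;
}

/*
 * Model of a kill loop: set the error only on fences that have not signaled
 * yet, then signal everything. Returns how many fences received the error.
 */
static size_t model_context_kill(struct model_fence *fences, size_t n, int error)
{
	size_t flagged = 0;

	for (size_t i = 0; i < n; i++) {
		if (should_set_error(error, fences[i].signaled)) {
			fences[i].error = error;
			flagged++;
		}
		fences[i].signaled = true;
	}
	return flagged;
}
```

With an even error value such as -12 the bitwise variant never sets the error, while with an odd value such as -19 it happens to work, so the logical form is the safer reading of the intent.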
> In any case, there must not be signaled fences in nouveau's pending-
> list. They must be removed immediately once they signal, and this must
> not race.
Why, actually? As far as I can see, the pending list is not for the unsignaled fences, but rather for the pending interrupt processing.

So having signaled fences on the pending list is perfectly possible.

Regards,
Christian.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 1/3] drm/nouveau: Prevent signaled fences in pending list
2025-04-11 13:06 ` Christian König
@ 2025-04-11 14:10 ` Philipp Stanner
2025-04-14 8:54 ` Philipp Stanner
0 siblings, 1 reply; 23+ messages in thread
From: Philipp Stanner @ 2025-04-11 14:10 UTC (permalink / raw)
To: Christian König, phasta, Lyude Paul, Danilo Krummrich,
David Airlie, Simona Vetter, Sabrina Dubroca, Sumit Semwal
Cc: dri-devel, nouveau, linux-kernel, netdev, linux-media,
linaro-mm-sig, stable
On Fri, 2025-04-11 at 15:06 +0200, Christian König wrote:
> Am 11.04.25 um 14:44 schrieb Philipp Stanner:
> > On Fri, 2025-04-11 at 13:05 +0200, Christian König wrote:
> > > Am 11.04.25 um 11:29 schrieb Philipp Stanner:
> > >
> > > > [SNIP]
> > > >
> > > > It could be, however, that at the same moment
> > > > nouveau_fence_signal() is
> > > > removing that entry, holding the appropriate lock.
> > > >
> > > > So we have a race. Again.
> > > >
> > >
> > > Ah, yes of course. If signaled is called with or without the
> > > lock is
> > > actually undetermined.
> > >
> > >
> > > >
> > > > You see, fixing things in Nouveau is difficult :)
> > > > It gets more difficult if you want to clean it up "properly",
> > > > so it
> > > > conforms to rules such as those from dma_fence.
> > > >
> > > > I have now provided two fixes that both work, but you are not
> > > > satisfied
> > > > with from the dma_fence-maintainer's perspective. I understand
> > > > that,
> > > > but please also understand that it's actually not my primary
> > > > task
> > > > to
> > > > work on Nouveau. I just have to fix this bug to move on with my
> > > > scheduler work.
> > > >
> > >
> > > Well I'm happy with whatever solution as long as it works, but
> > > as
> > > far as I can see the approach with the callback simply doesn't.
> > >
> > > You just can't drop the fence reference for the list from the
> > > callback.
> > >
> > >
> > > >
> > > > So if you have another idea, feel free to share it. But I'd
> > > > like to
> > > > know how we can go on here.
> > > >
> > >
> > > Well the fence code actually works, doesn't it? The problem is
> > > rather that setting the error throws a warning because it doesn't
> > > expect signaled fences on the pending list.
> > >
> > > Maybe we should fix that instead.
> > The fence code works as the author intended, but I would be happy
> > if it
> > were more explicitly documented.
> >
> > Regarding the WARN_ON: It occurs in dma_fence_set_error() because
> > there
> > is an attempt to set an error code on a signaled fence. I don't
> > think
> > that should be "fixed", it works as intended: You must not set an
> > error
> > code of a fence that was already signaled.
> >
> > The reason seems to be that once a fence is signaled, a third party
> > might evaluate the error code.
>
> Yeah, more or less correct. The idea is you can't declare an
> operation as having an error after the operation has already
> completed.
>
> Because everyone will just wait for the completion and nobody checks
> the status again after that.
>
> >
> > But I think this wasn't wat you meant with "fix".
>
> The idea was to avoid calling dma_fence_set_error() on already
> signaled fences. Something like this:
>
> @@ -90,7 +90,7 @@ nouveau_fence_context_kill(struct
> nouveau_fence_chan *fctx, int error)
> while (!list_empty(&fctx->pending)) {
> fence = list_entry(fctx->pending.next,
> typeof(*fence), head);
>
> - if (error)
> + if (error & !dma_fence_is_signaled_locked(&fence-
> >base))
> dma_fence_set_error(&fence->base, error);
>
> if (nouveau_fence_signal(fence))
>
> That would also improve the handling quite a bit since we now don't
> set errors on fences which are already completed even if we haven't
> realized that they are already completed yet.
>
> > In any case, there must not be signaled fences in nouveau's
> > pending-
> > list. They must be removed immediately once they signal, and this
> > must
> > not race.
>
> Why actually? As far as I can see the pending list is not for the
> unsignaled fences, but rather the pending interrupt processing.
That's a list of fences that are "in the air", i.e., whose jobs are
currently being processed by the hardware. Once a job is done, its
fence must be removed.
>
> So having signaled fences on the pending list is perfectly possible.
It is possible, and that is a bug. The list is used by
nouveau_fence_context_kill() to kill still pending jobs. It shall not
try to kill and set error codes for fences that are already signaled.

Anyway, forget about the "remove callbacks" solution: it actually
causes a MASSIVE performance regression. No idea why; as far as I can
see, the fast path is only ever evaluated in nouveau_fence_done(), but
maybe I missed something.

I'll take another pass at this next week…

P.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 1/3] drm/nouveau: Prevent signaled fences in pending list
2025-04-11 14:10 ` Philipp Stanner
@ 2025-04-14 8:54 ` Philipp Stanner
2025-04-14 14:27 ` Danilo Krummrich
0 siblings, 1 reply; 23+ messages in thread
From: Philipp Stanner @ 2025-04-14 8:54 UTC (permalink / raw)
To: phasta, Christian König, Lyude Paul, Danilo Krummrich,
David Airlie, Simona Vetter, Sabrina Dubroca, Sumit Semwal
Cc: dri-devel, nouveau, linux-kernel, netdev, linux-media,
linaro-mm-sig, stable
On Fri, 2025-04-11 at 16:10 +0200, Philipp Stanner wrote:
> On Fri, 2025-04-11 at 15:06 +0200, Christian König wrote:
> > Am 11.04.25 um 14:44 schrieb Philipp Stanner:
> > > On Fri, 2025-04-11 at 13:05 +0200, Christian König wrote:
> > > > Am 11.04.25 um 11:29 schrieb Philipp Stanner:
> > > >
> > > > > [SNIP]
> > > > >
> > > > > It could be, however, that at the same moment
> > > > > nouveau_fence_signal() is
> > > > > removing that entry, holding the appropriate lock.
> > > > >
> > > > > So we have a race. Again.
> > > > >
> > > >
> > > > Ah, yes of course. If signaled is called with or without the
> > > > lock is
> > > > actually undetermined.
> > > >
> > > >
> > > > >
> > > > > You see, fixing things in Nouveau is difficult :)
> > > > > It gets more difficult if you want to clean it up "properly",
> > > > > so it
> > > > > conforms to rules such as those from dma_fence.
> > > > >
> > > > > I have now provided two fixes that both work, but you are not
> > > > > satisfied
> > > > > with from the dma_fence-maintainer's perspective. I
> > > > > understand
> > > > > that,
> > > > > but please also understand that it's actually not my primary
> > > > > task
> > > > > to
> > > > > work on Nouveau. I just have to fix this bug to move on with
> > > > > my
> > > > > scheduler work.
> > > > >
> > > >
> > > > Well I'm happy with whatever solution as long as it works, but
> > > > as
> > > > far as I can see the approach with the callback simply doesn't.
> > > >
> > > > You just can't drop the fence reference for the list from the
> > > > callback.
> > > >
> > > >
> > > > >
> > > > > So if you have another idea, feel free to share it. But I'd
> > > > > like to
> > > > > know how we can go on here.
> > > > >
> > > >
> > > > Well the fence code actually works, doesn't it? The problem is
> > > > rather that setting the error throws a warning because it
> > > > doesn't
> > > > expect signaled fences on the pending list.
> > > >
> > > > Maybe we should fix that instead.
> > > The fence code works as the author intended, but I would be happy
> > > if it
> > > were more explicitly documented.
> > >
> > > Regarding the WARN_ON: It occurs in dma_fence_set_error() because
> > > there
> > > is an attempt to set an error code on a signaled fence. I don't
> > > think
> > > that should be "fixed", it works as intended: You must not set an
> > > error
> > > code of a fence that was already signaled.
> > >
> > > The reason seems to be that once a fence is signaled, a third
> > > party
> > > might evaluate the error code.
> >
> > Yeah, more or less correct. The idea is you can't declare an
> > operation as having an error after the operation has already
> > completed.
> >
> > Because everyone will just wait for the completion and nobody
> > checks
> > the status again after that.
> >
> > >
> > > But I think this wasn't what you meant by "fix".
> >
> > The idea was to avoid calling dma_fence_set_error() on already
> > signaled fences. Something like this:
> >
> > @@ -90,7 +90,7 @@ nouveau_fence_context_kill(struct nouveau_fence_chan *fctx, int error)
> >          while (!list_empty(&fctx->pending)) {
> >                  fence = list_entry(fctx->pending.next, typeof(*fence), head);
> >
> > -                if (error)
> > +                if (error && !dma_fence_is_signaled_locked(&fence->base))
> >                          dma_fence_set_error(&fence->base, error);
> >
> >                  if (nouveau_fence_signal(fence))
> >
> > That would also improve the handling quite a bit since we now don't
> > set errors on fences which are already completed even if we haven't
> > realized that they are already completed yet.
> >
> > > In any case, there must not be signaled fences in nouveau's
> > > pending-
> > > list. They must be removed immediately once they signal, and this
> > > must
> > > not race.
> >
> > Why actually? As far as I can see the pending list is not for the
> > unsignaled fences, but rather the pending interrupt processing.
>
> That's a list of fences that are "in the air", i.e., whose jobs are
> currently being processed by the hardware. Once a job is done, its
> fence must be removed.
>
> >
> > So having signaled fences on the pending list is perfectly
> > possible.
>
> It is possible, and that is a bug. The list is used by
> nouveau_fence_context_kill() to kill still pending jobs. It shall not
> try to kill and set error codes for fences that are already signaled.
@Danilo:
We now have 2 possible solutions floating around for the firing WARN_ON.
Version A (Christian)
Check in nouveau_fence_context_kill() whether a fence is already
signaled before setting an error.
Version B (Me)
This patch series here. Make sure that in Nouveau, only
nouveau_fence_signal() signals fences.
Both should do the trick. Please share a maintainer-preference so I can
move on here.
Thx
P.
>
>
>
> Anyways, forget about the "remove callbacks solution" it actually
> causes a MASSIVE performance regression. No idea why, AFAICS the fast
> path is only ever evaluated in nouveau_fence_done(), but maybe I
> missed
> something.
>
> Will re-iterate next week…
>
>
> P.
>
>
> >
> > Regards,
> > Christian.
> >
> > >
> > > >
> > > >
> > > > >
> > > > > I'm running out of ideas. What I'm wondering is whether we
> > > > > couldn't just remove hacky performance fastpath functions such as
> > > > > nouveau_fence_is_signaled() completely. It seems redundant to me.
> > > > >
> > > >
> > > > That would work for me as well.
> > > I'll test this approach. Seems a bit like the nuclear approach, but
> > > if it works we'd at least clean up a lot of this mess.
> > >
> > >
> > > P.
> > >
> > >
> > > >
> > > >
> > > > >
> > > > >
> > > > > Or we might add locking to it, but IDK what was achieved with RCU
> > > > > here.
> > > > > In any case it's definitely bad that Nouveau has so many redundant
> > > > > and half-redundant mechanisms.
> > > > >
> > > >
> > > > Yeah, agree messing with the locks even more won't help us
> > > > here.
> > > >
> > > > Regards,
> > > > Christian.
> > > >
> > > >
> > > > >
> > > > >
> > > > >
> > > > > P.
> > > > >
> > > > >
> > > > > >
> > > > > >
> > > > > > P.
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > Regards,
> > > > > > > Christian.
> > > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > P.
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > Regards,
> > > > > > > > > Christian.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Replace the call to dma_fence_is_signaled() with
> > > > > > > > > > nouveau_fence_base_is_signaled().
> > > > > > > > > >
> > > > > > > > > > > Cc: <stable@vger.kernel.org> # 4.10+, precise commit not to be determined
> > > > > > > > > > Signed-off-by: Philipp Stanner <phasta@kernel.org>
> > > > > > > > > > ---
> > > > > > > > > > drivers/gpu/drm/nouveau/nouveau_fence.c | 2 +-
> > > > > > > > > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > > > > > >
> > > > > > > > > > > diff --git a/drivers/gpu/drm/nouveau/nouveau_fence.c b/drivers/gpu/drm/nouveau/nouveau_fence.c
> > > > > > > > > > > index 7cc84472cece..33535987d8ed 100644
> > > > > > > > > > > --- a/drivers/gpu/drm/nouveau/nouveau_fence.c
> > > > > > > > > > > +++ b/drivers/gpu/drm/nouveau/nouveau_fence.c
> > > > > > > > > > > @@ -274,7 +274,7 @@ nouveau_fence_done(struct nouveau_fence *fence)
> > > > > > > > > > >                  nvif_event_block(&fctx->event);
> > > > > > > > > > >                  spin_unlock_irqrestore(&fctx->lock, flags);
> > > > > > > > > > >          }
> > > > > > > > > > > -        return dma_fence_is_signaled(&fence->base);
> > > > > > > > > > > +        return test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->base.flags);
> > > > > > > > > > >  }
> > > > > > > > > >
> > > > > > > > > > static long
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> >
>
* Re: [PATCH 1/3] drm/nouveau: Prevent signaled fences in pending list
2025-04-14 8:54 ` Philipp Stanner
@ 2025-04-14 14:27 ` Danilo Krummrich
2025-04-15 9:56 ` Christian König
0 siblings, 1 reply; 23+ messages in thread
From: Danilo Krummrich @ 2025-04-14 14:27 UTC (permalink / raw)
To: phasta
Cc: Christian König, Lyude Paul, David Airlie, Simona Vetter,
Sabrina Dubroca, Sumit Semwal, dri-devel, nouveau, linux-kernel,
netdev, linux-media, linaro-mm-sig, stable
On Mon, Apr 14, 2025 at 10:54:25AM +0200, Philipp Stanner wrote:
> @Danilo:
> We now have 2 possible solutions floating around for the firing WARN_ON.
>
> Version A (Christian)
> Check in nouveau_fence_context_kill() whether a fence is already
> signaled before setting an error.
>
> Version B (Me)
> This patch series here. Make sure that in Nouveau, only
> nouveau_fence_signal() signals fences.
>
>
> Both should do the trick. Please share a maintainer-preference so I can
> move on here.
Thanks for working on this Philipp.
If you don't want to rework things entirely, A seems to be superior, since it
also catches the case when someone else would call dma_fence_is_signaled() on a
nouveau fence (which could happen at any time). This doesn't seem to be caught
by B, right?
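
To make the difference concrete, here is a minimal userspace sketch of the scenario, not kernel code. Every name in it (model_fence, model_context_kill() and so on) is a hypothetical stand-in invented for illustration; it ignores locking and reference counting and only models the three facts discussed above: a fence can be signaled by a third party while it is still on the pending list, the kill path walks that list, and setting an error on a signaled fence triggers the warning.

```c
/* Userspace model only. All names are hypothetical stand-ins and are
 * NOT the kernel's dma_fence or nouveau API. No locking is modeled. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_fence {
	bool signaled;	/* stands in for DMA_FENCE_FLAG_SIGNALED_BIT */
	bool pending;	/* stands in for membership in the pending list */
	int error;	/* stands in for the fence's error code */
};

/* Counts how often an error was set on an already signaled fence,
 * i.e. how often the WARN_ON would have fired. */
static int model_warnings;

static void model_set_error(struct model_fence *f, int error)
{
	if (f->signaled)
		model_warnings++;
	f->error = error;
}

/* A third party polls the fence: it becomes signaled, but nothing
 * removes it from the pending list. */
static void model_signal_by_third_party(struct model_fence *f)
{
	f->signaled = true;
}

/* Stand-in for the kill path. With check_signaled == true it behaves
 * like Version A and skips the error for fences that already signaled. */
static void model_context_kill(struct model_fence *fences, size_t n,
			       int error, bool check_signaled)
{
	for (size_t i = 0; i < n; i++) {
		struct model_fence *f = &fences[i];

		if (!f->pending)
			continue;
		if (error && !(check_signaled && f->signaled))
			model_set_error(f, error);
		f->signaled = true;
		f->pending = false;
	}
}

/* Runs the scenario once and returns the number of modeled warnings. */
static int model_run(bool check_signaled)
{
	struct model_fence fences[2] = {
		{ .pending = true },
		{ .pending = true },
	};

	model_warnings = 0;
	model_signal_by_third_party(&fences[0]);
	/* -19 is just an arbitrary negative errno-style value here. */
	model_context_kill(fences, 2, -19, check_signaled);
	return model_warnings;
}
```

In this model, model_run(false) counts one warning and model_run(true) counts none, because the guard is evaluated at the point where the error would be set, regardless of who signaled the fence beforehand.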
* Re: [PATCH 1/3] drm/nouveau: Prevent signaled fences in pending list
2025-04-14 14:27 ` Danilo Krummrich
@ 2025-04-15 9:56 ` Christian König
2025-04-15 12:54 ` Philipp Stanner
0 siblings, 1 reply; 23+ messages in thread
From: Christian König @ 2025-04-15 9:56 UTC (permalink / raw)
To: Danilo Krummrich, phasta
Cc: Lyude Paul, David Airlie, Simona Vetter, Sabrina Dubroca,
Sumit Semwal, dri-devel, nouveau, linux-kernel, netdev,
linux-media, linaro-mm-sig, stable
Am 14.04.25 um 16:27 schrieb Danilo Krummrich:
> On Mon, Apr 14, 2025 at 10:54:25AM +0200, Philipp Stanner wrote:
>> @Danilo:
>> We now have 2 possible solutions floating around for the firing WARN_ON.
>>
>> Version A (Christian)
>> Check in nouveau_fence_context_kill() whether a fence is already
>> signaled before setting an error.
>>
>> Version B (Me)
>> This patch series here. Make sure that in Nouveau, only
>> nouveau_fence_signal() signals fences.
>>
>>
>> Both should do the trick. Please share a maintainer-preference so I can
>> move on here.
> Thanks for working on this Philipp.
>
> If you don't want to rework things entirely, A seems to be superior, since it
> also catches the case when someone else would call dma_fence_is_signaled() on a
> nouveau fence (which could happen at any time). This doesn't seem to be caught
> by B, right?
Correct, yes. I would also keep it as simple as possible for
backporting this bug fix.
On the other hand, a rework is certainly appropriate, covering both
nouveau and the DMA-fence calling rules. In particular, the fact that
the DMA-fence framework calls the signaled callback with inconsistent
locking is something we should fix.
Regards,
Christian.
* Re: [PATCH 1/3] drm/nouveau: Prevent signaled fences in pending list
2025-04-15 9:56 ` Christian König
@ 2025-04-15 12:54 ` Philipp Stanner
0 siblings, 0 replies; 23+ messages in thread
From: Philipp Stanner @ 2025-04-15 12:54 UTC (permalink / raw)
To: Christian König, Danilo Krummrich, phasta
Cc: Lyude Paul, David Airlie, Simona Vetter, Sabrina Dubroca,
Sumit Semwal, dri-devel, nouveau, linux-kernel, netdev,
linux-media, linaro-mm-sig, stable
On Tue, 2025-04-15 at 11:56 +0200, Christian König wrote:
> Am 14.04.25 um 16:27 schrieb Danilo Krummrich:
> > On Mon, Apr 14, 2025 at 10:54:25AM +0200, Philipp Stanner wrote:
> > > @Danilo:
> > > We now have 2 possible solutions floating around for the firing WARN_ON.
> > >
> > > Version A (Christian)
> > > Check in nouveau_fence_context_kill() whether a fence is already
> > > signaled before setting an error.
> > >
> > > Version B (Me)
> > > This patch series here. Make sure that in Nouveau, only
> > > nouveau_fence_signal() signals fences.
> > >
> > >
> > > Both should do the trick. Please share a maintainer-preference so
> > > I can
> > > move on here.
> > Thanks for working on this Philipp.
> >
> > If you don't want to rework things entirely, A seems to be
> > superior, since it
> > also catches the case when someone else would call
> > dma_fence_is_signaled() on a
> > nouveau fence (which could happen at any time). This doesn't seem
> > to be caught
> > by B, right?
>
> Correct, yes. I would also keep it as simple as possible for
> backporting this bug fix.
>
> On the other hand a rework is certainly appropriate including both
> nouveau as well as the DMA-fence calling rules. Especially that the
> DMA-fence framework calls the signaled callback with inconsistent
> locking is something we should fix.
Do you have a suggestion where to start?
I btw would still be interested in adding some sort of centralized
mechanism in dma_fence that the driver could use to do some cleanup
stuff once a fence gets signaled ^_^
P.
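
As a rough illustration of what such a centralized mechanism could look like, here is a small userspace sketch. It is purely hypothetical: dma_fence has no such hook according to this thread, all names (hook_fence, hook_fence_signal() and so on) are made up, and locking and reference counting are left out entirely, even though the discussion above identifies exactly those as the hard part.

```c
/* Userspace model only. The hook and all names are hypothetical and
 * are NOT part of the kernel's dma_fence API. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct hook_fence;

/* Hypothetical driver hook, invoked once when the fence signals. */
typedef void (*hook_fence_cleanup_fn)(struct hook_fence *f);

struct hook_fence {
	bool signaled;
	bool on_pending_list;	/* driver-side bookkeeping */
	hook_fence_cleanup_fn cleanup;
};

/* Example driver cleanup: drop the fence from the pending list. */
static void hook_remove_from_pending(struct hook_fence *f)
{
	f->on_pending_list = false;
}

/* The single signaling path. Returns false if the fence was already
 * signaled, so the cleanup hook runs at most once. */
static bool hook_fence_signal(struct hook_fence *f)
{
	if (f->signaled)
		return false;
	f->signaled = true;
	if (f->cleanup)
		f->cleanup(f);
	return true;
}

/* Polling goes through the same path, so a fence that signals during
 * a poll cannot be left behind on the pending list. hw_done stands in
 * for whatever the driver checks to see whether the job finished. */
static bool hook_fence_is_signaled(struct hook_fence *f, bool hw_done)
{
	if (!f->signaled && hw_done)
		hook_fence_signal(f);
	return f->signaled;
}
```

The point of the sketch is only that every way of signaling funnels through one function that also runs the driver's cleanup. Whether a real hook could safely drop list references from that context is a separate question that this model does not answer.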
>
> Regards,
> Christian.
Thread overview: 23+ messages
2025-04-10 9:24 [PATCH 0/3] drm/nouveau: Fix & improve nouveau_fence_done() Philipp Stanner
2025-04-10 9:24 ` [PATCH 1/3] drm/nouveau: Prevent signaled fences in pending list Philipp Stanner
2025-04-10 12:13 ` Christian König
2025-04-10 12:21 ` Danilo Krummrich
2025-04-10 12:42 ` Christian König
2025-04-10 12:58 ` Christian König
2025-04-10 13:09 ` Philipp Stanner
2025-04-10 13:16 ` Christian König
2025-04-10 15:36 ` Philipp Stanner
2025-04-11 9:29 ` Philipp Stanner
[not found] ` <81a70ba6-94b1-4bb3-a0b2-9e8890f90b33@amd.com>
2025-04-11 12:44 ` Philipp Stanner
2025-04-11 13:06 ` Christian König
2025-04-11 14:10 ` Philipp Stanner
2025-04-14 8:54 ` Philipp Stanner
2025-04-14 14:27 ` Danilo Krummrich
2025-04-15 9:56 ` Christian König
2025-04-15 12:54 ` Philipp Stanner
2025-04-10 9:24 ` [PATCH 2/3] drm/nouveau: Remove surplus if-branch Philipp Stanner
2025-04-10 12:15 ` Christian König
2025-04-10 9:24 ` [PATCH 3/3] drm/nouveau: Add helper to check base fence Philipp Stanner
2025-04-10 9:51 ` [PATCH 0/3] drm/nouveau: Fix & improve nouveau_fence_done() Philipp Stanner
2025-04-10 12:18 ` Christian König
2025-04-10 13:18 ` Philipp Stanner