From: Danilo Krummrich <dakr@kernel.org>
To: Zhi Wang <zhiw@nvidia.com>
Cc: nouveau@lists.freedesktop.org, airlied@gmail.com,
daniel@ffwll.ch, bskeggs@nvidia.com, acurrid@nvidia.com,
cjia@nvidia.com, smitra@nvidia.com, ankita@nvidia.com,
aniketa@nvidia.com, kwankhede@nvidia.com, targupta@nvidia.com,
zhiwang@kernel.org, Karol Herbst <kherbst@redhat.com>,
Lyude Paul <lyude@redhat.com>, Danilo Krummrich <dakr@redhat.com>
Subject: Re: [PATCH 1/3] nvkm/gsp: correctly advance the read pointer of GSP message queue
Date: Fri, 4 Oct 2024 19:10:37 +0200 [thread overview]
Message-ID: <ZwAhjctJdIZRbyAX@pollux> (raw)
In-Reply-To: <20240922130709.1946893-2-zhiw@nvidia.com>
Hi Zhi,
On Sun, Sep 22, 2024 at 06:07:07AM -0700, Zhi Wang wrote:
> A GSP event message consists three parts: message header, RPC header,
> message body. GSP calculates the number of pages to write from the
> total size of a GSP message. This behavior can be observed from the
> movement of the write pointer.
>
> However, nvkm takes only the size of RPC header and message body as
> the message size when advancing the read pointer. When handling a
> two-page GSP message in the non rollback case, It wrongly takes the
> message body of the previous message as the message header of the next
> message. As the "message length" tends to be zero, in the calculation of
> size needs to be copied (0 - size of (message header)), the size needs to
> be copied will be "0xffffffxx". It also triggers a kernel panic due to a
> NULL pointer error.
>
> [ 547.614102] msg: 00000f90: ff ff ff ff ff ff ff ff 40 d7 18 fb 8b 00 00 00 ........@.......
> [ 547.622533] msg: 00000fa0: 00 00 00 00 ff ff ff ff ff ff ff ff 00 00 00 00 ................
> [ 547.630965] msg: 00000fb0: ff ff ff ff ff ff ff ff 00 00 00 00 ff ff ff ff ................
> [ 547.639397] msg: 00000fc0: ff ff ff ff 00 00 00 00 ff ff ff ff ff ff ff ff ................
> [ 547.647832] nvkm 0000:c1:00.0: gsp: peek msg rpc fn:0 len:0x0/0xffffffffffffffe0
> [ 547.655225] nvkm 0000:c1:00.0: gsp: get msg rpc fn:0 len:0x0/0xffffffffffffffe0
> [ 547.662532] BUG: kernel NULL pointer dereference, address: 0000000000000020
> [ 547.669485] #PF: supervisor read access in kernel mode
> [ 547.674624] #PF: error_code(0x0000) - not-present page
> [ 547.679755] PGD 0 P4D 0
> [ 547.682294] Oops: 0000 [#1] PREEMPT SMP NOPTI
> [ 547.686643] CPU: 22 PID: 322 Comm: kworker/22:1 Tainted: G E 6.9.0-rc6+ #1
> [ 547.694893] Hardware name: ASRockRack 1U1G-MILAN/N/ROMED8-NL, BIOS L3.12E 09/06/2022
> [ 547.702626] Workqueue: events r535_gsp_msgq_work [nvkm]
> [ 547.707921] RIP: 0010:r535_gsp_msg_recv+0x87/0x230 [nvkm]
> [ 547.713375] Code: 00 8b 70 08 48 89 e1 31 d2 4c 89 f7 e8 12 f5 ff ff 48 89 c5 48 85 c0 0f 84 cf 00 00 00 48 81 fd 00 f0 ff ff 0f 87 c4 00 00 00 <8b> 55 10 41 8b 46 30 85 d2 0f 85 f6 00 00 00 83 f8 04 76 10 ba 05
> [ 547.732119] RSP: 0018:ffffabe440f87e10 EFLAGS: 00010203
> [ 547.737335] RAX: 0000000000000010 RBX: 0000000000000008 RCX: 000000000000003f
> [ 547.744461] RDX: 0000000000000000 RSI: ffffabe4480a8030 RDI: 0000000000000010
> [ 547.751585] RBP: 0000000000000010 R08: 0000000000000000 R09: ffffabe440f87bb0
> [ 547.758707] R10: ffffabe440f87dc8 R11: 0000000000000010 R12: 0000000000000000
> [ 547.765834] R13: 0000000000000000 R14: ffff9351df1e5000 R15: 0000000000000000
> [ 547.772958] FS: 0000000000000000(0000) GS:ffff93708eb00000(0000) knlGS:0000000000000000
> [ 547.781035] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 547.786771] CR2: 0000000000000020 CR3: 00000003cc220002 CR4: 0000000000770ef0
> [ 547.793896] PKRU: 55555554
> [ 547.796600] Call Trace:
> [ 547.799046] <TASK>
> [ 547.801152] ? __die+0x20/0x70
> [ 547.804211] ? page_fault_oops+0x75/0x170
> [ 547.808221] ? print_hex_dump+0x100/0x160
> [ 547.812226] ? exc_page_fault+0x64/0x150
> [ 547.816152] ? asm_exc_page_fault+0x22/0x30
> [ 547.820341] ? r535_gsp_msg_recv+0x87/0x230 [nvkm]
> [ 547.825184] r535_gsp_msgq_work+0x42/0x50 [nvkm]
> [ 547.829845] process_one_work+0x196/0x3d0
> [ 547.833861] worker_thread+0x2fc/0x410
> [ 547.837613] ? __pfx_worker_thread+0x10/0x10
> [ 547.841885] kthread+0xdf/0x110
> [ 547.845031] ? __pfx_kthread+0x10/0x10
> [ 547.848775] ret_from_fork+0x30/0x50
> [ 547.852354] ? __pfx_kthread+0x10/0x10
> [ 547.856097] ret_from_fork_asm+0x1a/0x30
> [ 547.860019] </TASK>
> [ 547.862208] Modules linked in: nvkm(E) gsp_log(E) snd_seq_dummy(E) snd_hrtimer(E) snd_seq(E) snd_timer(E) snd_seq_device(E) snd(E) soundcore(E) rfkill(E) qrtr(E) vfat(E) fat(E) ipmi_ssif(E) amd_atl(E) intel_rapl_msr(E) intel_rapl_common(E) amd64_edac(E) mlx5_ib(E) edac_mce_amd(E) kvm_amd(E) ib_uverbs(E) kvm(E) ib_core(E) acpi_ipmi(E) ipmi_si(E) ipmi_devintf(E) mxm_wmi(E) joydev(E) rapl(E) ptdma(E) i2c_piix4(E) acpi_cpufreq(E) wmi_bmof(E) pcspkr(E) k10temp(E) ipmi_msghandler(E) xfs(E) libcrc32c(E) ast(E) i2c_algo_bit(E) drm_shmem_helper(E) crct10dif_pclmul(E) drm_kms_helper(E) ahci(E) crc32_pclmul(E) nvme_tcp(E) libahci(E) nvme(E) crc32c_intel(E) nvme_fabrics(E) cdc_ether(E) nvme_core(E) usbnet(E) mlx5_core(E) ghash_clmulni_intel(E) drm(E) libata(E) ccp(E) mii(E) t10_pi(E) mlxfw(E) sp5100_tco(E) psample(E) pci_hyperv_intf(E) wmi(E) dm_multipath(E) sunrpc(E) dm_mirror(E) dm_region_hash(E) dm_log(E) dm_mod(E) be2iscsi(E) bnx2i(E) cnic(E) uio(E) cxgb4i(E) cxgb4(E) tls(E) libcxgbi(E) libcxgb(E) qla4xxx(E)
> [ 547.862283] iscsi_boot_sysfs(E) iscsi_tcp(E) libiscsi_tcp(E) libiscsi(E) scsi_transport_iscsi(E) fuse(E) [last unloaded: gsp_log(E)]
> [ 547.962691] CR2: 0000000000000020
> [ 547.966003] ---[ end trace 0000000000000000 ]---
> [ 549.012012] clocksource: Long readout interval, skipping watchdog check: cs_nsec: 1370499158 wd_nsec: 1370498904
> [ 549.043676] pstore: backend (erst) writing error (-28)
> [ 549.050924] RIP: 0010:r535_gsp_msg_recv+0x87/0x230 [nvkm]
> [ 549.056389] Code: 00 8b 70 08 48 89 e1 31 d2 4c 89 f7 e8 12 f5 ff ff 48 89 c5 48 85 c0 0f 84 cf 00 00 00 48 81 fd 00 f0 ff ff 0f 87 c4 00 00 00 <8b> 55 10 41 8b 46 30 85 d2 0f 85 f6 00 00 00 83 f8 04 76 10 ba 05
> [ 549.075138] RSP: 0018:ffffabe440f87e10 EFLAGS: 00010203
> [ 549.080361] RAX: 0000000000000010 RBX: 0000000000000008 RCX: 000000000000003f
> [ 549.087484] RDX: 0000000000000000 RSI: ffffabe4480a8030 RDI: 0000000000000010
> [ 549.094609] RBP: 0000000000000010 R08: 0000000000000000 R09: ffffabe440f87bb0
> [ 549.101733] R10: ffffabe440f87dc8 R11: 0000000000000010 R12: 0000000000000000
> [ 549.108857] R13: 0000000000000000 R14: ffff9351df1e5000 R15: 0000000000000000
> [ 549.115982] FS: 0000000000000000(0000) GS:ffff93708eb00000(0000) knlGS:0000000000000000
> [ 549.124061] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 549.129807] CR2: 0000000000000020 CR3: 00000003cc220002 CR4: 0000000000770ef0
> [ 549.136940] PKRU: 55555554
> [ 549.139653] Kernel panic - not syncing: Fatal exception
> [ 549.145054] Kernel Offset: 0x18c00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> [ 549.165074] ---[ end Kernel panic - not syncing: Fatal exception ]---
>
> Also, nvkm wrongly advances the read pointer when handling a two-page GSP
> message in the rollback case. In the rollback case, the GSP message will
> be copied in two rounds. When handling a two-page GSP message, nvkm first
> copies amount of (GSP_PAGE_SIZE - header) data into the buffer, then
> advances the read pointer by the result of DIV_ROUND_UP(size,
> GSP_PAGE_SIZE). Thus, the read pointer is advanced by 1.
>
> Next, nvkm copies the amount of (total size - (GSP_PAGE_SIZE -
> header)) data into the buffer. The left amount of the data will be always
> larger than one page since the message header is not taken into account
> in the first copy. Thus, the read pointer is advanced by DIV_ROUND_UP(
> size(larger than one page), GSP_PAGE_SIZE) = 2.
>
> In the end, the read pointer is wrongly advanced by 3 when handling a
> two-page GSP message in the rollback case.
>
> Fix the problems by taking the total size of the message into account
> when advancing the read pointer and calculate the read pointer in the end
> of the all copies for the rollback case.
>
> BTW: the two-page GSP message can be observed in the msgq when vGPU is
> enabled.
Thanks for the detailed description!
Do I get it right that with "vanilla nouveau" we're not able to hit this bug?
>
> Fixes: 176fdcbddfd28 ("drm/nouveau/gsp/r535: add support for booting GSP-RM")
Please make sure to run ./scripts/checkpatch.pl, 'Fixes' uses 12 chars of sha1.
> Cc: Ben Skeggs <bskeggs@nvidia.com>
> Cc: Karol Herbst <kherbst@redhat.com>
> Cc: Lyude Paul <lyude@redhat.com>
> Cc: Danilo Krummrich <dakr@redhat.com>
> Cc: David Airlie <airlied@gmail.com>
> Signed-off-by: Zhi Wang <zhiw@nvidia.com>
> ---
> drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c | 10 ++++------
> 1 file changed, 4 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c b/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c
> index cf58f9da9139..736cde1987d0 100644
> --- a/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c
> +++ b/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c
> @@ -121,6 +121,8 @@ r535_gsp_msgq_wait(struct nvkm_gsp *gsp, u32 repc, u32 *prepc, int *ptime)
> return mqe->data;
> }
>
> + size = ALIGN(repc + GSP_MSG_HDR_SIZE, GSP_PAGE_SIZE);
> +
> msg = kvmalloc(repc, GFP_KERNEL);
> if (!msg)
> return ERR_PTR(-ENOMEM);
> @@ -129,19 +131,15 @@ r535_gsp_msgq_wait(struct nvkm_gsp *gsp, u32 repc, u32 *prepc, int *ptime)
> len = min_t(u32, repc, len);
> memcpy(msg, mqe->data, len);
>
> - rptr += DIV_ROUND_UP(len, GSP_PAGE_SIZE);
> - if (rptr == gsp->msgq.cnt)
> - rptr = 0;
> -
> repc -= len;
>
> if (repc) {
> mqe = (void *)((u8 *)gsp->shm.msgq.ptr + 0x1000 + 0 * 0x1000);
> memcpy(msg + len, mqe, repc);
> -
> - rptr += DIV_ROUND_UP(repc, GSP_PAGE_SIZE);
> }
>
> + rptr = (rptr + DIV_ROUND_UP(size, GSP_PAGE_SIZE)) % gsp->msgq.cnt;
> +
> mb();
> (*gsp->msgq.rptr) = rptr;
> return msg;
> --
> 2.34.1
>
next prev parent reply other threads:[~2024-10-04 17:10 UTC|newest]
Thread overview: 10+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-09-22 13:07 [PATCH 0/3] NVKM GSP RPC fixes Zhi Wang
2024-09-22 13:07 ` [PATCH 1/3] nvkm/gsp: correctly advance the read pointer of GSP message queue Zhi Wang
2024-10-04 17:10 ` Danilo Krummrich [this message]
2024-09-22 13:07 ` [PATCH 2/3] nvkm/gsp: correctly calculate the available space of the GSP cmdq buffer Zhi Wang
2024-10-04 17:16 ` Danilo Krummrich
2024-10-13 18:27 ` Zhi Wang
2024-10-14 15:31 ` Danilo Krummrich
2024-09-22 13:07 ` [PATCH 3/3] nvkm/gsp: handle the return of large RPC Zhi Wang
2024-10-14 16:18 ` Danilo Krummrich
-- strict thread matches above, loose matches on Subject: below --
2024-10-17 7:19 [PATCH 0/3] NVKM GSP RPC fixes Zhi Wang
2024-10-17 7:19 ` [PATCH 1/3] nvkm/gsp: correctly advance the read pointer of GSP message queue Zhi Wang
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ZwAhjctJdIZRbyAX@pollux \
--to=dakr@kernel.org \
--cc=acurrid@nvidia.com \
--cc=airlied@gmail.com \
--cc=aniketa@nvidia.com \
--cc=ankita@nvidia.com \
--cc=bskeggs@nvidia.com \
--cc=cjia@nvidia.com \
--cc=dakr@redhat.com \
--cc=daniel@ffwll.ch \
--cc=kherbst@redhat.com \
--cc=kwankhede@nvidia.com \
--cc=lyude@redhat.com \
--cc=nouveau@lists.freedesktop.org \
--cc=smitra@nvidia.com \
--cc=targupta@nvidia.com \
--cc=zhiw@nvidia.com \
--cc=zhiwang@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.