All of lore.kernel.org
 help / color / mirror / Atom feed
From: Danilo Krummrich <dakr@kernel.org>
To: Zhi Wang <zhiw@nvidia.com>
Cc: nouveau@lists.freedesktop.org, airlied@gmail.com,
	daniel@ffwll.ch, bskeggs@nvidia.com, acurrid@nvidia.com,
	cjia@nvidia.com, smitra@nvidia.com, ankita@nvidia.com,
	aniketa@nvidia.com, kwankhede@nvidia.com, targupta@nvidia.com,
	zhiwang@kernel.org, Karol Herbst <kherbst@redhat.com>,
	Lyude Paul <lyude@redhat.com>, Danilo Krummrich <dakr@redhat.com>
Subject: Re: [PATCH 2/3] nvkm/gsp: correctly calculate the available space of the GSP cmdq buffer
Date: Fri, 4 Oct 2024 19:16:26 +0200	[thread overview]
Message-ID: <ZwAi6oMlVoa82hb2@pollux> (raw)
In-Reply-To: <20240922130709.1946893-3-zhiw@nvidia.com>

On Sun, Sep 22, 2024 at 06:07:08AM -0700, Zhi Wang wrote:
> r535_gsp_cmdq_push() waits for the available page in the GSP cmdq
> buffer when handling a large RPC request. When it sees at least one
> available page in the cmdq, it quits the waiting with the amount of
> free buffer pages in the queue.
> 
> Unfortunately, it always takes the [write pointer, buf_size) as
> available buffer pages before rolling back and wrongly calculates the
> size of the data should be copied. Thus, it can overwrite the RPC
> request that GSP is currently reading, which causes GSP hang due
> to corrupted RPC request:
> 
> [  549.209389] ------------[ cut here ]------------
> [  549.214010] WARNING: CPU: 8 PID: 6314 at drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c:116 r535_gsp_msgq_wait+0xd0/0x190 [nvkm]
> [  549.225678] Modules linked in: nvkm(E+) gsp_log(E) snd_seq_dummy(E) snd_hrtimer(E) snd_seq(E) snd_timer(E) snd_seq_device(E) snd(E) soundcore(E) rfkill(E) qrtr(E) vfat(E) fat(E) ipmi_ssif(E) amd_atl(E) intel_rapl_msr(E) intel_rapl_common(E) mlx5_ib(E) amd64_edac(E) edac_mce_amd(E) kvm_amd(E) ib_uverbs(E) kvm(E) ib_core(E) acpi_ipmi(E) ipmi_si(E) mxm_wmi(E) ipmi_devintf(E) rapl(E) i2c_piix4(E) wmi_bmof(E) joydev(E) ptdma(E) acpi_cpufreq(E) k10temp(E) pcspkr(E) ipmi_msghandler(E) xfs(E) libcrc32c(E) ast(E) i2c_algo_bit(E) crct10dif_pclmul(E) drm_shmem_helper(E) nvme_tcp(E) crc32_pclmul(E) ahci(E) drm_kms_helper(E) libahci(E) nvme_fabrics(E) crc32c_intel(E) nvme(E) cdc_ether(E) mlx5_core(E) nvme_core(E) usbnet(E) drm(E) libata(E) ccp(E) ghash_clmulni_intel(E) mii(E) t10_pi(E) mlxfw(E) sp5100_tco(E) psample(E) pci_hyperv_intf(E) wmi(E) dm_multipath(E) sunrpc(E) dm_mirror(E) dm_region_hash(E) dm_log(E) dm_mod(E) be2iscsi(E) bnx2i(E) cnic(E) uio(E) cxgb4i(E) cxgb4(E) tls(E) libcxgbi(E) libcxgb(E) qla4xxx(E)
> [  549.225752]  iscsi_boot_sysfs(E) iscsi_tcp(E) libiscsi_tcp(E) libiscsi(E) scsi_transport_iscsi(E) fuse(E) [last unloaded: gsp_log(E)]
> [  549.326293] CPU: 8 PID: 6314 Comm: insmod Tainted: G            E      6.9.0-rc6+ #1
> [  549.334039] Hardware name: ASRockRack 1U1G-MILAN/N/ROMED8-NL, BIOS L3.12E 09/06/2022
> [  549.341781] RIP: 0010:r535_gsp_msgq_wait+0xd0/0x190 [nvkm]
> [  549.347343] Code: 08 00 00 89 da c1 e2 0c 48 8d ac 11 00 10 00 00 48 8b 0c 24 48 85 c9 74 1f c1 e0 0c 4c 8d 6d 30 83 e8 30 89 01 e9 68 ff ff ff <0f> 0b 49 c7 c5 92 ff ff ff e9 5a ff ff ff ba ff ff ff ff be c0 0c
> [  549.366090] RSP: 0018:ffffacbccaaeb7d0 EFLAGS: 00010246
> [  549.371315] RAX: 0000000000000000 RBX: 0000000000000012 RCX: 0000000000923e28
> [  549.378451] RDX: 0000000000000000 RSI: 0000000055555554 RDI: ffffacbccaaeb730
> [  549.385590] RBP: 0000000000000001 R08: ffff8bd14d235f70 R09: ffff8bd14d235f70
> [  549.392721] R10: 0000000000000002 R11: ffff8bd14d233864 R12: 0000000000000020
> [  549.399854] R13: ffffacbccaaeb818 R14: 0000000000000020 R15: ffff8bb298c67000
> [  549.406988] FS:  00007f5179244740(0000) GS:ffff8bd14d200000(0000) knlGS:0000000000000000
> [  549.415076] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  549.420829] CR2: 00007fa844000010 CR3: 00000001567dc005 CR4: 0000000000770ef0
> [  549.427963] PKRU: 55555554
> [  549.430672] Call Trace:
> [  549.433126]  <TASK>
> [  549.435233]  ? __warn+0x7f/0x130
> [  549.438473]  ? r535_gsp_msgq_wait+0xd0/0x190 [nvkm]
> [  549.443426]  ? report_bug+0x18a/0x1a0
> [  549.447098]  ? handle_bug+0x3c/0x70
> [  549.450589]  ? exc_invalid_op+0x14/0x70
> [  549.454430]  ? asm_exc_invalid_op+0x16/0x20
> [  549.458619]  ? r535_gsp_msgq_wait+0xd0/0x190 [nvkm]
> [  549.463565]  r535_gsp_msg_recv+0x46/0x230 [nvkm]
> [  549.468257]  r535_gsp_rpc_push+0x106/0x160 [nvkm]
> [  549.473033]  r535_gsp_rpc_rm_ctrl_push+0x40/0x130 [nvkm]
> [  549.478422]  nvidia_grid_init_vgpu_types+0xbc/0xe0 [nvkm]
> [  549.483899]  nvidia_grid_init+0xb1/0xd0 [nvkm]
> [  549.488420]  ? srso_alias_return_thunk+0x5/0xfbef5
> [  549.493213]  nvkm_device_pci_probe+0x305/0x420 [nvkm]
> [  549.498338]  local_pci_probe+0x46/0xa0
> [  549.502096]  pci_call_probe+0x56/0x170
> [  549.505851]  pci_device_probe+0x79/0xf0
> [  549.509690]  ? driver_sysfs_add+0x59/0xc0
> [  549.513702]  really_probe+0xd9/0x380
> [  549.517282]  __driver_probe_device+0x78/0x150
> [  549.521640]  driver_probe_device+0x1e/0x90
> [  549.525746]  __driver_attach+0xd2/0x1c0
> [  549.529594]  ? __pfx___driver_attach+0x10/0x10
> [  549.534045]  bus_for_each_dev+0x78/0xd0
> [  549.537893]  bus_add_driver+0x112/0x210
> [  549.541750]  driver_register+0x5c/0x120
> [  549.545596]  ? __pfx_nvkm_init+0x10/0x10 [nvkm]
> [  549.550224]  do_one_initcall+0x44/0x300
> [  549.554063]  ? do_init_module+0x23/0x240
> [  549.557989]  do_init_module+0x64/0x240
> 
> Calculate the available buffer page before rolling back based on
> the result from the waiting.

It looks like you hit this one while working on the VFIO stuff too. So, same
question here, can we hit this case with "vanilla nouveau"?

> 
> Fixes: 176fdcbddfd28 ("drm/nouveau/gsp/r535: add support for booting GSP-RM")

Same as in patch 1.

> Cc: Ben Skeggs <bskeggs@nvidia.com>
> Cc: Karol Herbst <kherbst@redhat.com>
> Cc: Lyude Paul <lyude@redhat.com>
> Cc: Danilo Krummrich <dakr@redhat.com>
> Cc: David Airlie <airlied@gmail.com>
> Signed-off-by: Zhi Wang <zhiw@nvidia.com>
> ---
>  drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c b/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c
> index 736cde1987d0..49721935013b 100644
> --- a/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c
> +++ b/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c
> @@ -161,7 +161,7 @@ r535_gsp_cmdq_push(struct nvkm_gsp *gsp, void *argv)
>  	u64 *end;
>  	u64 csum = 0;
>  	int free, time = 1000000;
> -	u32 wptr, size;
> +	u32 wptr, size, step;
>  	u32 off = 0;
>  
>  	argc = ALIGN(GSP_MSG_HDR_SIZE + argc, GSP_PAGE_SIZE);
> @@ -178,11 +178,13 @@ r535_gsp_cmdq_push(struct nvkm_gsp *gsp, void *argv)
>  	cmd->checksum = upper_32_bits(csum) ^ lower_32_bits(csum);
>  
>  	wptr = *gsp->cmdq.wptr;
> +

Please remove the addition of empty lines here...

>  	do {
>  		do {
>  			free = *gsp->cmdq.rptr + gsp->cmdq.cnt - wptr - 1;
>  			if (free >= gsp->cmdq.cnt)
>  				free -= gsp->cmdq.cnt;
> +

and here.

>  			if (free >= 1)
>  				break;
>  
> @@ -195,7 +197,9 @@ r535_gsp_cmdq_push(struct nvkm_gsp *gsp, void *argv)
>  		}
>  
>  		cqe = (void *)((u8 *)gsp->shm.cmdq.ptr + 0x1000 + wptr * 0x1000);
> -		size = min_t(u32, argc, (gsp->cmdq.cnt - wptr) * GSP_PAGE_SIZE);
> +		step = min_t(u32, free, (gsp->cmdq.cnt - wptr));
> +		size = min_t(u32, argc, step * GSP_PAGE_SIZE);
> +
>  		memcpy(cqe, (u8 *)cmd + off, size);
>  
>  		wptr += DIV_ROUND_UP(size, 0x1000);
> -- 
> 2.34.1
> 

  reply	other threads:[~2024-10-04 17:16 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-09-22 13:07 [PATCH 0/3] NVKM GSP RPC fixes Zhi Wang
2024-09-22 13:07 ` [PATCH 1/3] nvkm/gsp: correctly advance the read pointer of GSP message queue Zhi Wang
2024-10-04 17:10   ` Danilo Krummrich
2024-09-22 13:07 ` [PATCH 2/3] nvkm/gsp: correctly calculate the available space of the GSP cmdq buffer Zhi Wang
2024-10-04 17:16   ` Danilo Krummrich [this message]
2024-10-13 18:27     ` Zhi Wang
2024-10-14 15:31       ` Danilo Krummrich
2024-09-22 13:07 ` [PATCH 3/3] nvkm/gsp: handle the return of large RPC Zhi Wang
2024-10-14 16:18   ` Danilo Krummrich

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZwAi6oMlVoa82hb2@pollux \
    --to=dakr@kernel.org \
    --cc=acurrid@nvidia.com \
    --cc=airlied@gmail.com \
    --cc=aniketa@nvidia.com \
    --cc=ankita@nvidia.com \
    --cc=bskeggs@nvidia.com \
    --cc=cjia@nvidia.com \
    --cc=dakr@redhat.com \
    --cc=daniel@ffwll.ch \
    --cc=kherbst@redhat.com \
    --cc=kwankhede@nvidia.com \
    --cc=lyude@redhat.com \
    --cc=nouveau@lists.freedesktop.org \
    --cc=smitra@nvidia.com \
    --cc=targupta@nvidia.com \
    --cc=zhiw@nvidia.com \
    --cc=zhiwang@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.