AMD-GFX Archive on lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] drm/amd/pm: Vangogh: Fix kernel memory out of bounds write
@ 2024-10-25 14:15 Tvrtko Ursulin
  2024-10-25 14:23 ` Mario Limonciello
  0 siblings, 1 reply; 4+ messages in thread
From: Tvrtko Ursulin @ 2024-10-25 14:15 UTC (permalink / raw)
  To: amd-gfx
  Cc: kernel-dev, Tvrtko Ursulin, Mario Limonciello, Evan Quan,
	Wenyou Yang, Alex Deucher, stable

From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>

KASAN reports that the GPU metrics table allocated in
vangogh_tables_init() is not large enough for the memset done in
smu_cmn_init_soft_gpu_metrics(). Condensed report follows:

[   33.861314] BUG: KASAN: slab-out-of-bounds in smu_cmn_init_soft_gpu_metrics+0x73/0x200 [amdgpu]
[   33.861799] Write of size 168 at addr ffff888129f59500 by task mangoapp/1067
...
[   33.861808] CPU: 6 UID: 1000 PID: 1067 Comm: mangoapp Tainted: G        W          6.12.0-rc4 #356 1a56f59a8b5182eeaf67eb7cb8b13594dd23b544
[   33.861816] Tainted: [W]=WARN
[   33.861818] Hardware name: Valve Galileo/Galileo, BIOS F7G0107 12/01/2023
[   33.861822] Call Trace:
[   33.861826]  <TASK>
[   33.861829]  dump_stack_lvl+0x66/0x90
[   33.861838]  print_report+0xce/0x620
[   33.861853]  kasan_report+0xda/0x110
[   33.862794]  kasan_check_range+0xfd/0x1a0
[   33.862799]  __asan_memset+0x23/0x40
[   33.862803]  smu_cmn_init_soft_gpu_metrics+0x73/0x200 [amdgpu 13b1bc364ec578808f676eba412c20eaab792779]
[   33.863306]  vangogh_get_gpu_metrics_v2_4+0x123/0xad0 [amdgpu 13b1bc364ec578808f676eba412c20eaab792779]
[   33.864257]  vangogh_common_get_gpu_metrics+0xb0c/0xbc0 [amdgpu 13b1bc364ec578808f676eba412c20eaab792779]
[   33.865682]  amdgpu_dpm_get_gpu_metrics+0xcc/0x110 [amdgpu 13b1bc364ec578808f676eba412c20eaab792779]
[   33.866160]  amdgpu_get_gpu_metrics+0x154/0x2d0 [amdgpu 13b1bc364ec578808f676eba412c20eaab792779]
[   33.867135]  dev_attr_show+0x43/0xc0
[   33.867147]  sysfs_kf_seq_show+0x1f1/0x3b0
[   33.867155]  seq_read_iter+0x3f8/0x1140
[   33.867173]  vfs_read+0x76c/0xc50
[   33.867198]  ksys_read+0xfb/0x1d0
[   33.867214]  do_syscall_64+0x90/0x160
...
[   33.867353] Allocated by task 378 on cpu 7 at 22.794876s:
[   33.867358]  kasan_save_stack+0x33/0x50
[   33.867364]  kasan_save_track+0x17/0x60
[   33.867367]  __kasan_kmalloc+0x87/0x90
[   33.867371]  vangogh_init_smc_tables+0x3f9/0x840 [amdgpu]
[   33.867835]  smu_sw_init+0xa32/0x1850 [amdgpu]
[   33.868299]  amdgpu_device_init+0x467b/0x8d90 [amdgpu]
[   33.868733]  amdgpu_driver_load_kms+0x19/0xf0 [amdgpu]
[   33.869167]  amdgpu_pci_probe+0x2d6/0xcd0 [amdgpu]
[   33.869608]  local_pci_probe+0xda/0x180
[   33.869614]  pci_device_probe+0x43f/0x6b0

Empirically we can confirm that the former allocates 152 bytes for the
table, while the latter memsets the 168 large block.

This is somewhat alleviated by the fact that allocation goes into a 192
SLAB bucket, but then for v3_0 metrics the table grows to 264 bytes which
would definitely be a problem.

Root cause appears that when GPU metrics tables for v2_4 parts were added
it was not considered to enlarge the table to fit.

The fix in this patch is rather "brute force" and perhaps later should be
done in a smarter way, by extracting and consolidating the part version to
size logic to a common helper, instead of brute forcing the largest
possible allocation. Nevertheless, for now this works and fixes the out of
bounds write.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Fixes: 41cec40bc9ba ("drm/amd/pm: Vangogh: Add new gpu_metrics_v2_4 to acquire gpu_metrics")
Cc: Mario Limonciello <mario.limonciello@amd.com>
Cc: Evan Quan <evan.quan@amd.com>
Cc: Wenyou Yang <WenYou.Yang@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: <stable@vger.kernel.org> # v6.6+
---
 drivers/gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c b/drivers/gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c
index 22737b11b1bf..36f4a4651918 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c
@@ -242,7 +242,10 @@ static int vangogh_tables_init(struct smu_context *smu)
 		goto err0_out;
 	smu_table->metrics_time = 0;
 
-	smu_table->gpu_metrics_table_size = max(sizeof(struct gpu_metrics_v2_3), sizeof(struct gpu_metrics_v2_2));
+	smu_table->gpu_metrics_table_size = sizeof(struct gpu_metrics_v2_2);
+	smu_table->gpu_metrics_table_size = max(smu_table->gpu_metrics_table_size, sizeof(struct gpu_metrics_v2_3));
+	smu_table->gpu_metrics_table_size = max(smu_table->gpu_metrics_table_size, sizeof(struct gpu_metrics_v2_4));
+	smu_table->gpu_metrics_table_size = max(smu_table->gpu_metrics_table_size, sizeof(struct gpu_metrics_v3_0));
 	smu_table->gpu_metrics_table = kzalloc(smu_table->gpu_metrics_table_size, GFP_KERNEL);
 	if (!smu_table->gpu_metrics_table)
 		goto err1_out;
-- 
2.46.0


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH] drm/amd/pm: Vangogh: Fix kernel memory out of bounds write
  2024-10-25 14:15 [PATCH] drm/amd/pm: Vangogh: Fix kernel memory out of bounds write Tvrtko Ursulin
@ 2024-10-25 14:23 ` Mario Limonciello
  2024-10-25 14:40   ` Tvrtko Ursulin
  0 siblings, 1 reply; 4+ messages in thread
From: Mario Limonciello @ 2024-10-25 14:23 UTC (permalink / raw)
  To: Tvrtko Ursulin, amd-gfx
  Cc: kernel-dev, Tvrtko Ursulin, Evan Quan, Wenyou Yang, Alex Deucher,
	stable

On 10/25/2024 09:15, Tvrtko Ursulin wrote:
> From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
> 
> KASAN reports that the GPU metrics table allocated in
> vangogh_tables_init() is not large enough for the memset done in
> smu_cmn_init_soft_gpu_metrics(). Condensed report follows:
> 
> [   33.861314] BUG: KASAN: slab-out-of-bounds in smu_cmn_init_soft_gpu_metrics+0x73/0x200 [amdgpu]
> [   33.861799] Write of size 168 at addr ffff888129f59500 by task mangoapp/1067
> ...
> [   33.861808] CPU: 6 UID: 1000 PID: 1067 Comm: mangoapp Tainted: G        W          6.12.0-rc4 #356 1a56f59a8b5182eeaf67eb7cb8b13594dd23b544
> [   33.861816] Tainted: [W]=WARN
> [   33.861818] Hardware name: Valve Galileo/Galileo, BIOS F7G0107 12/01/2023
> [   33.861822] Call Trace:
> [   33.861826]  <TASK>
> [   33.861829]  dump_stack_lvl+0x66/0x90
> [   33.861838]  print_report+0xce/0x620
> [   33.861853]  kasan_report+0xda/0x110
> [   33.862794]  kasan_check_range+0xfd/0x1a0
> [   33.862799]  __asan_memset+0x23/0x40
> [   33.862803]  smu_cmn_init_soft_gpu_metrics+0x73/0x200 [amdgpu 13b1bc364ec578808f676eba412c20eaab792779]
> [   33.863306]  vangogh_get_gpu_metrics_v2_4+0x123/0xad0 [amdgpu 13b1bc364ec578808f676eba412c20eaab792779]
> [   33.864257]  vangogh_common_get_gpu_metrics+0xb0c/0xbc0 [amdgpu 13b1bc364ec578808f676eba412c20eaab792779]
> [   33.865682]  amdgpu_dpm_get_gpu_metrics+0xcc/0x110 [amdgpu 13b1bc364ec578808f676eba412c20eaab792779]
> [   33.866160]  amdgpu_get_gpu_metrics+0x154/0x2d0 [amdgpu 13b1bc364ec578808f676eba412c20eaab792779]
> [   33.867135]  dev_attr_show+0x43/0xc0
> [   33.867147]  sysfs_kf_seq_show+0x1f1/0x3b0
> [   33.867155]  seq_read_iter+0x3f8/0x1140
> [   33.867173]  vfs_read+0x76c/0xc50
> [   33.867198]  ksys_read+0xfb/0x1d0
> [   33.867214]  do_syscall_64+0x90/0x160
> ...
> [   33.867353] Allocated by task 378 on cpu 7 at 22.794876s:
> [   33.867358]  kasan_save_stack+0x33/0x50
> [   33.867364]  kasan_save_track+0x17/0x60
> [   33.867367]  __kasan_kmalloc+0x87/0x90
> [   33.867371]  vangogh_init_smc_tables+0x3f9/0x840 [amdgpu]
> [   33.867835]  smu_sw_init+0xa32/0x1850 [amdgpu]
> [   33.868299]  amdgpu_device_init+0x467b/0x8d90 [amdgpu]
> [   33.868733]  amdgpu_driver_load_kms+0x19/0xf0 [amdgpu]
> [   33.869167]  amdgpu_pci_probe+0x2d6/0xcd0 [amdgpu]
> [   33.869608]  local_pci_probe+0xda/0x180
> [   33.869614]  pci_device_probe+0x43f/0x6b0
> 
> Empirically we can confirm that the former allocates 152 bytes for the
> table, while the latter memsets the 168 large block.
> 
> This is somewhat alleviated by the fact that allocation goes into a 192
> SLAB bucket, but then for v3_0 metrics the table grows to 264 bytes which
> would definitely be a problem.
> 
> Root cause appears that when GPU metrics tables for v2_4 parts were added
> it was not considered to enlarge the table to fit.
> 
> The fix in this patch is rather "brute force" and perhaps later should be
> done in a smarter way, by extracting and consolidating the part version to
> size logic to a common helper, instead of brute forcing the largest
> possible allocation. Nevertheless, for now this works and fixes the out of
> bounds write.
> 
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
> Fixes: 41cec40bc9ba ("drm/amd/pm: Vangogh: Add new gpu_metrics_v2_4 to acquire gpu_metrics")
> Cc: Mario Limonciello <mario.limonciello@amd.com>
> Cc: Evan Quan <evan.quan@amd.com>
> Cc: Wenyou Yang <WenYou.Yang@amd.com>
> Cc: Alex Deucher <alexander.deucher@amd.com>
> Cc: <stable@vger.kernel.org> # v6.6+
> ---
>   drivers/gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c | 5 ++++-
>   1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c b/drivers/gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c
> index 22737b11b1bf..36f4a4651918 100644
> --- a/drivers/gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c
> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c
> @@ -242,7 +242,10 @@ static int vangogh_tables_init(struct smu_context *smu)
>   		goto err0_out;
>   	smu_table->metrics_time = 0;
>   
> -	smu_table->gpu_metrics_table_size = max(sizeof(struct gpu_metrics_v2_3), sizeof(struct gpu_metrics_v2_2));
> +	smu_table->gpu_metrics_table_size = sizeof(struct gpu_metrics_v2_2);
> +	smu_table->gpu_metrics_table_size = max(smu_table->gpu_metrics_table_size, sizeof(struct gpu_metrics_v2_3));
> +	smu_table->gpu_metrics_table_size = max(smu_table->gpu_metrics_table_size, sizeof(struct gpu_metrics_v2_4));
> +	smu_table->gpu_metrics_table_size = max(smu_table->gpu_metrics_table_size, sizeof(struct gpu_metrics_v3_0));

AFAICT Van Gogh only supports 2.2, 2.3 or 2.4.  I don't think there is a 
need to compare to 3.0 to solve this bug this way.

But generally yeah moving the initialization to a helper that actually 
knows the size would be another way to solve this.

Or looking at it how about moving all the conditional code in 
vangogh_common_get_gpu_metrics() into vangogh_tables_init() and then 
assigning a vfunc for vangogh_common_get_gpu_metrics() to call?

>   	smu_table->gpu_metrics_table = kzalloc(smu_table->gpu_metrics_table_size, GFP_KERNEL);
>   	if (!smu_table->gpu_metrics_table)
>   		goto err1_out;


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] drm/amd/pm: Vangogh: Fix kernel memory out of bounds write
  2024-10-25 14:23 ` Mario Limonciello
@ 2024-10-25 14:40   ` Tvrtko Ursulin
  2024-10-25 17:58     ` Mario Limonciello
  0 siblings, 1 reply; 4+ messages in thread
From: Tvrtko Ursulin @ 2024-10-25 14:40 UTC (permalink / raw)
  To: Mario Limonciello, Tvrtko Ursulin, amd-gfx
  Cc: kernel-dev, Evan Quan, Wenyou Yang, Alex Deucher, stable


On 25/10/2024 15:23, Mario Limonciello wrote:
> On 10/25/2024 09:15, Tvrtko Ursulin wrote:
>> From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>>
>> KASAN reports that the GPU metrics table allocated in
>> vangogh_tables_init() is not large enough for the memset done in
>> smu_cmn_init_soft_gpu_metrics(). Condensed report follows:
>>
>> [   33.861314] BUG: KASAN: slab-out-of-bounds in 
>> smu_cmn_init_soft_gpu_metrics+0x73/0x200 [amdgpu]
>> [   33.861799] Write of size 168 at addr ffff888129f59500 by task 
>> mangoapp/1067
>> ...
>> [   33.861808] CPU: 6 UID: 1000 PID: 1067 Comm: mangoapp Tainted: 
>> G        W          6.12.0-rc4 #356 
>> 1a56f59a8b5182eeaf67eb7cb8b13594dd23b544
>> [   33.861816] Tainted: [W]=WARN
>> [   33.861818] Hardware name: Valve Galileo/Galileo, BIOS F7G0107 
>> 12/01/2023
>> [   33.861822] Call Trace:
>> [   33.861826]  <TASK>
>> [   33.861829]  dump_stack_lvl+0x66/0x90
>> [   33.861838]  print_report+0xce/0x620
>> [   33.861853]  kasan_report+0xda/0x110
>> [   33.862794]  kasan_check_range+0xfd/0x1a0
>> [   33.862799]  __asan_memset+0x23/0x40
>> [   33.862803]  smu_cmn_init_soft_gpu_metrics+0x73/0x200 [amdgpu 
>> 13b1bc364ec578808f676eba412c20eaab792779]
>> [   33.863306]  vangogh_get_gpu_metrics_v2_4+0x123/0xad0 [amdgpu 
>> 13b1bc364ec578808f676eba412c20eaab792779]
>> [   33.864257]  vangogh_common_get_gpu_metrics+0xb0c/0xbc0 [amdgpu 
>> 13b1bc364ec578808f676eba412c20eaab792779]
>> [   33.865682]  amdgpu_dpm_get_gpu_metrics+0xcc/0x110 [amdgpu 
>> 13b1bc364ec578808f676eba412c20eaab792779]
>> [   33.866160]  amdgpu_get_gpu_metrics+0x154/0x2d0 [amdgpu 
>> 13b1bc364ec578808f676eba412c20eaab792779]
>> [   33.867135]  dev_attr_show+0x43/0xc0
>> [   33.867147]  sysfs_kf_seq_show+0x1f1/0x3b0
>> [   33.867155]  seq_read_iter+0x3f8/0x1140
>> [   33.867173]  vfs_read+0x76c/0xc50
>> [   33.867198]  ksys_read+0xfb/0x1d0
>> [   33.867214]  do_syscall_64+0x90/0x160
>> ...
>> [   33.867353] Allocated by task 378 on cpu 7 at 22.794876s:
>> [   33.867358]  kasan_save_stack+0x33/0x50
>> [   33.867364]  kasan_save_track+0x17/0x60
>> [   33.867367]  __kasan_kmalloc+0x87/0x90
>> [   33.867371]  vangogh_init_smc_tables+0x3f9/0x840 [amdgpu]
>> [   33.867835]  smu_sw_init+0xa32/0x1850 [amdgpu]
>> [   33.868299]  amdgpu_device_init+0x467b/0x8d90 [amdgpu]
>> [   33.868733]  amdgpu_driver_load_kms+0x19/0xf0 [amdgpu]
>> [   33.869167]  amdgpu_pci_probe+0x2d6/0xcd0 [amdgpu]
>> [   33.869608]  local_pci_probe+0xda/0x180
>> [   33.869614]  pci_device_probe+0x43f/0x6b0
>>
>> Empirically we can confirm that the former allocates 152 bytes for the
>> table, while the latter memsets the 168 large block.
>>
>> This is somewhat alleviated by the fact that allocation goes into a 192
>> SLAB bucket, but then for v3_0 metrics the table grows to 264 bytes which
>> would definitely be a problem.
>>
>> Root cause appears that when GPU metrics tables for v2_4 parts were added
>> it was not considered to enlarge the table to fit.
>>
>> The fix in this patch is rather "brute force" and perhaps later should be
>> done in a smarter way, by extracting and consolidating the part 
>> version to
>> size logic to a common helper, instead of brute forcing the largest
>> possible allocation. Nevertheless, for now this works and fixes the 
>> out of
>> bounds write.
>>
>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>> Fixes: 41cec40bc9ba ("drm/amd/pm: Vangogh: Add new gpu_metrics_v2_4 to 
>> acquire gpu_metrics")
>> Cc: Mario Limonciello <mario.limonciello@amd.com>
>> Cc: Evan Quan <evan.quan@amd.com>
>> Cc: Wenyou Yang <WenYou.Yang@amd.com>
>> Cc: Alex Deucher <alexander.deucher@amd.com>
>> Cc: <stable@vger.kernel.org> # v6.6+
>> ---
>>   drivers/gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c | 5 ++++-
>>   1 file changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c 
>> b/drivers/gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c
>> index 22737b11b1bf..36f4a4651918 100644
>> --- a/drivers/gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c
>> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c
>> @@ -242,7 +242,10 @@ static int vangogh_tables_init(struct smu_context 
>> *smu)
>>           goto err0_out;
>>       smu_table->metrics_time = 0;
>> -    smu_table->gpu_metrics_table_size = max(sizeof(struct 
>> gpu_metrics_v2_3), sizeof(struct gpu_metrics_v2_2));
>> +    smu_table->gpu_metrics_table_size = sizeof(struct gpu_metrics_v2_2);
>> +    smu_table->gpu_metrics_table_size = 
>> max(smu_table->gpu_metrics_table_size, sizeof(struct gpu_metrics_v2_3));
>> +    smu_table->gpu_metrics_table_size = 
>> max(smu_table->gpu_metrics_table_size, sizeof(struct gpu_metrics_v2_4));
>> +    smu_table->gpu_metrics_table_size = 
>> max(smu_table->gpu_metrics_table_size, sizeof(struct gpu_metrics_v3_0));
> 
> AFAICT Van Gogh only supports 2.2, 2.3 or 2.4.  I don't think there is a 
> need to compare to 3.0 to solve this bug this way.

Gotcha.

> But generally yeah moving the initialization to a helper that actually 
> knows the size would be another way to solve this.

Yeah I was looking at smu_cmn_init_soft_gpu_metrics() and how it has the 
data of how large table needs to be for each frev+crev combo. But didn't 
go as far as trying to figure out if frev+crev would be available at 
table allocation time.

> Or looking at it how about moving all the conditional code in 
> vangogh_common_get_gpu_metrics() into vangogh_tables_init() and then 
> assigning a vfunc for vangogh_common_get_gpu_metrics() to call?

I did not quite follow. Happy to work on it though and I can look into 
it with fresh eyes but not next week, but the week after.

Or in the meantime if you want me to respin this fix just with v3_0 line 
removed I can do that today.

Regards,

Tvrtko

>>       smu_table->gpu_metrics_table = 
>> kzalloc(smu_table->gpu_metrics_table_size, GFP_KERNEL);
>>       if (!smu_table->gpu_metrics_table)
>>           goto err1_out;
> 

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] drm/amd/pm: Vangogh: Fix kernel memory out of bounds write
  2024-10-25 14:40   ` Tvrtko Ursulin
@ 2024-10-25 17:58     ` Mario Limonciello
  0 siblings, 0 replies; 4+ messages in thread
From: Mario Limonciello @ 2024-10-25 17:58 UTC (permalink / raw)
  To: Tvrtko Ursulin, Tvrtko Ursulin, amd-gfx
  Cc: kernel-dev, Evan Quan, Wenyou Yang, Alex Deucher, stable

On 10/25/2024 09:40, Tvrtko Ursulin wrote:
> 
> On 25/10/2024 15:23, Mario Limonciello wrote:
>> On 10/25/2024 09:15, Tvrtko Ursulin wrote:
>>> From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>>>
>>> KASAN reports that the GPU metrics table allocated in
>>> vangogh_tables_init() is not large enough for the memset done in
>>> smu_cmn_init_soft_gpu_metrics(). Condensed report follows:
>>>
>>> [   33.861314] BUG: KASAN: slab-out-of-bounds in 
>>> smu_cmn_init_soft_gpu_metrics+0x73/0x200 [amdgpu]
>>> [   33.861799] Write of size 168 at addr ffff888129f59500 by task 
>>> mangoapp/1067
>>> ...
>>> [   33.861808] CPU: 6 UID: 1000 PID: 1067 Comm: mangoapp Tainted: 
>>> G        W          6.12.0-rc4 #356 
>>> 1a56f59a8b5182eeaf67eb7cb8b13594dd23b544
>>> [   33.861816] Tainted: [W]=WARN
>>> [   33.861818] Hardware name: Valve Galileo/Galileo, BIOS F7G0107 
>>> 12/01/2023
>>> [   33.861822] Call Trace:
>>> [   33.861826]  <TASK>
>>> [   33.861829]  dump_stack_lvl+0x66/0x90
>>> [   33.861838]  print_report+0xce/0x620
>>> [   33.861853]  kasan_report+0xda/0x110
>>> [   33.862794]  kasan_check_range+0xfd/0x1a0
>>> [   33.862799]  __asan_memset+0x23/0x40
>>> [   33.862803]  smu_cmn_init_soft_gpu_metrics+0x73/0x200 [amdgpu 
>>> 13b1bc364ec578808f676eba412c20eaab792779]
>>> [   33.863306]  vangogh_get_gpu_metrics_v2_4+0x123/0xad0 [amdgpu 
>>> 13b1bc364ec578808f676eba412c20eaab792779]
>>> [   33.864257]  vangogh_common_get_gpu_metrics+0xb0c/0xbc0 [amdgpu 
>>> 13b1bc364ec578808f676eba412c20eaab792779]
>>> [   33.865682]  amdgpu_dpm_get_gpu_metrics+0xcc/0x110 [amdgpu 
>>> 13b1bc364ec578808f676eba412c20eaab792779]
>>> [   33.866160]  amdgpu_get_gpu_metrics+0x154/0x2d0 [amdgpu 
>>> 13b1bc364ec578808f676eba412c20eaab792779]
>>> [   33.867135]  dev_attr_show+0x43/0xc0
>>> [   33.867147]  sysfs_kf_seq_show+0x1f1/0x3b0
>>> [   33.867155]  seq_read_iter+0x3f8/0x1140
>>> [   33.867173]  vfs_read+0x76c/0xc50
>>> [   33.867198]  ksys_read+0xfb/0x1d0
>>> [   33.867214]  do_syscall_64+0x90/0x160
>>> ...
>>> [   33.867353] Allocated by task 378 on cpu 7 at 22.794876s:
>>> [   33.867358]  kasan_save_stack+0x33/0x50
>>> [   33.867364]  kasan_save_track+0x17/0x60
>>> [   33.867367]  __kasan_kmalloc+0x87/0x90
>>> [   33.867371]  vangogh_init_smc_tables+0x3f9/0x840 [amdgpu]
>>> [   33.867835]  smu_sw_init+0xa32/0x1850 [amdgpu]
>>> [   33.868299]  amdgpu_device_init+0x467b/0x8d90 [amdgpu]
>>> [   33.868733]  amdgpu_driver_load_kms+0x19/0xf0 [amdgpu]
>>> [   33.869167]  amdgpu_pci_probe+0x2d6/0xcd0 [amdgpu]
>>> [   33.869608]  local_pci_probe+0xda/0x180
>>> [   33.869614]  pci_device_probe+0x43f/0x6b0
>>>
>>> Empirically we can confirm that the former allocates 152 bytes for the
>>> table, while the latter memsets the 168 large block.
>>>
>>> This is somewhat alleviated by the fact that allocation goes into a 192
>>> SLAB bucket, but then for v3_0 metrics the table grows to 264 bytes 
>>> which
>>> would definitely be a problem.
>>>
>>> Root cause appears that when GPU metrics tables for v2_4 parts were 
>>> added
>>> it was not considered to enlarge the table to fit.
>>>
>>> The fix in this patch is rather "brute force" and perhaps later 
>>> should be
>>> done in a smarter way, by extracting and consolidating the part 
>>> version to
>>> size logic to a common helper, instead of brute forcing the largest
>>> possible allocation. Nevertheless, for now this works and fixes the 
>>> out of
>>> bounds write.
>>>
>>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>>> Fixes: 41cec40bc9ba ("drm/amd/pm: Vangogh: Add new gpu_metrics_v2_4 
>>> to acquire gpu_metrics")
>>> Cc: Mario Limonciello <mario.limonciello@amd.com>
>>> Cc: Evan Quan <evan.quan@amd.com>
>>> Cc: Wenyou Yang <WenYou.Yang@amd.com>
>>> Cc: Alex Deucher <alexander.deucher@amd.com>
>>> Cc: <stable@vger.kernel.org> # v6.6+
>>> ---
>>>   drivers/gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c | 5 ++++-
>>>   1 file changed, 4 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c b/ 
>>> drivers/gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c
>>> index 22737b11b1bf..36f4a4651918 100644
>>> --- a/drivers/gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c
>>> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c
>>> @@ -242,7 +242,10 @@ static int vangogh_tables_init(struct 
>>> smu_context *smu)
>>>           goto err0_out;
>>>       smu_table->metrics_time = 0;
>>> -    smu_table->gpu_metrics_table_size = max(sizeof(struct 
>>> gpu_metrics_v2_3), sizeof(struct gpu_metrics_v2_2));
>>> +    smu_table->gpu_metrics_table_size = sizeof(struct 
>>> gpu_metrics_v2_2);
>>> +    smu_table->gpu_metrics_table_size = max(smu_table- 
>>> >gpu_metrics_table_size, sizeof(struct gpu_metrics_v2_3));
>>> +    smu_table->gpu_metrics_table_size = max(smu_table- 
>>> >gpu_metrics_table_size, sizeof(struct gpu_metrics_v2_4));
>>> +    smu_table->gpu_metrics_table_size = max(smu_table- 
>>> >gpu_metrics_table_size, sizeof(struct gpu_metrics_v3_0));
>>
>> AFAICT Van Gogh only supports 2.2, 2.3 or 2.4.  I don't think there is 
>> a need to compare to 3.0 to solve this bug this way.
> 
> Gotcha.
> 
>> But generally yeah moving the initialization to a helper that actually 
>> knows the size would be another way to solve this.
> 
> Yeah I was looking at smu_cmn_init_soft_gpu_metrics() and how it has the 
> data of how large table needs to be for each frev+crev combo. But didn't 
> go as far as trying to figure out if frev+crev would be available at 
> table allocation time.
> 
>> Or looking at it how about moving all the conditional code in 
>> vangogh_common_get_gpu_metrics() into vangogh_tables_init() and then 
>> assigning a vfunc for vangogh_common_get_gpu_metrics() to call?
> 
> I did not quite follow. Happy to work on it though and I can look into 
> it with fresh eyes but not next week, but the week after.
> 
> Or in the meantime if you want me to respin this fix just with v3_0 line 
> removed I can do that today.

Yeah thanks for the respin.

Regarding what my suggestion was let me try to explain it.

1. Whenever vangogh_common_get_gpu_metrics() is called it will go down 
into the functions to gather data that match either 2.2, 2.3 or 2.4 
based upon the platform firmware.
2. Instead of querying in vangogh_common_get_gpu_metrics(), there can be 
a function pointer that vangogh_common_get_gpu_metrics() calls to make 
it much simpler.
3. vangogh_tables_init() can then detect whether it's 2.2, 2.3, or 2.4
    instead.
    By doing it this way, you can allocate the exactly correct size
    structure when it's initialized and you can set that function pointer
    that vangogh_common_get_gpu_metrics() would use during init.


> 
> Regards,
> 
> Tvrtko
> 
>>>       smu_table->gpu_metrics_table = kzalloc(smu_table- 
>>> >gpu_metrics_table_size, GFP_KERNEL);
>>>       if (!smu_table->gpu_metrics_table)
>>>           goto err1_out;
>>


^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2024-10-26 15:51 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-10-25 14:15 [PATCH] drm/amd/pm: Vangogh: Fix kernel memory out of bounds write Tvrtko Ursulin
2024-10-25 14:23 ` Mario Limonciello
2024-10-25 14:40   ` Tvrtko Ursulin
2024-10-25 17:58     ` Mario Limonciello

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox