linux-kernel.vger.kernel.org archive mirror
* [BUG] NULL pointer dereference in sev_writeback_caches during KVM SEV migration kselftest on AMD platform
@ 2025-07-14  5:21 Aithal, Srikanth
  2025-07-14 14:12 ` Zheyun Shen
  0 siblings, 1 reply; 8+ messages in thread
From: Aithal, Srikanth @ 2025-07-14  5:21 UTC (permalink / raw)
  To: seanjc, szy0127, linux-next, kvm, linux-kernel

Hello,

While running the SEV migration kselftest (sev_migrate_tests) on
linux-next (6.16.0-rc5-next-20250711, commit a62b7a37e6) on AMD-based
platforms (Milan, Genoa, Turin), I encountered the kernel crash below:

[ 714.008402] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 714.015363] #PF: supervisor read access in kernel mode
[ 714.020504] #PF: error_code(0x0000) - not-present page
[ 714.025643] PGD 11364b067 P4D 11364b067 PUD 12e195067 PMD 0
[ 714.031303] Oops: Oops: 0000 [#1] SMP NOPTI
[ 714.035487] CPU: 14 UID: 0 PID: 16663 Comm: sev_migrate_tes Not tainted 6.16.0-rc5-next-20250711-a62b7a37e6-42f78243e0c #1 PREEMPT(voluntary)
[ 714.048253] Hardware name: Dell Inc. PowerEdge R6515/07PXPY, BIOS 2.17.0 12/04/2024
[ 714.055905] RIP: 0010:_find_first_bit+0x1d/0x40
[ 714.060439] Code: 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 48 89 f0 48 85 f6 74 2d 31 d2 eb 0d 48 83 c2 40 48 83 c7 08 48 39 c2 73 1c <48> 8b 0f 48 85 c9 74 eb f3 48 0f bc c9 48 01 ca 48 39 d0 48 0f 47
[ 714.079184] RSP: 0018:ffffb9a769b7fdc8 EFLAGS: 00010246
[ 714.084409] RAX: 0000000000000080 RBX: ffff95e0a54fe000 RCX: 000000000000f7ff
[ 714.091541] RDX: 0000000000000000 RSI: 0000000000000080 RDI: 0000000000000000
[ 714.098674] RBP: ffffb9a769b7fde0 R08: ffff95e0a54ff670 R09: 00000000000002aa
[ 714.105807] R10: ffff95ff801b7ec0 R11: 0000000000000086 R12: 0000000000000080
[ 714.112939] R13: 0000000000000000 R14: ffff95e0a54fe000 R15: ffff95e087e8ac98
[ 714.120072] FS: 00007fd51a0f5740(0000) GS:ffff95ffd53b0000(0000) knlGS:0000000000000000
[ 714.128156] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 714.133902] CR2: 0000000000000000 CR3: 000000014f670003 CR4: 0000000000770ef0
[ 714.141035] PKRU: 55555554
[ 714.143750] Call Trace:
[ 714.146201] <TASK>
[ 714.148307] ? sev_writeback_caches+0x25/0x40 [kvm_amd]
[ 714.153544] sev_guest_memory_reclaimed+0x34/0x40 [kvm_amd]
[ 714.159115] kvm_arch_guest_memory_reclaimed+0x12/0x20 [kvm]
[ 714.164817] kvm_mmu_notifier_release+0x3c/0x60 [kvm]
[ 714.169896] mmu_notifier_unregister+0x53/0xf0
[ 714.174343] kvm_destroy_vm+0x12d/0x2d0 [kvm]
[ 714.178727] kvm_vm_stats_release+0x34/0x60 [kvm]
[ 714.183459] __fput+0xf2/0x2d0
[ 714.186520] fput_close_sync+0x44/0xa0
[ 714.190269] __x64_sys_close+0x42/0x80
[ 714.194024] x64_sys_call+0x1960/0x2180
[ 714.197861] do_syscall_64+0x56/0x1e0
[ 714.201530] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 714.206579] RIP: 0033:0x7fd519efe717
[ 714.210161] Code: ff e8 6d ec 01 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 41 c3 48 83 ec 18 89 7c 24 0c e8 a3 83 f8 ff
[ 714.228906] RSP: 002b:00007fffbb2193e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
[ 714.236472] RAX: ffffffffffffffda RBX: 0000000002623f48 RCX: 00007fd519efe717
[ 714.243604] RDX: 0000000000420146 RSI: 000000000041f05e RDI: 0000000000000029
[ 714.250737] RBP: 0000000002622e80 R08: 0000000000000000 R09: 000000000042013e
[ 714.257869] R10: 00007fd519fb83dd R11: 0000000000000246 R12: 0000000002623ed8
[ 714.265000] R13: 0000000002623ed8 R14: 000000000042fe08 R15: 00007fd51a147000
[ 714.272136] </TASK>
[ 714.274326] Modules linked in: nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink sunrpc nls_iso8859_1 amd_atl intel_rapl_msr intel_rapl_common amd64_edac ipmi_ssif ee1004 kvm_amd kvm rapl wmi_bmof i2c_piix4 pcspkr acpi_power_meter efi_pstore ipmi_si k10temp i2c_smbus acpi_ipmi ipmi_devintf ipmi_msghandler mac_hid sch_fq_codel dmi_sysfs xfs mgag200 drm_client_lib i2c_algo_bit drm_shmem_helper drm_kms_helper ghash_clmulni_intel mpt3sas sha1_ssse3 raid_class drm tg3 ccp scsi_transport_sas sp5100_tco wmi dm_mirror dm_region_hash dm_log msr autofs4 aesni_intel
[ 714.336656] CR2: 0000000000000000
[ 714.339975] ---[ end trace 0000000000000000 ]---
[ 714.379956] pstore: backend (erst) writing error (-28)
[ 714.385093] RIP: 0010:_find_first_bit+0x1d/0x40
[ 714.389625] Code: 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 48 89 f0 48 85 f6 74 2d 31 d2 eb 0d 48 83 c2 40 48 83 c7 08 48 39 c2 73 1c <48> 8b 0f 48 85 c9 74 eb f3 48 0f bc c9 48 01 ca 48 39 d0 48 0f 47
[ 714.408370] RSP: 0018:ffffb9a769b7fdc8 EFLAGS: 00010246
[ 714.413595] RAX: 0000000000000080 RBX: ffff95e0a54fe000 RCX: 000000000000f7ff
[ 714.420729] RDX: 0000000000000000 RSI: 0000000000000080 RDI: 0000000000000000
[ 714.427862] RBP: ffffb9a769b7fde0 R08: ffff95e0a54ff670 R09: 00000000000002aa
[ 714.434992] R10: ffff95ff801b7ec0 R11: 0000000000000086 R12: 0000000000000080
[ 714.442126] R13: 0000000000000000 R14: ffff95e0a54fe000 R15: ffff95e087e8ac98
[ 714.449257] FS: 00007fd51a0f5740(0000) GS:ffff95ffd53b0000(0000) knlGS:0000000000000000
[ 714.457344] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 714.463090] CR2: 0000000000000000 CR3: 000000014f670003 CR4: 0000000000770ef0
[ 714.470223] PKRU: 55555554
[ 714.472936] note: sev_migrate_tes[16663] exited with irqs disabled
[ 714.479189] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 714.486145] #PF: supervisor read access in kernel mode
[ 714.491281] #PF: error_code(0x0000) - not-present page
[ 714.496421] PGD 11364b067 P4D 11364b067 PUD 12e195067 PMD 0
[ 714.502082] Oops: Oops: 0000 [#2] SMP NOPTI
[ 714.506267] CPU: 14 UID: 0 PID: 16663 Comm: sev_migrate_tes Tainted: G D 6.16.0-rc5-next-20250711-a62b7a37e6-42f78243e0c #1 PREEMPT(voluntary)
[ 714.520593] Tainted: [D]=DIE
[ 714.523477] Hardware name: Dell Inc. PowerEdge R6515/07PXPY, BIOS 2.17.0 12/04/2024
[ 714.531131] RIP: 0010:_find_first_bit+0x1d/0x40
[ 714.535662] Code: 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 48 89 f0 48 85 f6 74 2d 31 d2 eb 0d 48 83 c2 40 48 83 c7 08 48 39 c2 73 1c <48> 8b 0f 48 85 c9 74 eb f3 48 0f bc c9 48 01 ca 48 39 d0 48 0f 47
[ 714.554409] RSP: 0018:ffffb9a769b7fcd0 EFLAGS: 00010246
[ 714.559635] RAX: 0000000000000080 RBX: ffff95e0a54fe000 RCX: 0000000000000000
[ 714.566768] RDX: 0000000000000000 RSI: 0000000000000080 RDI: 0000000000000000
[ 714.573900] RBP: ffffb9a769b7fce8 R08: ffff95e0a54ff670 R09: 0000000080100001
[ 714.581033] R10: 0000000000020000 R11: 0000000000000000 R12: 0000000000000080
[ 714.588165] R13: 0000000000000000 R14: ffff95e0a54fe000 R15: ffff95e089d95a08
[ 714.595296] FS: 0000000000000000(0000) GS:ffff95ffd53b0000(0000) knlGS:0000000000000000
[ 714.603381] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 714.609130] CR2: 0000000000000000 CR3: 000000014f670003 CR4: 0000000000770ef0
[ 714.616260] PKRU: 55555554
[ 714.618963] Call Trace:
[ 714.621407] <TASK>
[ 714.623516] ? sev_writeback_caches+0x25/0x40 [kvm_amd]
[ 714.628741] sev_guest_memory_reclaimed+0x34/0x40 [kvm_amd]
[ 714.634315] kvm_arch_guest_memory_reclaimed+0x12/0x20 [kvm]
[ 714.640008] kvm_mmu_notifier_release+0x3c/0x60 [kvm]
[ 714.645088] __mmu_notifier_release+0x73/0x1e0
[ 714.649532] ? srso_alias_return_thunk+0x5/0xfbef5
[ 714.654323] ? sched_clock_cpu+0x14/0x1a0
[ 714.658338] exit_mmap+0x3b1/0x400
[ 714.661745] ? srso_alias_return_thunk+0x5/0xfbef5
[ 714.666536] ? futex_cleanup+0xb0/0x460
[ 714.670375] ? srso_alias_return_thunk+0x5/0xfbef5
[ 714.675166] ? perf_event_exit_task_context+0x33/0x280
[ 714.680307] ? srso_alias_return_thunk+0x5/0xfbef5
[ 714.685100] ? srso_alias_return_thunk+0x5/0xfbef5
[ 714.689890] ? mutex_lock+0x17/0x50
[ 714.693383] ? srso_alias_return_thunk+0x5/0xfbef5
[ 714.698177] mmput+0x6a/0x130
[ 714.701148] do_exit+0x258/0xa40
[ 714.704385] make_task_dead+0x85/0x160
[ 714.708134] rewind_stack_and_make_dead+0x16/0x20
[ 714.712951] RIP: 0033:0x7fd519efe717
[ 714.716532] Code: Unable to access opcode bytes at 0x7fd519efe6ed.
[ 714.722710] RSP: 002b:00007fffbb2193e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
[ 714.730276] RAX: ffffffffffffffda RBX: 0000000002623f48 RCX: 00007fd519efe717
[ 714.737409] RDX: 0000000000420146 RSI: 000000000041f05e RDI: 0000000000000029
[ 714.744543] RBP: 0000000002622e80 R08: 0000000000000000 R09: 000000000042013e
[ 714.751673] R10: 00007fd519fb83dd R11: 0000000000000246 R12: 0000000002623ed8
[ 714.758807] R13: 0000000002623ed8 R14: 000000000042fe08 R15: 00007fd51a147000
[ 714.765942] </TASK>
[ 714.768132] Modules linked in: nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink sunrpc nls_iso8859_1 amd_atl intel_rapl_msr intel_rapl_common amd64_edac ipmi_ssif ee1004 kvm_amd kvm rapl wmi_bmof i2c_piix4 pcspkr acpi_power_meter efi_pstore ipmi_si k10temp i2c_smbus acpi_ipmi ipmi_devintf ipmi_msghandler mac_hid sch_fq_codel dmi_sysfs xfs mgag200 drm_client_lib i2c_algo_bit drm_shmem_helper drm_kms_helper ghash_clmulni_intel mpt3sas sha1_ssse3 raid_class drm tg3 ccp scsi_transport_sas sp5100_tco wmi dm_mirror dm_region_hash dm_log msr autofs4 aesni_intel
[ 714.830455] CR2: 0000000000000000
[ 714.833773] ---[ end trace 0000000000000000 ]---
[ 714.886371] RIP: 0010:_find_first_bit+0x1d/0x40
[ 714.890899] Code: 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 48 89 f0 48 85 f6 74 2d 31 d2 eb 0d 48 83 c2 40 48 83 c7 08 48 39 c2 73 1c <48> 8b 0f 48 85 c9 74 eb f3 48 0f bc c9 48 01 ca 48 39 d0 48 0f 47
[ 714.909647] RSP: 0018:ffffb9a769b7fdc8 EFLAGS: 00010246
[ 714.914871] RAX: 0000000000000080 RBX: ffff95e0a54fe000 RCX: 000000000000f7ff
[ 714.922004] RDX: 0000000000000000 RSI: 0000000000000080 RDI: 0000000000000000
[ 714.929138] RBP: ffffb9a769b7fde0 R08: ffff95e0a54ff670 R09: 00000000000002aa
[ 714.936271] R10: ffff95ff801b7ec0 R11: 0000000000000086 R12: 0000000000000080
[ 714.943400] R13: 0000000000000000 R14: ffff95e0a54fe000 R15: ffff95e087e8ac98
[ 714.950527] FS: 0000000000000000(0000) GS:ffff95ffd53b0000(0000) knlGS:0000000000000000
[ 714.958613] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 714.964357] CR2: 0000000000000000 CR3: 000000014f670003 CR4: 0000000000770ef0
[ 714.971490] PKRU: 55555554
[ 714.974202] note: sev_migrate_tes[16663] exited with irqs disabled
[ 714.980397] Fixing recursive fault but reboot is needed!
[ 714.985708] BUG: scheduling while atomic: sev_migrate_tes/16663/0x00000000
[ 714.992580] Modules linked in: nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink sunrpc nls_iso8859_1 amd_atl intel_rapl_msr intel_rapl_common amd64_edac ipmi_ssif ee1004 kvm_amd kvm rapl wmi_bmof i2c_piix4 pcspkr acpi_power_meter efi_pstore ipmi_si k10temp i2c_smbus acpi_ipmi ipmi_devintf ipmi_msghandler mac_hid sch_fq_codel dmi_sysfs xfs mgag200 drm_client_lib i2c_algo_bit drm_shmem_helper drm_kms_helper ghash_clmulni_intel mpt3sas sha1_ssse3 raid_class drm tg3 ccp scsi_transport_sas sp5100_tco wmi dm_mirror dm_region_hash dm_log msr autofs4 aesni_intel
[ 715.054914] CPU: 14 UID: 0 PID: 16663 Comm: sev_migrate_tes Tainted: G D 6.16.0-rc5-next-20250711-a62b7a37e6-42f78243e0c #1 PREEMPT(voluntary)
[ 715.054918] Tainted: [D]=DIE
[ 715.054920] Hardware name: Dell Inc. PowerEdge R6515/07PXPY, BIOS 2.17.0 12/04/2024
[ 715.054921] Call Trace:
[ 715.054922] <TASK>
[ 715.054923] dump_stack_lvl+0x70/0x90
[ 715.054928] dump_stack+0x14/0x20
[ 715.054931] __schedule_bug+0x5a/0x70
[ 715.054934] __schedule+0xa0d/0xb30
[ 715.054938] ? srso_alias_return_thunk+0x5/0xfbef5
[ 715.054941] ? vprintk_default+0x21/0x30
[ 715.054944] ? srso_alias_return_thunk+0x5/0xfbef5
[ 715.054946] ? vprintk+0x1c/0x50
[ 715.054949] ? srso_alias_return_thunk+0x5/0xfbef5
[ 715.054952] do_task_dead+0x4e/0xa0
[ 715.054956] make_task_dead+0x146/0x160
[ 715.054960] rewind_stack_and_make_dead+0x16/0x20
[ 715.054962] RIP: 0033:0x7fd519efe717
[ 715.054964] Code: Unable to access opcode bytes at 0x7fd519efe6ed.
[ 715.054965] RSP: 002b:00007fffbb2193e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
[ 715.054967] RAX: ffffffffffffffda RBX: 0000000002623f48 RCX: 00007fd519efe717
[ 715.054968] RDX: 0000000000420146 RSI: 000000000041f05e RDI: 0000000000000029
[ 715.054970] RBP: 0000000002622e80 R08: 0000000000000000 R09: 000000000042013e
[ 715.054971] R10: 00007fd519fb83dd R11: 0000000000000246 R12: 0000000002623ed8
[ 715.054972] R13: 0000000002623ed8 R14: 000000000042fe08 R15: 00007fd51a147000
[ 715.054978] </TASK>


Below is the culprit commit:

commit d6581b6f2e2622f0fc350020a8e991e8be6b05d8
Author: Zheyun Shen <szy0127@sjtu.edu.cn>
Date:   Thu May 22 16:37:32 2025 -0700

    KVM: SVM: Flush cache only on CPUs running SEV guest
    Link: https://lore.kernel.org/r/20250522233733.3176144-9-seanjc@google.com

The issue goes away if I revert the above commit.
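
For context: the commit above makes sev_writeback_caches() flush only a
per-VM mask of CPUs that have done VMRUN, instead of blasting WBNOINVD on
all CPUs. A rough sketch of the shape of the change, reconstructed from the
full patch quoted later in this thread (not the verbatim diff):

struct kvm_sev_info {
	/* ... existing fields ... */
	cpumask_var_t have_run_cpus; /* CPUs that have done VMRUN for this VM */
};

static void sev_writeback_caches(struct kvm *kvm)
{
	/* Flush only CPUs that may hold dirty lines tagged with this VM's ASID. */
	wbnoinvd_on_cpus_mask(to_kvm_sev_info(kvm)->have_run_cpus);
}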

Regards,
Srikanth Aithal <sraithal@amd.com>


* Re: [BUG] NULL pointer dereference in sev_writeback_caches during KVM SEV migration kselftest on AMD platform
  2025-07-14  5:21 [BUG] NULL pointer dereference in sev_writeback_caches during KVM SEV migration kselftest on AMD platform Aithal, Srikanth
@ 2025-07-14 14:12 ` Zheyun Shen
  2025-07-14 14:48   ` Sean Christopherson
  0 siblings, 1 reply; 8+ messages in thread
From: Zheyun Shen @ 2025-07-14 14:12 UTC (permalink / raw)
  To: Aithal, Srikanth; +Cc: seanjc, linux-next, kvm, linux-kernel

Hi Aithal,
I can reproduce this issue in my environment, and I will try to resolve it as soon as possible.

Thanks,
Zheyun Shen

> On Jul 14, 2025, at 13:21, Aithal, Srikanth <sraithal@amd.com> wrote:
> 
> Hello,
> 
> While running the SEV migration kselftest (sev_migrate_tests) on linux-next (6.16.0-rc5-next-20250711, commit a62b7a37e6) on AMD-based platforms (Milan, Genoa, Turin), I encountered the kernel crash below:
> 
> [ 714.008402] BUG: kernel NULL pointer dereference, address: 0000000000000000

..

> [ 714.148307] ? sev_writeback_caches+0x25/0x40 [kvm_amd]
> [ 714.153544] sev_guest_memory_reclaimed+0x34/0x40 [kvm_amd]
> [ 714.159115] kvm_arch_guest_memory_reclaimed+0x12/0x20 [kvm]
> [ 714.164817] kvm_mmu_notifier_release+0x3c/0x60 [kvm]
> [ 714.169896] mmu_notifier_unregister+0x53/0xf0
> [ 714.174343] kvm_destroy_vm+0x12d/0x2d0 [kvm]

..
> 
> 
> Below is the culprit commit:
> 
> commit d6581b6f2e2622f0fc350020a8e991e8be6b05d8
> Author: Zheyun Shen <szy0127@sjtu.edu.cn>
> Date:   Thu May 22 16:37:32 2025 -0700
> 
>     KVM: SVM: Flush cache only on CPUs running SEV guest
>     Link: https://lore.kernel.org/r/20250522233733.3176144-9-seanjc@google.com
> 
> The issue goes away if I revert the above commit.
> 
> Regards,
> Srikanth Aithal <sraithal@amd.com>



* Re: [BUG] NULL pointer dereference in sev_writeback_caches during KVM SEV migration kselftest on AMD platform
  2025-07-14 14:12 ` Zheyun Shen
@ 2025-07-14 14:48   ` Sean Christopherson
  2025-07-14 14:56     ` Zheyun Shen
  0 siblings, 1 reply; 8+ messages in thread
From: Sean Christopherson @ 2025-07-14 14:48 UTC (permalink / raw)
  To: Zheyun Shen; +Cc: Srikanth Aithal, linux-next, kvm, linux-kernel

On Mon, Jul 14, 2025, Zheyun Shen wrote:
> Hi Aithal,
> I can reproduce this issue in my environment, and I will try to resolve it as
> soon as possible.

Phew, that's good, because I can't repro this, and I don't see anything obviously
wrong.

> > On Jul 14, 2025, at 13:21, Aithal, Srikanth <sraithal@amd.com> wrote:
> > 
> > Hello,
> > 
> > While running the SEV migration kselftest (sev_migrate_tests) on
> > linux-next (6.16.0-rc5-next-20250711, commit a62b7a37e6) on AMD-based
> > platforms (Milan, Genoa, Turin), I encountered the kernel crash below:
> > 
> > [ 714.008402] BUG: kernel NULL pointer dereference, address: 0000000000000000
> > [ 714.015363] #PF: supervisor read access in kernel mode
> > [ 714.020504] #PF: error_code(0x0000) - not-present page
> > [ 714.025643] PGD 11364b067 P4D 11364b067 PUD 12e195067 PMD 0
> > [ 714.031303] Oops: Oops: 0000 [#1] SMP NOPTI
> > [ 714.035487] CPU: 14 UID: 0 PID: 16663 Comm: sev_migrate_tes Not tainted 6.16.0-rc5-next-20250711-a62b7a37e6-42f78243e0c #1 PREEMPT(voluntary)
> > [ 714.048253] Hardware name: Dell Inc. PowerEdge R6515/07PXPY, BIOS 2.17.0 12/04/2024
> > [ 714.055905] RIP: 0010:_find_first_bit+0x1d/0x40

..

> > [ 714.148307] ? sev_writeback_caches+0x25/0x40 [kvm_amd]
> > [ 714.153544] sev_guest_memory_reclaimed+0x34/0x40 [kvm_amd]
> > [ 714.159115] kvm_arch_guest_memory_reclaimed+0x12/0x20 [kvm]
> > [ 714.164817] kvm_mmu_notifier_release+0x3c/0x60 [kvm]
> > [ 714.169896] mmu_notifier_unregister+0x53/0xf0
> > [ 714.174343] kvm_destroy_vm+0x12d/0x2d0 [kvm]
> > [ 714.178727] kvm_vm_stats_release+0x34/0x60 [kvm]
> > [ 714.183459] __fput+0xf2/0x2d0
> > [ 714.186520] fput_close_sync+0x44/0xa0
> > [ 714.190269] __x64_sys_close+0x42/0x80
> > [ 714.194024] x64_sys_call+0x1960/0x2180
> > [ 714.197861] do_syscall_64+0x56/0x1e0
> > [ 714.201530] entry_SYSCALL_64_after_hwframe+0x76/0x7e


* Re: [BUG] NULL pointer dereference in sev_writeback_caches during KVM SEV migration kselftest on AMD platform
  2025-07-14 14:48   ` Sean Christopherson
@ 2025-07-14 14:56     ` Zheyun Shen
  2025-07-14 15:14       ` Sean Christopherson
  0 siblings, 1 reply; 8+ messages in thread
From: Zheyun Shen @ 2025-07-14 14:56 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Srikanth Aithal, linux-next, kvm, linux-kernel

The problem is triggered by the following code in tools/testing/selftests/kvm/x86/sev_migrate_tests.c:
static void test_sev_migrate_from(bool es)
{
	struct kvm_vm *src_vm;
	struct kvm_vm *dst_vms[NR_MIGRATE_TEST_VMS];
	int i, ret;

	src_vm = sev_vm_create(es);
	for (i = 0; i < NR_MIGRATE_TEST_VMS; ++i)
		dst_vms[i] = aux_vm_create(true);

	/* Initial migration from the src to the first dst. */
	sev_migrate_from(dst_vms[0], src_vm);

	for (i = 1; i < NR_MIGRATE_TEST_VMS; i++)
		sev_migrate_from(dst_vms[i], dst_vms[i - 1]);

	/* Migrate the guest back to the original VM. */
	ret = __sev_migrate_from(src_vm, dst_vms[NR_MIGRATE_TEST_VMS - 1]);
	TEST_ASSERT(ret == -1 && errno == EIO,
		    "VM that was migrated from should be dead. ret %d, errno: %d", ret,
		    errno);

	kvm_vm_free(src_vm);
	for (i = 0; i < NR_MIGRATE_TEST_VMS; ++i)
		kvm_vm_free(dst_vms[i]);
}

I added some logs in KVM and the following shows the result:
[   51.618135] sev guest init kvm:ff177f272432e000                                                                                  
[   51.627235] kvm destory vm kvm:ff177f272432e000                                                                                   
[   51.628011] kvm destory vm mmu notifier unregister kvm:ff177f272432e000                                                          
[   51.642840] kvm destory vm arch destory vm kvm:ff177f272432e000                                                                  
[   51.673612] vm destory x86                                                                                                       
[   51.673957] svm vm destory                                                                                                       
[   51.674401] kvm destory vm kvm:ff177f272432c000                                                                                   
[   51.675152] kvm destory vm mmu notifier unregister kvm:ff177f272432c000                                                          
[   51.675981] kvm destory vm arch destory vm kvm:ff177f272432c000                                                                  
[   51.715937] vm destory x86                                                                                                       
[   51.716289] svm vm destory                                                                                                       
[   51.716754] kvm destory vm kvm:ff177f272432a000                                                                                   
[   51.717530] kvm destory vm mmu notifier unregister kvm:ff177f272432a000                                                          
[   51.718363] kvm destory vm arch destory vm kvm:ff177f272432a000                                                                  
[   51.746672] vm destory x86
[   51.747018] svm vm destory
[   51.747454] kvm destory vm kvm:ff177f2724328000
[   51.748219] kvm destory vm mmu notifier unregister kvm:ff177f2724328000
[   51.749033] BUG: kernel NULL pointer dereference, address: 0000000000000000
[   51.749885] #PF: supervisor read access in kernel mode
[   51.750519] #PF: error_code(0x0000) - not-present page

It seems that the cpumask structure is not transferred correctly from
ff177f272432e000 to ff177f2724328000. Unfortunately, I'm not familiar with
SEV migration, so I need to spend some time looking into how it works in
order to solve this issue.
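
A plausible reading of the crash, assuming CONFIG_CPUMASK_OFFSTACK=y on the
affected hosts so that cpumask_var_t is a pointer (a sketch of the suspected
path, not verified against the exact tree):

/*
 * aux_vm_create()            -> destination VMs are plain (non-SEV) VMs, so
 *                               __sev_guest_init() never runs for them and
 *                               have_run_cpus stays NULL.
 * sev_migrate_from(dst, src) -> dst becomes an SEV guest, but nothing
 *                               allocates dst's have_run_cpus.
 * kvm_vm_free(dst)           -> ... -> sev_guest_memory_reclaimed()
 *                                   -> sev_writeback_caches()
 *                                   -> cpumask_empty(NULL), i.e. the
 *                                      _find_first_bit() NULL deref above.
 */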

Thanks,
Zheyun Shen

> On Jul 14, 2025, at 22:48, Sean Christopherson <seanjc@google.com> wrote:
> 
> On Mon, Jul 14, 2025, Zheyun Shen wrote:
>> Hi Aithal,
>> I can reproduce this issue in my environment, and I will try to resolve it as
>> soon as possible.
> 
> Phew, that's good, because I can't repro this, and I don't see anything obviously
> wrong.
> 
>>> On Jul 14, 2025, at 13:21, Aithal, Srikanth <sraithal@amd.com> wrote:
>>> 
>>> Hello,
>>> 
>>> While running the SEV migration kselftest (sev_migrate_tests) on
>>> linux-next (6.16.0-rc5-next-20250711, commit a62b7a37e6) on AMD-based
>>> platforms (Milan, Genoa, Turin), I encountered the kernel crash below:
>>> 
>>> [ 714.008402] BUG: kernel NULL pointer dereference, address: 0000000000000000
>>> [ 714.015363] #PF: supervisor read access in kernel mode
>>> [ 714.020504] #PF: error_code(0x0000) - not-present page
>>> [ 714.025643] PGD 11364b067 P4D 11364b067 PUD 12e195067 PMD 0
>>> [ 714.031303] Oops: Oops: 0000 [#1] SMP NOPTI
>>> [ 714.035487] CPU: 14 UID: 0 PID: 16663 Comm: sev_migrate_tes Not tainted 6.16.0-rc5-next-20250711-a62b7a37e6-42f78243e0c #1 PREEMPT(voluntary)
>>> [ 714.048253] Hardware name: Dell Inc. PowerEdge R6515/07PXPY, BIOS 2.17.0 12/04/2024
>>> [ 714.055905] RIP: 0010:_find_first_bit+0x1d/0x40
> 
> ..
> 
>>> [ 714.148307] ? sev_writeback_caches+0x25/0x40 [kvm_amd]
>>> [ 714.153544] sev_guest_memory_reclaimed+0x34/0x40 [kvm_amd]
>>> [ 714.159115] kvm_arch_guest_memory_reclaimed+0x12/0x20 [kvm]
>>> [ 714.164817] kvm_mmu_notifier_release+0x3c/0x60 [kvm]
>>> [ 714.169896] mmu_notifier_unregister+0x53/0xf0
>>> [ 714.174343] kvm_destroy_vm+0x12d/0x2d0 [kvm]
>>> [ 714.178727] kvm_vm_stats_release+0x34/0x60 [kvm]
>>> [ 714.183459] __fput+0xf2/0x2d0
>>> [ 714.186520] fput_close_sync+0x44/0xa0
>>> [ 714.190269] __x64_sys_close+0x42/0x80
>>> [ 714.194024] x64_sys_call+0x1960/0x2180
>>> [ 714.197861] do_syscall_64+0x56/0x1e0
>>> [ 714.201530] entry_SYSCALL_64_after_hwframe+0x76/0x7e



* Re: [BUG] NULL pointer dereference in sev_writeback_caches during KVM SEV migration kselftest on AMD platform
  2025-07-14 14:56     ` Zheyun Shen
@ 2025-07-14 15:14       ` Sean Christopherson
  2025-07-14 16:50         ` Sean Christopherson
  0 siblings, 1 reply; 8+ messages in thread
From: Sean Christopherson @ 2025-07-14 15:14 UTC (permalink / raw)
  To: Zheyun Shen; +Cc: Srikanth Aithal, linux-next, kvm, linux-kernel

On Mon, Jul 14, 2025, Zheyun Shen wrote:
> The problem is triggered by the following code in tools/testing/selftests/kvm/x86/sev_migrate_tests.c:
..
> 
> I added some logs in KVM and the following shows the result:
> [   51.618135] sev guest init kvm:ff177f272432e000                                                                                  

Argh, I forgot that sev_vm_move_enc_context_from() requires the destination to
*not* be an SEV guest.  KVM needs to explicitly copy over the mask.
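
The relevant guard, paraphrased from sev_vm_move_enc_context_from() (the
exact check in the current tree may differ):

	/* The destination must not already be an SEV guest, so it never went
	 * through __sev_guest_init() and thus owns no have_run_cpus. */
	if (sev_guest(kvm) || !sev_guest(source_kvm))
		return -EINVAL;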

..

> It seems that the cpumask structure is not transferred correctly from
> ff177f272432e000 to ff177f2724328000.  But unfortunately I’m not familiar
> with SEV migration. I need to spend some time looking into how SEV migration
> works in order to solve this issue.

...

> >> I can reproduce this issue in my environment, and I will try to resolve it as
> >> soon as possible.
> > 
> > Phew, that's good, because I can't repro this, and I don't see anything obviously
> > wrong.

/facepalm

-ENOCOFFEE.  I was conflating CONFIG_VMAP_STACK with CONFIG_CPUMASK_OFFSTACK and
thus testing the wrong thing.
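
For reference, the two configs yield very different definitions of
cpumask_var_t (from <linux/cpumask.h>; the comments are mine):

#ifdef CONFIG_CPUMASK_OFFSTACK
typedef struct cpumask *cpumask_var_t;	 /* must be allocated; can be NULL */
#else
typedef struct cpumask cpumask_var_t[1]; /* inline storage; never NULL */
#endif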

I think this is the fix, testing now...

diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 95668e84ab86..1476e877b2dc 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -1936,6 +1936,7 @@ static void sev_migrate_from(struct kvm *dst_kvm, struct kvm *src_kvm)
        dst->enc_context_owner = src->enc_context_owner;
        dst->es_active = src->es_active;
        dst->vmsa_features = src->vmsa_features;
+       memcpy(&dst->have_run_cpus, &src->have_run_cpus, sizeof(src->have_run_cpus));
 
        src->asid = 0;
        src->active = false;
@@ -1943,6 +1944,7 @@ static void sev_migrate_from(struct kvm *dst_kvm, struct kvm *src_kvm)
        src->pages_locked = 0;
        src->enc_context_owner = NULL;
        src->es_active = false;
+       memset(&src->have_run_cpus, 0, sizeof(src->have_run_cpus));
 
        list_cut_before(&dst->regions_list, &src->regions_list, &src->regions_list);


* Re: [BUG] NULL pointer dereference in sev_writeback_caches during KVM SEV migration kselftest on AMD platform
  2025-07-14 15:14       ` Sean Christopherson
@ 2025-07-14 16:50         ` Sean Christopherson
  2025-07-14 22:17           ` Sean Christopherson
  0 siblings, 1 reply; 8+ messages in thread
From: Sean Christopherson @ 2025-07-14 16:50 UTC (permalink / raw)
  To: Zheyun Shen; +Cc: Srikanth Aithal, linux-next, kvm, linux-kernel

On Mon, Jul 14, 2025, Sean Christopherson wrote:
> On Mon, Jul 14, 2025, Zheyun Shen wrote:
> I think this is the fix, testing now...
> 
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index 95668e84ab86..1476e877b2dc 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -1936,6 +1936,7 @@ static void sev_migrate_from(struct kvm *dst_kvm, struct kvm *src_kvm)
>         dst->enc_context_owner = src->enc_context_owner;
>         dst->es_active = src->es_active;
>         dst->vmsa_features = src->vmsa_features;
> +       memcpy(&dst->have_run_cpus, &src->have_run_cpus, sizeof(src->have_run_cpus));
>  
>         src->asid = 0;
>         src->active = false;
> @@ -1943,6 +1944,7 @@ static void sev_migrate_from(struct kvm *dst_kvm, struct kvm *src_kvm)
>         src->pages_locked = 0;
>         src->enc_context_owner = NULL;
>         src->es_active = false;
> +       memset(&src->have_run_cpus, 0, sizeof(src->have_run_cpus));
>  
>         list_cut_before(&dst->regions_list, &src->regions_list, &src->regions_list);

Gah, that's neither sufficient nor correct.  I was thinking KVM_VM_DEAD would guard
accesses to the bitmask, but that only handles the KVM_RUN path.  And we don't
want to skip the WBINVD when tearing down the source, because nothing guarantees
the destination has pinned all of the source's memory.

And conversely, I don't think KVM needs to copy over the mask itself.  If a CPU
was used for the source VM but not the destination VM, then it can only have
cached memory that was accessible to the source VM.  And a CPU that was run in
the source and is also used by the destination is no different from a CPU
that was run in the destination only.

So as much as I want to avoid allocating another cpumask (ugh), it's the right
thing to do.  And practically speaking, I doubt many real world users of SEV will
be using MAXSMP, i.e. the allocations don't exist anyways.
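
For anyone following along: without CONFIG_CPUMASK_OFFSTACK the "allocation"
is free, which is why !MAXSMP configs never see a failure here.  Paraphrasing
the stock <linux/cpumask.h> definition:

static inline bool zalloc_cpumask_var(cpumask_var_t *mask, gfp_t flags)
{
	/* Inline storage: just zero it; there is no heap allocation to fail. */
	cpumask_clear(*mask);
	return true;
}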

Unless someone objects and/or has a better idea, I'll squash this:

diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 95668e84ab86..e39726d258b8 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -2072,6 +2072,17 @@ int sev_vm_move_enc_context_from(struct kvm *kvm, unsigned int source_fd)
        if (ret)
                goto out_source_vcpu;
 
+       /*
+        * Allocate a new have_run_cpus for the destination, i.e. don't copy
+        * the set of CPUs from the source.  If a CPU was used to run a vCPU in
+        * the source VM but is never used for the destination VM, then the CPU
+        * can only have cached memory that was accessible to the source VM.
+        */
+       if (!zalloc_cpumask_var(&dst_sev->have_run_cpus, GFP_KERNEL_ACCOUNT)) {
+               ret = -ENOMEM;
+               goto out_source_vcpu;
+       }
+
        sev_migrate_from(kvm, source_kvm);
        kvm_vm_dead(source_kvm);
        cg_cleanup_sev = src_sev;
@@ -2771,13 +2782,18 @@ int sev_vm_copy_enc_context_from(struct kvm *kvm, unsigned int source_fd)
                goto e_unlock;
        }
 
+       mirror_sev = to_kvm_sev_info(kvm);
+       if (!zalloc_cpumask_var(&mirror_sev->have_run_cpus, GFP_KERNEL_ACCOUNT)) {
+               ret = -ENOMEM;
+               goto e_unlock;
+       }
+
        /*
         * The mirror kvm holds an enc_context_owner ref so its asid can't
         * disappear until we're done with it
         */
        source_sev = to_kvm_sev_info(source_kvm);
        kvm_get_kvm(source_kvm);
-       mirror_sev = to_kvm_sev_info(kvm);
        list_add_tail(&mirror_sev->mirror_entry, &source_sev->mirror_vms);
 
        /* Set enc_context_owner and copy its encryption context over */


* Re: [BUG] NULL pointer dereference in sev_writeback_caches during KVM SEV migration kselftest on AMD platform
  2025-07-14 16:50         ` Sean Christopherson
@ 2025-07-14 22:17           ` Sean Christopherson
  2025-07-15  6:37             ` Aithal, Srikanth
  0 siblings, 1 reply; 8+ messages in thread
From: Sean Christopherson @ 2025-07-14 22:17 UTC (permalink / raw)
  To: Zheyun Shen; +Cc: Srikanth Aithal, linux-next, kvm, linux-kernel

On Mon, Jul 14, 2025, Sean Christopherson wrote:
> So as much as I want to avoid allocating another cpumask (ugh), it's the right
> thing to do.  And practically speaking, I doubt many real world users of SEV will
> be using MAXSMP, i.e. the allocations don't exist anyways.
> 
> Unless someone objects and/or has a better idea, I'll squash this:
> 
..

This isn't quite right either, because sev_vm_destroy() won't free the cpumask
for mirror VMs.

Aha!  And KVM will also unnecessarily leak have_run_cpus if SNP decommission
fails (though that should be an extremely rare error scenario).

KVM is guaranteed to have blasted WBINVD before reaching sev_vm_destroy() (see
commit 7e00013bd339 "KVM: SVM: Remove wbinvd in sev_vm_destroy()"), so unless I'm
missing something, KVM can simply free have_run_cpus at the start of sev_vm_destroy().

Ooh, side topic!  The fact that sev_vm_destroy() wasn't blasting WBINVD would
have been a bug if not for kvm_arch_guest_memory_reclaimed() and
kvm_arch_gmem_invalidate() taking care of mirror VMs.

New hash for the patch:

  KVM: SVM: Flush cache only on CPUs running SEV guest
  https://github.com/kvm-x86/linux/commit/6f38f8c57464

And the full contexts of what I force-pushed:

--
From: Zheyun Shen <szy0127@sjtu.edu.cn>
Date: Thu, 22 May 2025 16:37:32 -0700
Subject: [PATCH] KVM: SVM: Flush cache only on CPUs running SEV guest

On AMD CPUs that do not ensure cache consistency, each memory page
reclamation in an SEV guest triggers a call to do WBNOINVD/WBINVD on all
CPUs, thereby affecting the performance of other programs on the host.

Typically, an AMD server may have 128 cores or more, while the SEV guest
might only utilize 8 of these cores. Meanwhile, the host can use qemu-affinity
to bind these 8 vCPUs to specific physical CPUs.

Therefore, keeping a record of the physical core numbers each time a vCPU
runs can help avoid flushing the cache for all CPUs every time.

Take care to allocate the cpumask used to track which CPUs have run a
vCPU when copying or moving an "encryption context", as nothing guarantees
memory in a mirror VM is a strict subset of the ASID owner, and the
destination VM for intrahost migration needs to maintain its own set of
CPUs.  E.g. for intrahost migration, if a CPU was used for the source VM
but not the destination VM, then it can only have cached memory that was
accessible to the source VM.  And a CPU that was run in the source and is
also used by the destination is no different from a CPU that was run in
the destination only.

Note, KVM is guaranteed to flush caches prior to sev_vm_destroy(), thanks
to kvm_arch_guest_memory_reclaimed() for SEV and SEV-ES, and
kvm_arch_gmem_invalidate() for SEV-SNP.  I.e. it's safe to free the
cpumask prior to unregistering encrypted regions and freeing the ASID.

Cc: Srikanth Aithal <sraithal@amd.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Zheyun Shen <szy0127@sjtu.edu.cn>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250522233733.3176144-9-seanjc@google.com
Link: https://lore.kernel.org/all/935a82e3-f7ad-47d7-aaaf-f3d2b62ed768@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/sev.c | 71 ++++++++++++++++++++++++++++++++++++------
 arch/x86/kvm/svm/svm.h |  1 +
 2 files changed, 63 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index ed39f8a4d9df..a62cd27a4f45 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -447,7 +447,12 @@ static int __sev_guest_init(struct kvm *kvm, struct kvm_sev_cmd *argp,
 	init_args.probe = false;
 	ret = sev_platform_init(&init_args);
 	if (ret)
-		goto e_free;
+		goto e_free_asid;
+
+	if (!zalloc_cpumask_var(&sev->have_run_cpus, GFP_KERNEL_ACCOUNT)) {
+		ret = -ENOMEM;
+		goto e_free_asid;
+	}
 
 	/* This needs to happen after SEV/SNP firmware initialization. */
 	if (vm_type == KVM_X86_SNP_VM) {
@@ -465,6 +470,8 @@ static int __sev_guest_init(struct kvm *kvm, struct kvm_sev_cmd *argp,
 	return 0;
 
 e_free:
+	free_cpumask_var(sev->have_run_cpus);
+e_free_asid:
 	argp->error = init_args.error;
 	sev_asid_free(sev);
 	sev->asid = 0;
@@ -709,16 +716,31 @@ static void sev_clflush_pages(struct page *pages[], unsigned long npages)
 	}
 }
 
-static void sev_writeback_caches(void)
+static void sev_writeback_caches(struct kvm *kvm)
 {
+	/*
+	 * Note, the caller is responsible for ensuring correctness if the mask
+	 * can be modified, e.g. if a CPU could be doing VMRUN.
+	 */
+	if (cpumask_empty(to_kvm_sev_info(kvm)->have_run_cpus))
+		return;
+
 	/*
 	 * Ensure that all dirty guest tagged cache entries are written back
 	 * before releasing the pages back to the system for use.  CLFLUSH will
 	 * not do this without SME_COHERENT, and flushing many cache lines
 	 * individually is slower than blasting WBINVD for large VMs, so issue
-	 * WBNOINVD (or WBINVD if the "no invalidate" variant is unsupported).
+	 * WBNOINVD (or WBINVD if the "no invalidate" variant is unsupported)
+	 * on CPUs that have done VMRUN, i.e. may have dirtied data using the
+	 * VM's ASID.
+	 *
+	 * For simplicity, never remove CPUs from the bitmap.  Ideally, KVM
+	 * would clear the mask when flushing caches, but doing so requires
+	 * serializing multiple calls and having responding CPUs (to the IPI)
+	 * mark themselves as still running if they are running (or about to
+	 * run) a vCPU for the VM.
 	 */
-	wbnoinvd_on_all_cpus();
+	wbnoinvd_on_cpus_mask(to_kvm_sev_info(kvm)->have_run_cpus);
 }
 
 static unsigned long get_num_contig_pages(unsigned long idx,
@@ -2046,6 +2068,17 @@ int sev_vm_move_enc_context_from(struct kvm *kvm, unsigned int source_fd)
 	if (ret)
 		goto out_source_vcpu;
 
+	/*
+	 * Allocate a new have_run_cpus for the destination, i.e. don't copy
+	 * the set of CPUs from the source.  If a CPU was used to run a vCPU in
+	 * the source VM but is never used for the destination VM, then the CPU
+	 * can only have cached memory that was accessible to the source VM.
+	 */
+	if (!zalloc_cpumask_var(&dst_sev->have_run_cpus, GFP_KERNEL_ACCOUNT)) {
+		ret = -ENOMEM;
+		goto out_source_vcpu;
+	}
+
 	sev_migrate_from(kvm, source_kvm);
 	kvm_vm_dead(source_kvm);
 	cg_cleanup_sev = src_sev;
@@ -2707,7 +2740,7 @@ int sev_mem_enc_unregister_region(struct kvm *kvm,
 		goto failed;
 	}
 
-	sev_writeback_caches();
+	sev_writeback_caches(kvm);
 
 	__unregister_enc_region_locked(kvm, region);
 
@@ -2749,13 +2782,18 @@ int sev_vm_copy_enc_context_from(struct kvm *kvm, unsigned int source_fd)
 		goto e_unlock;
 	}
 
+	mirror_sev = to_kvm_sev_info(kvm);
+	if (!zalloc_cpumask_var(&mirror_sev->have_run_cpus, GFP_KERNEL_ACCOUNT)) {
+		ret = -ENOMEM;
+		goto e_unlock;
+	}
+
 	/*
 	 * The mirror kvm holds an enc_context_owner ref so its asid can't
 	 * disappear until we're done with it
 	 */
 	source_sev = to_kvm_sev_info(source_kvm);
 	kvm_get_kvm(source_kvm);
-	mirror_sev = to_kvm_sev_info(kvm);
 	list_add_tail(&mirror_sev->mirror_entry, &source_sev->mirror_vms);
 
 	/* Set enc_context_owner and copy its encryption context over */
@@ -2817,7 +2855,13 @@ void sev_vm_destroy(struct kvm *kvm)
 
 	WARN_ON(!list_empty(&sev->mirror_vms));
 
-	/* If this is a mirror_kvm release the enc_context_owner and skip sev cleanup */
+	free_cpumask_var(sev->have_run_cpus);
+
+	/*
+	 * If this is a mirror VM, remove it from the owner's list of mirrors
+	 * and skip ASID cleanup (the ASID is tied to the lifetime of the owner).
+	 * Note, mirror VMs don't support registering encrypted regions.
+	 */
 	if (is_mirroring_enc_context(kvm)) {
 		struct kvm *owner_kvm = sev->enc_context_owner;
 
@@ -3106,7 +3150,7 @@ static void sev_flush_encrypted_page(struct kvm_vcpu *vcpu, void *va)
 	return;
 
 do_sev_writeback_caches:
-	sev_writeback_caches();
+	sev_writeback_caches(vcpu->kvm);
 }
 
 void sev_guest_memory_reclaimed(struct kvm *kvm)
@@ -3119,7 +3163,7 @@ void sev_guest_memory_reclaimed(struct kvm *kvm)
 	if (!sev_guest(kvm) || sev_snp_guest(kvm))
 		return;
 
-	sev_writeback_caches();
+	sev_writeback_caches(kvm);
 }
 
 void sev_free_vcpu(struct kvm_vcpu *vcpu)
@@ -3451,6 +3495,15 @@ int pre_sev_run(struct vcpu_svm *svm, int cpu)
 	if (sev_es_guest(kvm) && !VALID_PAGE(svm->vmcb->control.vmsa_pa))
 		return -EINVAL;
 
+	/*
+	 * To optimize cache flushes when memory is reclaimed from an SEV VM,
+	 * track physical CPUs that enter the guest for SEV VMs and thus can
+	 * have encrypted, dirty data in the cache, and flush caches only for
+	 * CPUs that have entered the guest.
+	 */
+	if (!cpumask_test_cpu(cpu, to_kvm_sev_info(kvm)->have_run_cpus))
+		cpumask_set_cpu(cpu, to_kvm_sev_info(kvm)->have_run_cpus);
+
 	/* Assign the asid allocated with this SEV guest */
 	svm->asid = asid;
 
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index e6f3c6a153a0..a7c6f07260cf 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -113,6 +113,7 @@ struct kvm_sev_info {
 	void *guest_req_buf;    /* Bounce buffer for SNP Guest Request input */
 	void *guest_resp_buf;   /* Bounce buffer for SNP Guest Request output */
 	struct mutex guest_req_mutex; /* Must acquire before using bounce buffers */
+	cpumask_var_t have_run_cpus; /* CPUs that have done VMRUN for this VM. */
 };
 
 #define SEV_POLICY_NODBG	BIT_ULL(0)

base-commit: a77896eea33db6fe393d1db1380e2e52f74546a2
--


* Re: [BUG] NULL pointer dereference in sev_writeback_caches during KVM SEV migration kselftest on AMD platform
  2025-07-14 22:17           ` Sean Christopherson
@ 2025-07-15  6:37             ` Aithal, Srikanth
  0 siblings, 0 replies; 8+ messages in thread
From: Aithal, Srikanth @ 2025-07-15  6:37 UTC (permalink / raw)
  To: Sean Christopherson, Zheyun Shen; +Cc: linux-next, kvm, linux-kernel

On 7/15/2025 3:47 AM, Sean Christopherson wrote:
> On Mon, Jul 14, 2025, Sean Christopherson wrote:
>> So as much as I want to avoid allocating another cpumask (ugh), it's the right
>> thing to do.  And practically speaking, I doubt many real-world users of SEV will
>> be using MAXSMP, i.e. the allocations don't exist anyway.
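Context for the MAXSMP remark: on x86, CONFIG_CPUMASK_OFFSTACK is normally
selected only by MAXSMP. Without it, cpumask_var_t is an array embedded in
the enclosing struct, and zalloc_cpumask_var() simply zeroes it and returns
true, so the "allocation" cannot fail and costs nothing. A simplified
sketch of the two shapes (paraphrased from include/linux/cpumask.h, not
verbatim):

#ifdef CONFIG_CPUMASK_OFFSTACK			/* selected by MAXSMP */
typedef struct cpumask *cpumask_var_t;		/* heap mask; zalloc can fail */
#else
typedef struct cpumask cpumask_var_t[1];	/* embedded; zalloc always succeeds */
#endif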
>>
>> Unless someone objects and/or has a better idea, I'll squash this:
>>
>> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
>> index 95668e84ab86..e39726d258b8 100644
>> --- a/arch/x86/kvm/svm/sev.c
>> +++ b/arch/x86/kvm/svm/sev.c
>> @@ -2072,6 +2072,17 @@ int sev_vm_move_enc_context_from(struct kvm *kvm, unsigned int source_fd)
>>          if (ret)
>>                  goto out_source_vcpu;
>>   
>> +       /*
>> +        * Allocate a new have_run_cpus for the destination, i.e. don't copy
>> +        * the set of CPUs from the source.  If a CPU was used to run a vCPU in
>> +        * the source VM but is never used for the destination VM, then the CPU
>> +        * can only have cached memory that was accessible to the source VM.
>> +        */
>> +       if (!zalloc_cpumask_var(&dst_sev->have_run_cpus, GFP_KERNEL_ACCOUNT)) {
>> +               ret = -ENOMEM;
>> +               goto out_source_vcpu;
>> +       }
>> +
>>          sev_migrate_from(kvm, source_kvm);
>>          kvm_vm_dead(source_kvm);
>>          cg_cleanup_sev = src_sev;
>> @@ -2771,13 +2782,18 @@ int sev_vm_copy_enc_context_from(struct kvm *kvm, unsigned int source_fd)
>>                  goto e_unlock;
>>          }
>>   
>> +       mirror_sev = to_kvm_sev_info(kvm);
>> +       if (!zalloc_cpumask_var(&mirror_sev->have_run_cpus, GFP_KERNEL_ACCOUNT)) {
>> +               ret = -ENOMEM;
>> +               goto e_unlock;
>> +       }
>> +
>>          /*
>>           * The mirror kvm holds an enc_context_owner ref so its asid can't
>>           * disappear until we're done with it
>>           */
>>          source_sev = to_kvm_sev_info(source_kvm);
>>          kvm_get_kvm(source_kvm);
>> -       mirror_sev = to_kvm_sev_info(kvm);
>>          list_add_tail(&mirror_sev->mirror_entry, &source_sev->mirror_vms);
>>   
>>          /* Set enc_context_owner and copy its encryption context over */
> 
> This isn't quite right either, because sev_vm_destroy() won't free the cpumask
> for mirror VMs.
> 
> Aha!  And KVM will also unnecessarily leak have_run_cpus if SNP decommission
> fails (though that should be an extremely rare error scenario).
> 
> KVM is guaranteed to have blasted WBINVD before reaching sev_vm_destroy() (see
> commit 7e00013bd339 "KVM: SVM: Remove wbinvd in sev_vm_destroy()"), so unless I'm
> missing something, KVM can simply free have_run_cpus at the start of sev_vm_destroy().
> 
> Ooh, side topic!  The fact that sev_vm_destroy() wasn't blasting WBINVD would
> have been a bug if not for kvm_arch_guest_memory_reclaimed() and
> kvm_arch_gmem_invalidate() taking care of mirror VMs.
> 
> New hash for the patch:
> 
>    KVM: SVM: Flush cache only on CPUs running SEV guest
>    https://github.com/kvm-x86/linux/commit/6f38f8c57464


The kselftest sev_migrate_tests passes with the current 
https://github.com/kvm-x86/linux/tree/next (head 2a046f6), which has 
commit 6f38f8c.


Reported-by: Srikanth Aithal <sraithal@amd.com>
Tested-by: Srikanth Aithal <sraithal@amd.com>


> 
> And the full contexts of what I force-pushed:
> 
> --
> From: Zheyun Shen <szy0127@sjtu.edu.cn>
> Date: Thu, 22 May 2025 16:37:32 -0700
> Subject: [PATCH] KVM: SVM: Flush cache only on CPUs running SEV guest
> 
> On AMD CPUs that do not ensure cache consistency, each memory page
> reclamation in an SEV guest triggers a call to WBNOINVD/WBINVD on all
> CPUs, thereby affecting the performance of other programs on the host.
> 
> Typically, an AMD server may have 128 cores or more, while the SEV guest
> might only utilize 8 of these cores. Meanwhile, the host can use qemu-affinity
> to bind these 8 vCPUs to specific physical CPUs.
> 
> Therefore, keeping a record of the physical core numbers each time a vCPU
> runs can help avoid flushing the cache for all CPUs every time.
> 
> Take care to allocate the cpumask used to track which CPUs have run a
> vCPU when copying or moving an "encryption context", as nothing guarantees
> memory in a mirror VM is a strict subset of the ASID owner, and the
> destination VM for intrahost migration needs to maintain its own set of
> CPUs.  E.g. for intrahost migration, if a CPU was used for the source VM
> but not the destination VM, then it can only have cached memory that was
> accessible to the source VM.  And a CPU that was run in the source and is
> also used by the destination is no different from a CPU that was run in
> the destination only.
> 
> Note, KVM is guaranteed to flush caches prior to sev_vm_destroy(),
> thanks to kvm_arch_guest_memory_reclaimed() for SEV and SEV-ES, and
> kvm_arch_gmem_invalidate() for SEV-SNP.  I.e. it's safe to free the
> cpumask prior to unregistering encrypted regions and freeing the ASID.
> 
> Cc: Srikanth Aithal <sraithal@amd.com>
> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
> Signed-off-by: Zheyun Shen <szy0127@sjtu.edu.cn>
> Co-developed-by: Sean Christopherson <seanjc@google.com>
> Link: https://lore.kernel.org/r/20250522233733.3176144-9-seanjc@google.com
> Link: https://lore.kernel.org/all/935a82e3-f7ad-47d7-aaaf-f3d2b62ed768@amd.com
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>   arch/x86/kvm/svm/sev.c | 71 ++++++++++++++++++++++++++++++++++++------
>   arch/x86/kvm/svm/svm.h |  1 +
>   2 files changed, 63 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index ed39f8a4d9df..a62cd27a4f45 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -447,7 +447,12 @@ static int __sev_guest_init(struct kvm *kvm, struct kvm_sev_cmd *argp,
>   	init_args.probe = false;
>   	ret = sev_platform_init(&init_args);
>   	if (ret)
> -		goto e_free;
> +		goto e_free_asid;
> +
> +	if (!zalloc_cpumask_var(&sev->have_run_cpus, GFP_KERNEL_ACCOUNT)) {
> +		ret = -ENOMEM;
> +		goto e_free_asid;
> +	}
>   
>   	/* This needs to happen after SEV/SNP firmware initialization. */
>   	if (vm_type == KVM_X86_SNP_VM) {
> @@ -465,6 +470,8 @@ static int __sev_guest_init(struct kvm *kvm, struct kvm_sev_cmd *argp,
>   	return 0;
>   
>   e_free:
> +	free_cpumask_var(sev->have_run_cpus);
> +e_free_asid:
>   	argp->error = init_args.error;
>   	sev_asid_free(sev);
>   	sev->asid = 0;
> @@ -709,16 +716,31 @@ static void sev_clflush_pages(struct page *pages[], unsigned long npages)
>   	}
>   }
>   
> -static void sev_writeback_caches(void)
> +static void sev_writeback_caches(struct kvm *kvm)
>   {
> +	/*
> +	 * Note, the caller is responsible for ensuring correctness if the mask
> +	 * can be modified, e.g. if a CPU could be doing VMRUN.
> +	 */
> +	if (cpumask_empty(to_kvm_sev_info(kvm)->have_run_cpus))
> +		return;
> +
>   	/*
>   	 * Ensure that all dirty guest tagged cache entries are written back
>   	 * before releasing the pages back to the system for use.  CLFLUSH will
>   	 * not do this without SME_COHERENT, and flushing many cache lines
>   	 * individually is slower than blasting WBINVD for large VMs, so issue
> -	 * WBNOINVD (or WBINVD if the "no invalidate" variant is unsupported).
> +	 * WBNOINVD (or WBINVD if the "no invalidate" variant is unsupported)
> +	 * on CPUs that have done VMRUN, i.e. may have dirtied data using the
> +	 * VM's ASID.
> +	 *
> +	 * For simplicity, never remove CPUs from the bitmap.  Ideally, KVM
> +	 * would clear the mask when flushing caches, but doing so requires
> +	 * serializing multiple calls and having the CPUs that respond to the
> +	 * IPI mark themselves as still running if they are running (or about to
> +	 * run) a vCPU for the VM.
>   	 */
> -	wbnoinvd_on_all_cpus();
> +	wbnoinvd_on_cpus_mask(to_kvm_sev_info(kvm)->have_run_cpus);
>   }
>   
>   static unsigned long get_num_contig_pages(unsigned long idx,
> @@ -2046,6 +2068,17 @@ int sev_vm_move_enc_context_from(struct kvm *kvm, unsigned int source_fd)
>   	if (ret)
>   		goto out_source_vcpu;
>   
> +	/*
> +	 * Allocate a new have_run_cpus for the destination, i.e. don't copy
> +	 * the set of CPUs from the source.  If a CPU was used to run a vCPU in
> +	 * the source VM but is never used for the destination VM, then the CPU
> +	 * can only have cached memory that was accessible to the source VM.
> +	 */
> +	if (!zalloc_cpumask_var(&dst_sev->have_run_cpus, GFP_KERNEL_ACCOUNT)) {
> +		ret = -ENOMEM;
> +		goto out_source_vcpu;
> +	}
> +
>   	sev_migrate_from(kvm, source_kvm);
>   	kvm_vm_dead(source_kvm);
>   	cg_cleanup_sev = src_sev;
> @@ -2707,7 +2740,7 @@ int sev_mem_enc_unregister_region(struct kvm *kvm,
>   		goto failed;
>   	}
>   
> -	sev_writeback_caches();
> +	sev_writeback_caches(kvm);
>   
>   	__unregister_enc_region_locked(kvm, region);
>   
> @@ -2749,13 +2782,18 @@ int sev_vm_copy_enc_context_from(struct kvm *kvm, unsigned int source_fd)
>   		goto e_unlock;
>   	}
>   
> +	mirror_sev = to_kvm_sev_info(kvm);
> +	if (!zalloc_cpumask_var(&mirror_sev->have_run_cpus, GFP_KERNEL_ACCOUNT)) {
> +		ret = -ENOMEM;
> +		goto e_unlock;
> +	}
> +
>   	/*
>   	 * The mirror kvm holds an enc_context_owner ref so its asid can't
>   	 * disappear until we're done with it
>   	 */
>   	source_sev = to_kvm_sev_info(source_kvm);
>   	kvm_get_kvm(source_kvm);
> -	mirror_sev = to_kvm_sev_info(kvm);
>   	list_add_tail(&mirror_sev->mirror_entry, &source_sev->mirror_vms);
>   
>   	/* Set enc_context_owner and copy its encryption context over */
> @@ -2817,7 +2855,13 @@ void sev_vm_destroy(struct kvm *kvm)
>   
>   	WARN_ON(!list_empty(&sev->mirror_vms));
>   
> -	/* If this is a mirror_kvm release the enc_context_owner and skip sev cleanup */
> +	free_cpumask_var(sev->have_run_cpus);
> +
> +	/*
> +	 * If this is a mirror VM, remove it from the owner's list of mirrors
> +	 * and skip ASID cleanup (the ASID is tied to the lifetime of the owner).
> +	 * Note, mirror VMs don't support registering encrypted regions.
> +	 */
>   	if (is_mirroring_enc_context(kvm)) {
>   		struct kvm *owner_kvm = sev->enc_context_owner;
>   
> @@ -3106,7 +3150,7 @@ static void sev_flush_encrypted_page(struct kvm_vcpu *vcpu, void *va)
>   	return;
>   
>   do_sev_writeback_caches:
> -	sev_writeback_caches();
> +	sev_writeback_caches(vcpu->kvm);
>   }
>   
>   void sev_guest_memory_reclaimed(struct kvm *kvm)
> @@ -3119,7 +3163,7 @@ void sev_guest_memory_reclaimed(struct kvm *kvm)
>   	if (!sev_guest(kvm) || sev_snp_guest(kvm))
>   		return;
>   
> -	sev_writeback_caches();
> +	sev_writeback_caches(kvm);
>   }
>   
>   void sev_free_vcpu(struct kvm_vcpu *vcpu)
> @@ -3451,6 +3495,15 @@ int pre_sev_run(struct vcpu_svm *svm, int cpu)
>   	if (sev_es_guest(kvm) && !VALID_PAGE(svm->vmcb->control.vmsa_pa))
>   		return -EINVAL;
>   
> +	/*
> +	 * To optimize cache flushes when memory is reclaimed from an SEV VM,
> +	 * track physical CPUs that enter the guest for SEV VMs and thus can
> +	 * have encrypted, dirty data in the cache, and flush caches only for
> +	 * CPUs that have entered the guest.
> +	 */
> +	if (!cpumask_test_cpu(cpu, to_kvm_sev_info(kvm)->have_run_cpus))
> +		cpumask_set_cpu(cpu, to_kvm_sev_info(kvm)->have_run_cpus);
> +
>   	/* Assign the asid allocated with this SEV guest */
>   	svm->asid = asid;
>   
> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> index e6f3c6a153a0..a7c6f07260cf 100644
> --- a/arch/x86/kvm/svm/svm.h
> +++ b/arch/x86/kvm/svm/svm.h
> @@ -113,6 +113,7 @@ struct kvm_sev_info {
>   	void *guest_req_buf;    /* Bounce buffer for SNP Guest Request input */
>   	void *guest_resp_buf;   /* Bounce buffer for SNP Guest Request output */
>   	struct mutex guest_req_mutex; /* Must acquire before using bounce buffers */
> +	cpumask_var_t have_run_cpus; /* CPUs that have done VMRUN for this VM. */
>   };
>   
>   #define SEV_POLICY_NODBG	BIT_ULL(0)
> 
> base-commit: a77896eea33db6fe393d1db1380e2e52f74546a2
> --


^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2025-07-15  6:37 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-07-14  5:21 [BUG] NULL pointer dereference in sev_writeback_caches during KVM SEV migration kselftest on AMD platform Aithal, Srikanth
2025-07-14 14:12 ` Zheyun Shen
2025-07-14 14:48   ` Sean Christopherson
2025-07-14 14:56     ` Zheyun Shen
2025-07-14 15:14       ` Sean Christopherson
2025-07-14 16:50         ` Sean Christopherson
2025-07-14 22:17           ` Sean Christopherson
2025-07-15  6:37             ` Aithal, Srikanth

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).