public inbox for linux-pci@vger.kernel.org
From: Yazen Ghannam <yazen.ghannam@amd.com>
To: Borislav Petkov <bp@alien8.de>
Cc: x86@kernel.org, "Tony Luck" <tony.luck@intel.com>,
	"Mario Limonciello" <mario.limonciello@amd.com>,
	"Bjorn Helgaas" <bhelgaas@google.com>,
	"Jean Delvare" <jdelvare@suse.com>,
	"Guenter Roeck" <linux@roeck-us.net>,
	"Clemens Ladisch" <clemens@ladisch.de>,
	"Shyam Sundar S K" <Shyam-sundar.S-k@amd.com>,
	"Hans de Goede" <hdegoede@redhat.com>,
	"Ilpo Järvinen" <ilpo.jarvinen@linux.intel.com>,
	"Naveen Krishna Chatradhi" <naveenkrishna.chatradhi@amd.com>,
	"Suma Hegde" <suma.hegde@amd.com>,
	linux-kernel@vger.kernel.org, linux-edac@vger.kernel.org,
	linux-pci@vger.kernel.org, linux-hwmon@vger.kernel.org,
	platform-driver-x86@vger.kernel.org
Subject: Re: [PATCH v2 00/16] AMD NB and SMN rework
Date: Mon, 6 Jan 2025 11:31:04 -0500
Message-ID: <20250106163104.GA664169@yaz-khff2.amd.com>
In-Reply-To: <20250106153839.GA631754@yaz-khff2.amd.com>

On Mon, Jan 06, 2025 at 10:38:45AM -0500, Yazen Ghannam wrote:
> On Fri, Jan 03, 2025 at 10:49:25PM +0100, Borislav Petkov wrote:
> > On Fri, Dec 06, 2024 at 04:11:53PM +0000, Yazen Ghannam wrote:
> > > Hi all,
> > > 
> > > The theme of this set is decoupling the "AMD node" concept from the
> > > legacy northbridge support.
> > > 
> > > Additionally, AMD System Management Network (SMN) access code is
> > > decoupled and expanded too.
> > > 
> > > Patches 1-3 begin reducing the scope of AMD_NB.
> > > 
> > > Patches 4-9 begin moving generic AMD node support out of AMD_NB.
> > > 
> > > Patches 10-13 move SMN support out of AMD_NB and do some refactoring.
> > > 
> > > Patch 14 has HSMP reuse SMN functionality.
> > > 
> > > Patches 15-16 address userspace access to SMN.
> > 
> > So I took the first patch, and booting the first 13 (with the intention of
> > queueing them while the remaining three are still being discussed) causes the
> > below in my guest.
> > 
> > .config is attached, I've pushed the branch here too, if you wanna test with
> > it:
> > 
> > https://git.kernel.org/pub/scm/linux/kernel/git/bp/bp.git/log/?h=tip-x86-misc
> > 
> > [    0.897060] cirrus 0000:00:01.0: [drm] fb0: cirrusdrmfb frame buffer device
> > [    0.900310] BUG: kernel NULL pointer dereference, address: 00000000000000c4
> > [    0.902551] #PF: supervisor read access in kernel mode
> > [    0.904096] #PF: error_code(0x0000) - not-present page
> > [    0.904268] PGD 0 P4D 0 
> > [    0.904268] Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
> > [    0.904268] CPU: 0 UID: 0 PID: 20 Comm: cpuhp/0 Not tainted 6.13.0-rc1+ #1
> > [    0.904268] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 2023.11-8 02/21/2024
> > [    0.904268] RIP: 0010:pci_read_config_dword+0x9/0x40
> > [    0.904268] Code: 00 00 e9 8a f9 57 00 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 <8b> 87 c4 00 00 00 48 89 d1 83 f8 03 74 10 8b 47 38 48 8b 7f 10 89
> > [    0.904268] RSP: 0018:ffffc9000012fcd8 EFLAGS: 00010246
> > [    0.904268] RAX: 0000000000000000 RBX: ffff88800d296640 RCX: 000000000000003f
> > [    0.904268] RDX: ffffc9000012fce4 RSI: 00000000000001c4 RDI: 0000000000000000
> > [    0.904268] RBP: ffffc9000012fd60 R08: 0000000000000040 R09: 0000000000000010
> > [    0.904268] R10: ffff88800daa1eb0 R11: fffffffffff8dc6f R12: 0000000040000163
> > [    0.904268] R13: ffffc9000012fd60 R14: 0000000000000000 R15: ffff88807d62fc90
> > [    0.904268] FS:  0000000000000000(0000) GS:ffff88807d600000(0000) knlGS:0000000000000000
> > [    0.904268] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [    0.904268] CR2: 00000000000000c4 CR3: 0000000002c1a000 CR4: 00000000003506f0
> > [    0.904268] Call Trace:
> > [    0.904268]  <TASK>
> > [    0.904268]  ? __die+0x31/0x80
> > [    0.904268]  ? page_fault_oops+0x15d/0x4f0
> > [    0.904268]  ? srso_return_thunk+0x5/0x5f
> > [    0.904268]  ? ttwu_queue_wakelist+0xf7/0x100
> > [    0.904268]  ? exc_page_fault+0x78/0x150
> > [    0.904268]  ? asm_exc_page_fault+0x26/0x30
> > [    0.904268]  ? pci_read_config_dword+0x9/0x40
> > [    0.904268]  ? srso_return_thunk+0x5/0x5f
> > [    0.904268]  amd_init_l3_cache.part.0+0x6a/0x110
> > [    0.904268]  cpuid4_cache_lookup_regs+0xcf/0x2a0
> > [    0.904268]  populate_cache_leaves+0x6f/0x530
> > [    0.904268]  ? srso_return_thunk+0x5/0x5f
> > [    0.904268]  ? dl_server_stop+0x2f/0x40
> > [    0.904268]  ? srso_return_thunk+0x5/0x5f
> > [    0.904268]  detect_cache_attributes+0x97/0x330
> > [    0.904268]  ? __pfx_cacheinfo_cpu_online+0x10/0x10
> > [    0.904268]  cacheinfo_cpu_online+0x22/0x250
> > [    0.904268]  ? srso_return_thunk+0x5/0x5f
> > [    0.904268]  ? __pfx_cacheinfo_cpu_online+0x10/0x10
> > [    0.904268]  cpuhp_invoke_callback+0x10f/0x480
> > [    0.904268]  ? try_to_wake_up+0x23b/0x540
> > [    0.904268]  cpuhp_thread_fun+0xd4/0x160
> > [    0.904268]  smpboot_thread_fn+0xdd/0x1f0
> > [    0.904268]  ? __pfx_smpboot_thread_fn+0x10/0x10
> > [    0.904268]  kthread+0xca/0xf0
> > [    0.904268]  ? __pfx_kthread+0x10/0x10
> > [    0.904268]  ret_from_fork+0x50/0x60
> > [    0.904268]  ? __pfx_kthread+0x10/0x10
> > [    0.904268]  ret_from_fork_asm+0x1a/0x30
> > [    0.904268]  </TASK>
> > [    0.904268] Modules linked in:
> > [    0.904268] CR2: 00000000000000c4
> > [    0.904268] ---[ end trace 0000000000000000 ]---
> > [    0.904268] RIP: 0010:pci_read_config_dword+0x9/0x40
> > [    0.904268] Code: 00 00 e9 8a f9 57 00 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 <8b> 87 c4 00 00 00 48 89 d1 83 f8 03 74 10 8b 47 38 48 8b 7f 10 89
> > [    0.988792] RSP: 0018:ffffc9000012fcd8 EFLAGS: 00010246
> > [    0.988792] RAX: 0000000000000000 RBX: ffff88800d296640 RCX: 000000000000003f
> > [    0.988792] RDX: ffffc9000012fce4 RSI: 00000000000001c4 RDI: 0000000000000000
> > [    0.988792] RBP: ffffc9000012fd60 R08: 0000000000000040 R09: 0000000000000010
> > [    0.992761] R10: ffff88800daa1eb0 R11: fffffffffff8dc6f R12: 0000000040000163
> > [    0.992761] R13: ffffc9000012fd60 R14: 0000000000000000 R15: ffff88807d62fc90
> > [    0.992761] FS:  0000000000000000(0000) GS:ffff88807d600000(0000) knlGS:0000000000000000
> > [    0.996772] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [    0.996772] CR2: 00000000000000c4 CR3: 0000000002c1a000 CR4: 00000000003506f0
> > [    0.996772] note: cpuhp/0[20] exited with irqs disabled
> > [    1.680874] tsc: Refined TSC clocksource calibration: 3700.028 MHz
> > [    1.683128] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x6aaae08e541, max_idle_ns: 881590514464 ns
> > [    1.688137] clocksource: Switched to clocksource tsc
> > 
> > 
> 
> Can you please share the guest parameters?
> 

I was able to reproduce it. The patch below seems to fix the issue.

There's a comment in the function saying that this code is not meant for
virtualized environments. Also, the "L3 in Northbridge" design doesn't apply
to Zen systems.
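For reference, X86_FEATURE_HYPERVISOR reflects the "hypervisor present" bit (CPUID leaf 1, ECX bit 31). Hypervisors conventionally set this bit for their guests. The sketch below reads that bit from userspace. It is only an illustration and assumes GCC or Clang on x86, using the compiler's <cpuid.h> helper __get_cpuid(). The function name hypervisor_bit_set() is hypothetical and not a kernel API. A hypervisor is not obliged to set the bit, so it is a convention rather than a guarantee.

```c
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/*
 * Hypothetical helper, not a kernel API.
 * Returns 1 if CPUID.1:ECX[31] (hypervisor present) is set, 0 if it is
 * clear, and -1 if CPUID leaf 1 is unavailable or the target is not x86.
 */
int hypervisor_bit_set(void)
{
#if defined(__x86_64__) || defined(__i386__)
	unsigned int eax, ebx, ecx, edx;

	/* __get_cpuid() returns 0 if the requested leaf is not supported. */
	if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
		return -1;

	/* Bit 31 of ECX is the bit the kernel maps to X86_FEATURE_HYPERVISOR. */
	return (int)((ecx >> 31) & 1u);
#else
	return -1;
#endif
}
```

In a QEMU/KVM guest like the one in the report, this would normally return 1.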

I'll keep looking at this to get a better understanding. My first
thought is that this case was silently handled before because the AMD_NB
code matched on PCI IDs. Those devices wouldn't be exposed to guests, so
the northbridge data structures wouldn't be initialized.

Specifically, I think we now have a non-zero number of northbridges,
since the count comes from topology info rather than from counting
PCI devices.

In any case, I think it's better to have explicit checks.
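The simplified userspace model below makes the suspected failure mode concrete. It rests on the explanation above and on the oops, where RDI = 0 is passed to pci_read_config_dword(). From these it assumes that node_to_amd_nb() now returns a per-node entry whose misc PCI device pointer was never filled in. All struct and function names (fake_pci_dev, fake_nb, fake_nb_info, fake_node_to_nb(), fake_init_l3()) are hypothetical stand-ins, not the kernel's definitions. The model returns -1 instead of actually dereferencing NULL.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct fake_pci_dev {
	unsigned int error_state;
};

struct fake_nb {
	struct fake_pci_dev *misc;	/* NULL if the PCI device was never found */
};

struct fake_nb_info {
	unsigned int num;		/* number of nodes */
	struct fake_nb *nb;		/* array of num entries */
};

/*
 * Bounds-checked lookup: an out-of-range node gives NULL, but an in-range
 * node gives an entry even if its fields were never initialized.
 */
struct fake_nb *fake_node_to_nb(struct fake_nb_info *info, unsigned int node)
{
	return node < info->num ? &info->nb[node] : NULL;
}

/*
 * Returns 0 if the L3 setup is skipped, 1 if it would safely read the
 * misc device, and -1 if it would dereference a NULL misc pointer
 * (the case that oopses).
 */
int fake_init_l3(struct fake_nb_info *info, unsigned int node,
		 int hypervisor, int zen)
{
	struct fake_nb *nb;

	/* Explicit early returns, modeled on the checks added in the patch. */
	if (hypervisor || zen)
		return 0;

	nb = fake_node_to_nb(info, node);
	if (!nb)
		return 0;	/* node count of 0: nothing to set up */

	if (!nb->misc)
		return -1;	/* entry exists, device pointer does not */

	return 1;
}
```

A node count of 0 models the old PCI-based count in a guest. With that count and both flags clear, fake_init_l3() returns 0 and the setup is skipped. A count of 1 with a NULL misc pointer returns -1, which corresponds to the oops. With the hypervisor or Zen flag set, it returns 0 before any lookup happens, which is what the explicit checks are meant to guarantee.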

Thanks,
Yazen

diff --git a/arch/x86/kernel/cpu/cacheinfo.c b/arch/x86/kernel/cpu/cacheinfo.c
index 392d09c936d6..93d993a6a1df 100644
--- a/arch/x86/kernel/cpu/cacheinfo.c
+++ b/arch/x86/kernel/cpu/cacheinfo.c
@@ -595,6 +595,12 @@ static void amd_init_l3_cache(struct _cpuid4_info_regs *this_leaf, int index)
 	if (index < 3)
 		return;
 
+	if (cpu_feature_enabled(X86_FEATURE_HYPERVISOR))
+		return;
+
+	if (cpu_feature_enabled(X86_FEATURE_ZEN))
+		return;
+
 	node = topology_amd_node_id(smp_processor_id());
 	this_leaf->nb = node_to_amd_nb(node);
 	if (this_leaf->nb && !this_leaf->nb->l3_cache.indices)


Thread overview: 33+ messages
2024-12-06 16:11 [PATCH v2 00/16] AMD NB and SMN rework Yazen Ghannam
2024-12-06 16:11 ` [PATCH v2 01/16] x86/mce/amd: Remove shared threshold bank plumbing Yazen Ghannam
2024-12-06 16:11 ` [PATCH v2 02/16] x86/amd_nb: Restrict init function to AMD-based systems Yazen Ghannam
2024-12-06 16:11 ` [PATCH v2 03/16] x86/amd_nb: Clean up early_is_amd_nb() Yazen Ghannam
2024-12-06 16:11 ` [PATCH v2 04/16] x86: Start moving AMD Node functionality out of AMD_NB Yazen Ghannam
2024-12-06 16:11 ` [PATCH v2 05/16] x86/amd_nb: Simplify function 4 search Yazen Ghannam
2024-12-06 16:11 ` [PATCH v2 06/16] x86/amd_nb: Simplify root device search Yazen Ghannam
2024-12-06 16:12 ` [PATCH v2 07/16] x86/amd_nb: Use topology info to get AMD node count Yazen Ghannam
2024-12-06 16:12 ` [PATCH v2 08/16] x86/amd_nb: Simplify function 3 search Yazen Ghannam
2024-12-06 16:12 ` [PATCH v2 09/16] x86/amd_nb, hwmon: (k10temp): Simplify amd_pci_dev_to_node_id() Yazen Ghannam
2024-12-06 16:38   ` Guenter Roeck
2024-12-06 16:12 ` [PATCH v2 10/16] x86/amd_nb: Move SMN access code to a new amd_node driver Yazen Ghannam
2024-12-09 13:35   ` Ilpo Järvinen
2024-12-06 16:12 ` [PATCH v2 11/16] x86/amd_node: Update __amd_smn_rw() error paths Yazen Ghannam
2024-12-06 16:12 ` [PATCH v2 12/16] x86/amd_node: Remove dependency on AMD_NB Yazen Ghannam
2024-12-06 16:12 ` [PATCH v2 13/16] x86/amd_node: Use defines for SMN register offsets Yazen Ghannam
2024-12-06 16:12 ` [PATCH v2 14/16] x86/amd_node, platform/x86/amd/hsmp: Have HSMP use SMN through AMD_NODE Yazen Ghannam
2024-12-09 13:32   ` Ilpo Järvinen
2024-12-11 16:13     ` Yazen Ghannam
2024-12-12 17:27   ` [PATCH v2.1] " Yazen Ghannam
2024-12-12 18:50     ` Ilpo Järvinen
2024-12-12 21:46       ` Yazen Ghannam
2024-12-16 13:57         ` Ilpo Järvinen
2024-12-13 15:22     ` [PATCH v2.2] " Yazen Ghannam
2024-12-14 10:05       ` Borislav Petkov
2024-12-16 18:33         ` Yazen Ghannam
2024-12-24  1:14           ` Suma Hegde
2025-01-02 20:04             ` Yazen Ghannam
2024-12-06 16:12 ` [PATCH v2 15/16] x86/amd_node: Add SMN offsets to exclusive region access Yazen Ghannam
2024-12-06 16:12 ` [PATCH v2 16/16] x86/amd_node: Add support for debugfs access to SMN registers Yazen Ghannam
2025-01-03 21:49 ` [PATCH v2 00/16] AMD NB and SMN rework Borislav Petkov
2025-01-06 15:38   ` Yazen Ghannam
2025-01-06 16:31     ` Yazen Ghannam [this message]